Closed: aivanise closed this issue 3 years ago
Yeah, cloud-init only deals with NetworkManager. We do, however, have daily tests for all our images, which include confirming that the network comes online.
So this suggests an environment which is making NetworkManager unhappy. It's unfortunately known that NetworkManager doesn't deal well with macvlan, so if you're using macvlan, that's going to be a problem. If using normal bridging, then the issue may be with the version of the kernel you're running.
Hi all, network does not work w/o NetworkManager either...
For centos/8/default:
Regards.
I'm using veth and normal bridging, and it was all working until the switch from network-scripts to NetworkManager (https://github.com/lxc/lxc-ci/commit/f951837e5b62c15c9eb482691dc45c12e848afc8) was made. I'm also running the latest 8.4 RHEL kernel, but it is also failing in 8.3.
Can you please point me to how the containers are created in your daily tests, maybe that will give me some clues.
stgraber@castiana:~$ lxc launch images:centos/8/cloud c1
Creating c1
Starting c1
stgraber@castiana:~$ lxc launch images:centos/8 c2
Creating c2
Starting c2
stgraber@castiana:~$ lxc list c1
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| c1 | RUNNING | 10.166.11.32 (eth0) | fd42:4c81:5770:1eaf:216:3eff:fe42:485a (eth0) | CONTAINER | 0 |
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
stgraber@castiana:~$ lxc list c2
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| c2 | RUNNING | 10.166.11.60 (eth0) | fd42:4c81:5770:1eaf:216:3eff:fe47:3dbd (eth0) | CONTAINER | 0 |
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
That's pretty much all we do in our tests.
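For anyone who wants to script the same check, here is a minimal sketch of polling for an address after launch. It is not the project's actual test code: `has_ipv4` and `wait_for_ipv4` are made-up helper names, and the 60 second default timeout is an arbitrary choice. It assumes `lxc list` is given `--format csv -c 4` so that only the IPv4 column is printed.

```shell
# has_ipv4 TEXT: succeed if TEXT contains something shaped like an IPv4 address.
# (Hypothetical helper, not part of LXD.)
has_ipv4() {
    printf '%s\n' "$1" | grep -Eq '([0-9]{1,3}\.){3}[0-9]{1,3}'
}

# wait_for_ipv4 NAME [TIMEOUT_SECONDS]: poll `lxc list` once per second until
# the instance reports an IPv4 address, or give up after the timeout.
# Note: `lxc list NAME` is a filter, so a short name may match more than one instance.
wait_for_ipv4() {
    name="$1"
    timeout="${2:-60}"
    i=0
    while [ "$i" -lt "$timeout" ]; do
        addr="$(lxc list "$name" --format csv -c 4)"
        if has_ipv4 "$addr"; then
            echo "$addr"
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

Something like `lxc launch images:centos/8/cloud c1 && wait_for_ipv4 c1` would then fail when the container never gets an address.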
Is there anything special in the default profile? Host networking is bridged, I presume?
Nothing special, normal lxdbr0 bridge setup.
Hi @stgraber,
I have tried:
:; lxc launch --console --ephemeral images:centos/8/cloud int9049
...
[1422975.484197] cloud-init[530]: Cloud-init v. 20.3-10.el8_4.2 running 'init-local' at Thu, 01 Jul 2021 06:59:13 +0000. Up 2.26 seconds.
[ OK ] Started Initial cloud-init job (pre-networking).
[ OK ] Reached target Network (Pre).
Starting Network Manager...
[ OK ] Started Network Manager.
Starting Network Manager Wait Online...
[ OK ] Reached target Network.
Starting Network Manager Script Dispatcher Service...
[ OK ] Started Network Manager Script Dispatcher Service.
[FAILED] Failed to start Network Manager Wait Online.
See 'systemctl status NetworkManager-wait-online.service' for details.
Starting Activate connection...
Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Activate connection.
[1423036.074977] cloud-init[566]: Cloud-init v. 20.3-10.el8_4.2 running 'init' at Thu, 01 Jul 2021 07:00:14 +0000. Up 63.01 seconds.
[1423036.075096] cloud-init[566]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[1423036.075159] cloud-init[566]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[1423036.075202] cloud-init[566]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[1423036.075252] cloud-init[566]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[1423036.075296] cloud-init[566]: ci-info: | eth0 | False | . | . | . | 00:16:3e:59:63:19 |
[1423036.075340] cloud-init[566]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
[1423036.075381] cloud-init[566]: ci-info: | lo | True | ::1/128 | . | host | . |
[1423036.075423] cloud-init[566]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[1423036.075469] cloud-init[566]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[1423036.075512] cloud-init[566]: ci-info: +-------+-------------+---------+-----------+-------+
[1423036.075556] cloud-init[566]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[1423036.075602] cloud-init[566]: ci-info: +-------+-------------+---------+-----------+-------+
[1423036.075643] cloud-init[566]: ci-info: +-------+-------------+---------+-----------+-------+
...
CentOS Linux 8
Kernel 4.18.0-240.22.1.el8_3.x86_64 on an x86_64
and:
:; lxc list int9049 --columns=ns46t
+---------+---------+------+------+-----------------------+
| NAME | STATE | IPV4 | IPV6 | TYPE |
+---------+---------+------+------+-----------------------+
| int9049 | RUNNING | | | CONTAINER (EPHEMERAL) |
+---------+---------+------+------+-----------------------+
:; lxc exec int9049 -- bash
[root@int9049 ~]# journalctl -u NetworkManager-wait-online.service
-- Logs begin at Thu 2021-07-01 06:59:13 UTC, end at Thu 2021-07-01 07:01:01 UTC. --
Jul 01 06:59:14 int9049 systemd[1]: NetworkManager-wait-online.service: Failed to reset devices.list: Operation not permitted
Jul 01 06:59:14 int9049 systemd[1]: Starting Network Manager Wait Online...
Jul 01 07:00:14 int9049 systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jul 01 07:00:14 int9049 systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Jul 01 07:00:14 int9049 systemd[1]: Failed to start Network Manager Wait Online.
So, no image modifications were made, but still no network (the service above has failed). Regards
Can you look at /sys/class/net in your container? I'm guessing the issue may be that your kernel is too old to have properly namespaced network interface events, which then prevents NetworkManager from properly detecting the interface.
Hi @stgraber ,
here you are:
:; lxc launch --console --ephemeral images:centos/8/cloud int9049
...
and
:; lxc exec int9049 -- bash
[root@int9049 ~]# ls -al /sys/class/net
total 0
drwxr-xr-x 2 nobody nobody 0 Jul 2 08:29 .
drwxr-xr-x 57 nobody nobody 0 Jul 2 08:29 ..
lrwxrwxrwx 1 nobody nobody 0 Jul 2 08:29 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx 1 nobody nobody 0 Jul 2 08:29 lo -> ../../devices/virtual/net/lo
[root@int9049 ~]# ls -aLl /sys/class/net
total 0
drwxr-xr-x 2 nobody nobody 0 Jul 2 08:29 .
drwxr-xr-x 57 nobody nobody 0 Jul 2 08:29 ..
drwxr-xr-x 5 nobody nobody 0 Jul 2 08:29 eth0
drwxr-xr-x 5 nobody nobody 0 Jul 2 08:29 lo
[root@int9049 ~]# ls -al /sys/class/net/eth0/
total 0
drwxr-xr-x 5 nobody nobody 0 Jul 2 08:29 .
drwxr-xr-x 52 nobody nobody 0 Jul 2 08:29 ..
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:29 addr_assign_type
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 addr_len
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:29 address
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 broadcast
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:31 carrier
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 carrier_changes
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 carrier_down_count
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 carrier_up_count
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:30 dev_id
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 dev_port
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 dormant
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 duplex
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:31 flags
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:31 gro_flush_timeout
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:31 ifalias
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:30 ifindex
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 iflink
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 link_mode
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:31 mtu
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 name_assign_type
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:31 netdev_group
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 operstate
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:30 phys_port_id
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 phys_port_name
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 phys_switch_id
drwxr-xr-x 2 nobody nobody 0 Jul 2 08:31 power
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:31 proto_down
drwxr-xr-x 4 nobody nobody 0 Jul 2 08:31 queues
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:31 speed
drwxr-xr-x 2 nobody nobody 0 Jul 2 08:31 statistics
lrwxrwxrwx 1 nobody nobody 0 Jul 2 08:29 subsystem -> ../../../../class/net
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:31 tx_queue_len
-r--r--r-- 1 nobody nobody 4096 Jul 2 08:29 type
-rw-r--r-- 1 nobody nobody 4096 Jul 2 08:29 uevent
Regarding the kernel: cluster members run the following versions:
NOTE: no network in any case (e.g. when using the --target=MEMBER option for the launch command on either cluster member).
Regards.
Thanks for that, so it unfortunately shows the problem. The RHEL kernel is lacking support for proper ownership of network interfaces in sysfs. This is work @brauner did quite a while back which more modern distros have in their kernel.
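The ownership problem visible in the listing above (everything under /sys/class/net/eth0 owned by nobody) can be checked with a short script. This is only a sketch: `overflow_uid`, `classify_owner` and `check_iface_owner` are hypothetical helper names, it assumes GNU `stat`, and it assumes that "nobody" here corresponds to the kernel's overflow UID from /proc/sys/kernel/overflowuid (falling back to 65534). A "mapped" result does not by itself prove that NetworkManager will be happy.

```shell
# overflow_uid: print the UID the kernel uses for unmapped owners.
# Falls back to 65534 if the proc file cannot be read (assumption).
overflow_uid() {
    cat /proc/sys/kernel/overflowuid 2>/dev/null || echo 65534
}

# classify_owner UID OVERFLOW_UID: print "unmapped" when the owner is the
# overflow UID (shown as "nobody" in the container), otherwise "mapped".
classify_owner() {
    if [ "$1" = "$2" ]; then
        echo unmapped
    else
        echo mapped
    fi
}

# check_iface_owner [IFACE] [SYSFS_ROOT]: report who owns the sysfs
# directory of an interface, following the symlink in /sys/class/net.
check_iface_owner() {
    iface="${1:-eth0}"
    root="${2:-/sys/class/net}"
    uid="$(stat -L -c %u "$root/$iface")" || return 2
    echo "$iface: uid=$uid ($(classify_owner "$uid" "$(overflow_uid)"))"
}
```

Running `check_iface_owner eth0` inside the container should report "unmapped" on a system that shows the same listing as above.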
@brauner I'm only finding https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef6a4c88e9e11bc32cd02b052d04745af9691412 which is more recent than I remembered, was there something else earlier than that?
If Christian can give you the needed commits, it may be possible for you to file a bug against the RHEL kernel and have Red Hat integrate the needed fixes so this all works out of the box.
Otherwise, Network Manager is probably not going to work for you, so your options are pretty much to manually configure networking using dhclient or ip, then install another network management tool and use that instead.
There's not really anything we can do on our side about this. For cloud images, cloud-init requires NetworkManager and as you've noticed, NetworkManager is a bit picky about its interfaces...
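For the manual route, a rough sketch of what that could look like inside the container follows. Assumptions: the interface is eth0, as in the output earlier in this thread, and dhclient is actually installed in the image (not verified here). `run` and `bring_up` are hypothetical helper names; setting DRY_RUN=1 only prints the commands instead of executing them.

```shell
# run CMD...: execute the command, or just print it when DRY_RUN=1.
# (Hypothetical helper for illustration.)
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# bring_up [IFACE]: set the link up, then request a DHCP lease on it.
# Assumes dhclient is available in the container.
bring_up() {
    iface="${1:-eth0}"
    run ip link set dev "$iface" up
    run dhclient "$iface"
}
```

`DRY_RUN=1 bring_up eth0` shows what would be run; without DRY_RUN it needs root in the container. This does not survive a reboot, so it is a stopgap rather than a replacement for a network management tool.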
I've tried applying the patch but I get a compile error, so it is definitely not everything. Can someone please give me the complete patch so that I can chase it with Red Hat, if possible? @brauner ?
net/core/dev.c: In function 'dev_change_net_namespace':
net/core/dev.c:10180:8: error: implicit declaration of function 'netdev_change_owner'; did you mean 'netdev_change_features'? [-Werror=implicit-function-declaration]
err = netdev_change_owner(dev, net_old, net);
^~~~~~~
netdev_change_features
cc1: all warnings being treated as errors
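An error like this usually means a prerequisite commit is missing from the tree. A quick way to confirm that is to grep the unpatched tree for the function before building. The sketch below assumes GNU grep; `tree_has_symbol` is a hypothetical helper name, and the default directory net/core is a guess based only on the path in the error above.

```shell
# tree_has_symbol SYMBOL [DIR]: succeed if SYMBOL appears as a whole word
# anywhere under DIR. Run this on the tree BEFORE applying the patch,
# otherwise the new call site in dev.c will match as well.
tree_has_symbol() {
    symbol="$1"
    dir="${2:-net/core}"
    grep -rqw -- "$symbol" "$dir"
}
```

From the top of the kernel source, `tree_has_symbol netdev_change_owner || echo "not in this tree"` would then show whether the function needs to come from another commit.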
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f70ce185687bbe4e2d7ff126a8c890631f5fc2af
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0666a3aee762cd4f7981c2eed0fd8cab87533539
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=303a42769c4c4d8e5e3ad928df87eb36f8c1fa60
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2c4f9401ceb00167a3bfd322a28aa87b646a253f
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8f33e5d76a7a1b87e0cc760d05bf2477b4e91d6
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b52fc5d7876a312e6a964d7e626ba05ab1ea6b2
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e6dee9f3893c823dff9c7f33fe0a598ee25c78f7
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d755407d4444c3e0fbd7d7c3aa666d595e9ab217
Thank you, it actually applies fairly cleanly to the current RHEL8 kernel. I have filed a bug with Red Hat to try to convince them to backport it, let's see.
cloud-init variant of centos8 does not initialize networking
It no longer installs the network-scripts package and relies solely on NetworkManager, but that fails with "No suitable device found for this connection (device lo not available because device is strictly unmanaged)."
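To see which devices NetworkManager treats as unmanaged, as in the error message quoted above, the terse output of nmcli can be filtered. A small sketch, assuming `nmcli -t -f DEVICE,STATE device` prints one `name:state` pair per line; `unmanaged_devices` is a hypothetical helper name.

```shell
# unmanaged_devices: read "DEVICE:STATE" lines on stdin (terse nmcli output)
# and print the names of devices whose state is exactly "unmanaged".
unmanaged_devices() {
    awk -F: '$2 == "unmanaged" { print $1 }'
}

# Intended use inside the container:
#   nmcli -t -f DEVICE,STATE device | unmanaged_devices
```

If eth0 shows up in that list, NetworkManager is not going to bring it up on its own, which would be consistent with the failure described here.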