lxc / lxc-ci

LXC continuous integration and build scripts
https://jenkins.linuxcontainers.org
Apache License 2.0

centos/8/cloud does not initialize network #337

Closed aivanise closed 3 years ago

aivanise commented 3 years ago

The cloud-init variant of the centos8 image does not initialize networking.

It no longer installs the network-scripts package and relies solely on NetworkManager, but that fails with "No suitable device found for this connection (device lo not available because device is strictly unmanaged)."

[root@aalex ~]# journalctl -u NetworkManager
-- Logs begin at Fri 2021-06-25 06:22:28 UTC, end at Fri 2021-06-25 06:23:31 UTC. --
...
Jun 25 06:22:29 aalex NetworkManager[554]: <info>  [1624602149.6998] manager: Networking is enabled by state file
Jun 25 06:22:29 aalex NetworkManager[554]: <info>  [1624602149.6999] dhcp-init: Using DHCP client 'internal'
Jun 25 06:22:29 aalex NetworkManager[554]: <info>  [1624602149.7009] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.32.0-0.5.el8/libnm-settings-plugin-ifcfg-rh.so")
Jun 25 06:22:29 aalex NetworkManager[554]: <info>  [1624602149.7010] settings: Loaded settings plugin: keyfile (internal)
Jun 25 06:22:29 aalex NetworkManager[554]: <info>  [1624602149.7041] device (lo): carrier: link connected
Jun 25 06:22:29 aalex NetworkManager[554]: <info>  [1624602149.7043] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Jun 25 06:22:29 aalex NetworkManager[554]: <info>  [1624602149.7051] manager: (eth0): new Veth device (/org/freedesktop/NetworkManager/Devices/2)
Jun 25 06:23:29 aalex NetworkManager[554]: <info>  [1624602209.7678] agent-manager: agent[c052cedeed60df4e,:1.8/nmcli-connect/0]: agent registered
Jun 25 06:23:29 aalex NetworkManager[554]: <info>  [1624602209.7684] audit: op="connection-activate" uuid="5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03" name="System eth0" result="fail" reason="No suitable device found for this connection (device lo not available because device is strictly unmanaged)."
Jun 25 06:23:31 aalex systemd[1]: Reloading Network Manager.
Jun 25 06:23:31 aalex NetworkManager[554]: <info>  [1624602211.6271] audit: op="reload" arg="0" pid=677 uid=0 result="success"
Jun 25 06:23:31 aalex NetworkManager[554]: <info>  [1624602211.6274] config: signal: SIGHUP (no changes from disk)
Jun 25 06:23:31 aalex systemd[1]: Reloaded Network Manager.

stgraber commented 3 years ago

Yeah, cloud-init only deals with NetworkManager. We do however have daily tests for all our images, which include confirming that the network comes up.

So this suggests an environment which is making NetworkManager unhappy. It's unfortunately known that NetworkManager doesn't deal well with macvlan, so if you're using macvlan, that's going to be a problem. If using normal bridging, then the issue may be with the version of the kernel you're running.

3eka commented 3 years ago

Hi all, network does not work w/o NetworkManager either...

For centos/8/default:

Regards.

aivanise commented 3 years ago

I'm using veth and normal bridging, and it was all working until the switch from network-scripts to NetworkManager (https://github.com/lxc/lxc-ci/commit/f951837e5b62c15c9eb482691dc45c12e848afc8) was made. I'm also running the latest RHEL 8.4 kernel, but it also fails on 8.3.

Can you please point me to how the containers are created in your daily tests? Maybe that will give me some clues.

stgraber commented 3 years ago

stgraber@castiana:~$ lxc launch images:centos/8/cloud c1
Creating c1
Starting c1
stgraber@castiana:~$ lxc launch images:centos/8 c2
Creating c2
Starting c2
stgraber@castiana:~$ lxc list c1
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| NAME |  STATE  |        IPV4         |                     IPV6                      |   TYPE    | SNAPSHOTS |
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| c1   | RUNNING | 10.166.11.32 (eth0) | fd42:4c81:5770:1eaf:216:3eff:fe42:485a (eth0) | CONTAINER | 0         |
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
stgraber@castiana:~$ lxc list c2
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| NAME |  STATE  |        IPV4         |                     IPV6                      |   TYPE    | SNAPSHOTS |
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| c2   | RUNNING | 10.166.11.60 (eth0) | fd42:4c81:5770:1eaf:216:3eff:fe47:3dbd (eth0) | CONTAINER | 0         |
+------+---------+---------------------+-----------------------------------------------+-----------+-----------+

That's pretty much all we do in our tests.

aivanise commented 3 years ago

Is there anything special in the default profile? Host networking is bridged, I presume?

stgraber commented 3 years ago

Nothing special, normal lxdbr0 bridge setup.

3eka commented 3 years ago

Hi @stgraber,

I have tried:

:; lxc launch --console --ephemeral images:centos/8/cloud int9049
...
[1422975.484197] cloud-init[530]: Cloud-init v. 20.3-10.el8_4.2 running 'init-local' at Thu, 01 Jul 2021 06:59:13 +0000. Up 2.26 seconds.
[  OK  ] Started Initial cloud-init job (pre-networking).
[  OK  ] Reached target Network (Pre).
         Starting Network Manager...
[  OK  ] Started Network Manager.
         Starting Network Manager Wait Online...
[  OK  ] Reached target Network.
         Starting Network Manager Script Dispatcher Service...
[  OK  ] Started Network Manager Script Dispatcher Service.
[FAILED] Failed to start Network Manager Wait Online.
See 'systemctl status NetworkManager-wait-online.service' for details.
         Starting Activate connection...
         Starting Initial cloud-init job (metadata service crawler)...
[  OK  ] Started Activate connection.
[1423036.074977] cloud-init[566]: Cloud-init v. 20.3-10.el8_4.2 running 'init' at Thu, 01 Jul 2021 07:00:14 +0000. Up 63.01 seconds.
[1423036.075096] cloud-init[566]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[1423036.075159] cloud-init[566]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[1423036.075202] cloud-init[566]: ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
[1423036.075252] cloud-init[566]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[1423036.075296] cloud-init[566]: ci-info: |  eth0  | False |     .     |     .     |   .   | 00:16:3e:59:63:19 |
[1423036.075340] cloud-init[566]: ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
[1423036.075381] cloud-init[566]: ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
[1423036.075423] cloud-init[566]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[1423036.075469] cloud-init[566]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[1423036.075512] cloud-init[566]: ci-info: +-------+-------------+---------+-----------+-------+
[1423036.075556] cloud-init[566]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[1423036.075602] cloud-init[566]: ci-info: +-------+-------------+---------+-----------+-------+
[1423036.075643] cloud-init[566]: ci-info: +-------+-------------+---------+-----------+-------+
...
CentOS Linux 8
Kernel 4.18.0-240.22.1.el8_3.x86_64 on an x86_64

and:

:; lxc list int9049 --columns=ns46t
+---------+---------+------+------+-----------------------+
|  NAME   |  STATE  | IPV4 | IPV6 |         TYPE          |
+---------+---------+------+------+-----------------------+
| int9049 | RUNNING |      |      | CONTAINER (EPHEMERAL) |
+---------+---------+------+------+-----------------------+

:; lxc exec int9049 -- bash
[root@int9049 ~]# journalctl -u NetworkManager-wait-online.service
-- Logs begin at Thu 2021-07-01 06:59:13 UTC, end at Thu 2021-07-01 07:01:01 UTC. --
Jul 01 06:59:14 int9049 systemd[1]: NetworkManager-wait-online.service: Failed to reset devices.list: Operation not permitted
Jul 01 06:59:14 int9049 systemd[1]: Starting Network Manager Wait Online...
Jul 01 07:00:14 int9049 systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jul 01 07:00:14 int9049 systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Jul 01 07:00:14 int9049 systemd[1]: Failed to start Network Manager Wait Online.

So, no image modifications were made, but there is still no network (the service above has failed). Regards.
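
To see how NetworkManager classifies the devices in a container like this one, the terse output of `nmcli` can be filtered. This is a minimal sketch, assuming `nmcli -t -f DEVICE,STATE device` prints lines of the form `name:state`. The function name `unmanaged_devices` is hypothetical, and whether `eth0` is actually reported as unmanaged here is not shown in the logs above.

```shell
#!/bin/sh
# Hypothetical helper: read "DEVICE:STATE" lines on stdin and print the
# names of devices whose state is exactly "unmanaged".
unmanaged_devices() {
    awk -F: '$2 == "unmanaged" { print $1 }'
}

# Example invocation inside the container:
#   nmcli -t -f DEVICE,STATE device | unmanaged_devices
```

Reading from stdin keeps the filter usable on saved output as well as on a live `nmcli` call.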

stgraber commented 3 years ago

Can you look at /sys/class/net in your container? I'm guessing the issue may be that your kernel is too old to have properly namespaced network interface events, which then prevents NetworkManager from properly detecting the interface.

3eka commented 3 years ago

Hi @stgraber ,

here you are:

:; lxc launch --console --ephemeral images:centos/8/cloud int9049
...

and

:; lxc exec int9049 -- bash
[root@int9049 ~]# ls -al /sys/class/net
total 0
drwxr-xr-x  2 nobody nobody 0 Jul  2 08:29 .
drwxr-xr-x 57 nobody nobody 0 Jul  2 08:29 ..
lrwxrwxrwx  1 nobody nobody 0 Jul  2 08:29 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx  1 nobody nobody 0 Jul  2 08:29 lo -> ../../devices/virtual/net/lo
[root@int9049 ~]# ls -aLl /sys/class/net
total 0
drwxr-xr-x  2 nobody nobody 0 Jul  2 08:29 .
drwxr-xr-x 57 nobody nobody 0 Jul  2 08:29 ..
drwxr-xr-x  5 nobody nobody 0 Jul  2 08:29 eth0
drwxr-xr-x  5 nobody nobody 0 Jul  2 08:29 lo
[root@int9049 ~]# ls -al /sys/class/net/eth0/
total 0
drwxr-xr-x  5 nobody nobody    0 Jul  2 08:29 .
drwxr-xr-x 52 nobody nobody    0 Jul  2 08:29 ..
-r--r--r--  1 nobody nobody 4096 Jul  2 08:29 addr_assign_type
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 addr_len
-r--r--r--  1 nobody nobody 4096 Jul  2 08:29 address
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 broadcast
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:31 carrier
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 carrier_changes
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 carrier_down_count
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 carrier_up_count
-r--r--r--  1 nobody nobody 4096 Jul  2 08:30 dev_id
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 dev_port
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 dormant
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 duplex
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:31 flags
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:31 gro_flush_timeout
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:31 ifalias
-r--r--r--  1 nobody nobody 4096 Jul  2 08:30 ifindex
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 iflink
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 link_mode
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:31 mtu
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 name_assign_type
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:31 netdev_group
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 operstate
-r--r--r--  1 nobody nobody 4096 Jul  2 08:30 phys_port_id
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 phys_port_name
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 phys_switch_id
drwxr-xr-x  2 nobody nobody    0 Jul  2 08:31 power
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:31 proto_down
drwxr-xr-x  4 nobody nobody    0 Jul  2 08:31 queues
-r--r--r--  1 nobody nobody 4096 Jul  2 08:31 speed
drwxr-xr-x  2 nobody nobody    0 Jul  2 08:31 statistics
lrwxrwxrwx  1 nobody nobody    0 Jul  2 08:29 subsystem -> ../../../../class/net
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:31 tx_queue_len
-r--r--r--  1 nobody nobody 4096 Jul  2 08:29 type
-rw-r--r--  1 nobody nobody 4096 Jul  2 08:29 uevent

Regarding the kernel: cluster members run the following versions:

NOTE: no network in any case (e.g. when using the --target=MEMBER option for the launch command on either cluster member). Regards.

stgraber commented 3 years ago

Thanks for that, so it unfortunately shows the problem. The RHEL kernel is lacking support for proper ownership of network interfaces in sysfs. This is work @brauner did quite a while back which more modern distros have in their kernel.
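
The ownership symptom described here can be checked from inside a container. This is a minimal sketch, assuming GNU `stat` is available and that unmapped owners appear as the overflow UID 65534 (displayed as `nobody` in the listing earlier in this thread). The function name `netdevs_owned_by` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the names of entries in a directory
# (default /sys/class/net) whose owner UID equals the given UID
# (default 65534, the usual overflow UID displayed as "nobody").
netdevs_owned_by() {
    dir="${1:-/sys/class/net}"
    uid="${2:-65534}"
    for dev in "$dir"/*; do
        # Skip the literal pattern when the directory is empty or missing.
        [ -e "$dev" ] || continue
        # -L follows the symlink to the device directory; %u prints the owner UID.
        owner=$(stat -L -c '%u' "$dev") || continue
        if [ "$owner" = "$uid" ]; then
            basename "$dev"
        fi
    done
}
```

Called with no arguments inside a container on a kernel like the one in the listing above, `netdevs_owned_by` would be expected to print `eth0` and `lo`. On a kernel carrying the sysfs ownership changes, the expectation (an assumption, not verified in this thread) is that it prints nothing.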

@brauner I'm only finding https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ef6a4c88e9e11bc32cd02b052d04745af9691412 which is more recent than I remembered, was there something else earlier than that?

If Christian can give you the needed commits, it may be possible for you to file a bug against the RHEL kernel and have Red Hat integrate the needed fixes so this all works out of the box.

Otherwise, Network Manager is probably not going to work for you, so your options are pretty much to manually configure networking using dhclient or ip, then install another network management tool and use that instead.
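
The manual fallback mentioned here could look roughly like this inside the container. It is a sketch only: the interface name `eth0` is taken from the logs in this thread, `dhclient` is assumed to be installed (the thread does not confirm that the image ships it), and the `DRY_RUN` switch and the function name `manual_dhcp` are hypothetical additions for previewing the commands.

```shell
#!/bin/sh
# Hypothetical helper: bring an interface up and request a DHCP lease.
manual_dhcp() {
    ifname="${1:-eth0}"
    # With DRY_RUN set to a non-empty value, commands are echoed, not run.
    run="${DRY_RUN:+echo}"
    # Bring the link up first, then ask for a lease on it.
    $run ip link set dev "$ifname" up &&
    $run dhclient "$ifname"
}
```

Running `DRY_RUN=1 manual_dhcp eth0` only prints the two commands. Without `DRY_RUN` it needs root inside the container, and the resulting configuration does not persist across a restart.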

There's not really anything we can do on our side about this. For cloud images, cloud-init requires NetworkManager and as you've noticed, NetworkManager is a bit picky about its interfaces...

aivanise commented 3 years ago

I've tried applying the patch but I get a compile error, so it is definitely not everything. Can someone please give me the complete patch so that I can chase it with Red Hat, if possible? @brauner?

net/core/dev.c: In function 'dev_change_net_namespace':
net/core/dev.c:10180:8: error: implicit declaration of function 'netdev_change_owner'; did you mean 'netdev_change_features'? [-Werror=implicit-function-declaration]
  err = netdev_change_owner(dev, net_old, net);
        ^~~~~~~
        netdev_change_features
cc1: all warnings being treated as errors

brauner commented 3 years ago

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f70ce185687bbe4e2d7ff126a8c890631f5fc2af
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0666a3aee762cd4f7981c2eed0fd8cab87533539
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=303a42769c4c4d8e5e3ad928df87eb36f8c1fa60
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2c4f9401ceb00167a3bfd322a28aa87b646a253f
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8f33e5d76a7a1b87e0cc760d05bf2477b4e91d6
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b52fc5d7876a312e6a964d7e626ba05ab1ea6b2
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e6dee9f3893c823dff9c7f33fe0a598ee25c78f7
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d755407d4444c3e0fbd7d7c3aa666d595e9ab217

brauner commented 3 years ago

Everything's in this PR too: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ebb4a4bf76f164457184a3f43ebc1552416bc823

aivanise commented 3 years ago

Thank you, it actually applies fairly cleanly to the current RHEL 8 kernel. I have filed a bug with Red Hat to try to convince them to backport it. Let's see.

https://bugzilla.redhat.com/show_bug.cgi?id=1979820