lxc / distrobuilder

System container image builder for LXC and Incus
https://linuxcontainers.org
Apache License 2.0

AlmaLinux 9 main interface is always down on boot #701

Closed lukasz-zaroda closed 1 year ago

lukasz-zaroda commented 1 year ago

It took a lot of tracking down, but I found the culprit in this repo.

Host: AlmaLinux 9.1
Guest: AlmaLinux 9.1
nictype: routed

Tested on image:

  image.architecture: amd64
  image.description: Almalinux 9 amd64 (20230221_23:08)
  image.os: Almalinux
  image.release: "9"
  image.serial: "20230221_23:08"
  image.type: squashfs
  image.variant: cloud

Basically, the main interface is always down on boot (none of NetworkManager's options for bringing it up work), and nmcli c show lists two conflicting connections:

[root@alma ~]# nmcli c show
NAME         UUID                                  TYPE      DEVICE 
eth0         94ee35c6-73f6-445b-bcd1-b8dfe265909f  ethernet  eth0   
System eth0  5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03  ethernet  --
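
To spot this situation from a script, here is a minimal sketch. It assumes the terse output of nmcli -t -f NAME,DEVICE connection show (one colon-separated profile per line) is piped in, and that a profile without a device has an empty or "--" device field. The function name list_unbound_profiles is hypothetical, and profile names containing escaped colons are not handled.

```shell
# Hypothetical helper: print the names of connection profiles that are
# not bound to any device. Reads "NAME:DEVICE" lines on stdin.
list_unbound_profiles() {
    while IFS= read -r line; do
        [ -n "$line" ] || continue   # skip empty lines
        name=${line%:*}              # everything before the last colon
        device=${line##*:}           # everything after the last colon
        case "$device" in
            ""|--) printf '%s\n' "$name" ;;
        esac
    done
}
```

Inside the container it could be used as: nmcli -t -f NAME,DEVICE connection show | list_unbound_profiles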

It turns out that the "eth0" connection, which is the incorrect one (removing it fixes networking), is created in /etc/systemd/system-generators/lxc by the fix_nm_link_state eth0 call at the very end. Here is the location in the code: https://github.com/lxc/distrobuilder/blame/master/distrobuilder/main.go#L785

Just preventing this function from running restores networking on boot.

This function was introduced in https://github.com/lxc/distrobuilder/pull/642 .

The comment over the function's declaration states:

# fix_nm_link_state forces the network interface to a DOWN state ahead of NetworkManager starting up

Unfortunately, it messes up the connections in the process, so NM cannot get out of the DOWN state by itself, at least on AlmaLinux 9 (and probably other RH-based distros).

The workaround for this issue is to either comment out the fix_nm_link_state eth0 line and reboot, or run nmcli c delete eth0 to remove the faulty connection. This has to be repeated after every reboot, though. Adding this to the cloud-init configuration seems to work:

bootcmd:
  - nmcli c delete eth0 || true
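
A slightly more cautious variant of the same workaround, as a sketch: delete the generated eth0 profile only when a System eth0 profile is also present, so the command does nothing on images that do not show the conflict. It assumes profile names arrive one per line (for example from nmcli -t -f NAME connection show). The name has_profile_conflict is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) only if both the generated
# "eth0" profile and the "System eth0" profile appear on stdin.
has_profile_conflict() {
    generated=0
    system=0
    while IFS= read -r name; do
        case "$name" in
            "eth0") generated=1 ;;
            "System eth0") system=1 ;;
        esac
    done
    [ "$generated" -eq 1 ] && [ "$system" -eq 1 ]
}
```

Possible usage: nmcli -t -f NAME connection show | has_profile_conflict && nmcli connection delete eth0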

monstermunchkin commented 1 year ago

@lukasz-zaroda are you still experiencing this issue?

I tried almalinux/9/cloud, and there is no eth0 interface. The System eth0 works as expected.

lukasz-zaroda commented 1 year ago

> @lukasz-zaroda are you still experiencing this issue?

Yes, although I no longer use AlmaLinux 9 as a host. I just tested it with an Ubuntu 22.04 host, and the problem persists. Please let me know if I can provide any additional information.

This is the profile used. Image: almalinux/9/cloud.

config:
  user.network-config: |
    #cloud-config
    version: 2
    ethernets:
      eth0:
        dhcp4: no
        dhcp6: no
        addresses:
        - [MY IP ADDRESS]/32
        nameservers:
          addresses:
          - 8.8.8.8
          - 8.8.4.4
          search: []
        routes:
        - to: 0.0.0.0/0
          via: 169.254.0.1
          on-link: true
  user.user-data: |
    #cloud-config
    package_update: true
    package_upgrade: true
    packages:
      - openssh-server
description: AlmaLinux Test.
devices:
  eth0:
    host_name: veth-alma
    ipv4.address: [MY IP ADDRESS]
    nictype: routed
    parent: enp1s0f0
    type: nic
  root:
    path: /
    pool: default
    size: 3GB
    type: disk
name: almalinux-test

monstermunchkin commented 1 year ago

I was able to reproduce your issue using the provided profile. However, I believe it is related to the routed eth0 device, as I couldn't reproduce it otherwise.

For now at least we won't change the system-generator as it solves most problems around NetworkManager in containers, and we don't want to break any other images using it.

Instead, we should allow masking units created by the lxc system-generator. Your issue can then be solved by running systemctl mask network-device-down.service and restarting the container. This only has to be done once, and there's no need to modify the cloud-config.
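
From the host, the suggested fix could be applied roughly as follows. This is a sketch under assumptions: it presumes an image whose lxc system-generator honors masked units, a container name passed as an argument, and the lxc client (set CLIENT=incus for Incus). The names run, mask_link_down_unit and DRY_RUN are hypothetical; with DRY_RUN=1 the commands are only printed.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: mask the unit inside the container, then restart it.
mask_link_down_unit() {
    container=$1
    client=${CLIENT:-lxc}   # assumption: lxc by default, incus also works
    run "$client" exec "$container" -- systemctl mask network-device-down.service &&
    run "$client" restart "$container"
}
```

For example, DRY_RUN=1 mask_link_down_unit my-container shows the two commands without running them.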

lukasz-zaroda commented 1 year ago

Thanks, that will help! It doesn't change the fact, though, that a container of a common distro family in a common configuration is broken out of the box, which ideally should be fixed. If the system-generator is currently too convoluted to risk fixing this, then maybe some refactoring is needed, and this task should simply be postponed instead of closed? Just my 2c. Either way, the new workaround is very welcome! :)