rhkdump / kdump-utils

Kernel crash dump collection utilities
GNU General Public License v2.0

nmcli logic in dracut-module-setup.sh fails after NetworkManager version update on specific systems #5

Open aburmash opened 2 months ago

aburmash commented 2 months ago

When a system that uses an iSCSI network disk for the root partition ( / ) is upgraded, and the dnf transaction contains both packages that trigger a kdump initramfs rebuild AND a NetworkManager update, the kdump service fails and the kdump initramfs is not regenerated. This happens, for example, during an update from NetworkManager 1.44 to 1.46 ( the CentOS Stream case ).

This is caused by a combination of expected NetworkManager behaviour and kdump using nmcli ( the NetworkManager command-line client for managing connections ) to generate a kdump initramfs image with network support.

After the update, the NetworkManager package does NOT restart the daemon ( this is expected and normal as well ); it only reloads the configuration. That leaves a mismatch between the running daemon and the installed libraries and clients ( also expected ). Certain nmcli commands then fail, because a new NetworkManager option added in 1.46 is not supported by the running 1.44 daemon ( also expected ).
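The daemon/client mismatch described above could be checked before any nmcli call is made. This is a minimal sketch, not code from kdump-utils: the helper names `nm_bare_version` and `nm_versions_differ` are hypothetical, and it assumes that `nmcli --version` prints a line ending in the client version and that `nmcli -t -f VERSION general` prints the version of the running daemon.

```shell
# Hypothetical helpers, not part of kdump-utils.

# Keep the last space-separated word and drop any "-release" suffix,
# e.g. "nmcli tool, version 1.46.0-1.el9" -> "1.46.0".
nm_bare_version() {
    nm_v=${1##* }
    printf '%s\n' "${nm_v%%-*}"
}

# Return 0 when the two version strings differ after normalisation.
nm_versions_differ() {
    [ "$(nm_bare_version "$1")" != "$(nm_bare_version "$2")" ]
}

# Only query the real tools when nmcli is installed and the daemon answers.
if command -v nmcli >/dev/null 2>&1; then
    client=$(nmcli --version)
    daemon=$(nmcli -t -f VERSION general 2>/dev/null)
    if [ -n "$daemon" ] && nm_versions_differ "$client" "$daemon"; then
        echo "nmcli ($client) and running NetworkManager ($daemon) differ" >&2
    fi
fi
```

Comparing the versions up front only detects the condition; it does not make the failing nmcli command succeed.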

For example, this command fails: https://github.com/rhkdump/kdump-utils/blob/main/dracut-module-setup.sh#L262

The error emitted in the kdump logs looks something like this:

Apr 29 20:49:11 test-kexec kdumpctl[193959]: Warning: nmcli (1.46.0) and NetworkManager (1.44.0) versions don't match. Restarting NetworkManager is advised.
Apr 29 20:49:11 test-kexec kdumpctl[193959]: Error: Failed to add 'Wired Connection' connection: connection.autoconnect-ports: unknown property
Apr 29 20:49:11 test-kexec kdumpctl[193949]: dracut: Failed to clone 269caef6-3a85-419e-a645-483f25e94417
Apr 29 20:49:11 test-kexec dracut[189638]: Failed to clone 269caef6-3a85-419e-a645-483f25e94417
Apr 29 20:49:11 test-kexec kdumpctl[189607]: dracut: Failed to install the .nmconnection for ens300f0np0
Apr 29 20:49:11 test-kexec dracut[189638]: Failed to install the .nmconnection for ens300f0np0
Apr 29 20:49:11 test-kexec kdumpctl[189324]: kdump: mkdumprd: failed to make kdump initrd
Apr 29 20:49:11 test-kexec kdumpctl[189324]: kdump: Starting kdump: [FAILED]

This happens because the running NM 1.44 daemon knows nothing about the NEW connection.autoconnect-ports property ( added in 1.46, missing in 1.44 ).

nmcli also emits a message suggesting restarting NetworkManager - "Warning: nmcli (1.46.0) and NetworkManager (1.44.0) versions don't match. Restarting NetworkManager is advised." This message is seen in journalctl logs and kdump service logs.

kdump fails to run those nmcli commands, and the service fails as a result. When a single update both triggers a kdump rebuild and updates NM, the user may end up with a non-functional kdump until a reboot or an NM + kdump service restart.

Not sure what the proper fix should be. From one point of view, "restart your system or restart NM + kdump" is a solution, but is there a better approach to look into?

coiby commented 2 months ago

Hi @aburmash,

Thanks for reporting this issue! I'm curious: why do you think it "is expected and normal as well" that the NetworkManager daemon doesn't get restarted after the update? What's the risk of automatically restarting NM after updating it?

aburmash commented 2 months ago

Potential network loss. LAN connections are likely to persist through an NM restart, but some connections may be interrupted for a while during the restart, which is a problem specifically for enterprise environments. Most ( if not all ) RH-based distros follow the RH logic and reload NM instead of restarting it. The change in the NM package itself goes back to https://bugzilla.redhat.com/show_bug.cgi?id=811200

I believe the main problem is not just the requirement to restart NM + kdump or reboot the machine. The problem is that when a yum update pulls in an NM update with a similar change AND some package that triggers a kdump initramfs rebuild, the kdump service is triggered in the background ( because monitored files changed ), not as a post-script. So when the kdump service fails during or after the yum update, you have no idea that kdump is not OK unless you specifically decide to inspect journalctl / systemctl status kdump.
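Until something surfaces that failure automatically, it could be checked by hand after an update. A rough sketch, assuming `systemctl is-active` prints the unit state ( active, failed, inactive, ... ); the helper `kdump_state_report` is hypothetical and only maps that state to a message and an exit status:

```shell
# Hypothetical helper: map a unit state string to a message and exit status.
kdump_state_report() {
    case "$1" in
        active)
            echo "kdump.service is active"
            return 0 ;;
        failed)
            echo "kdump.service FAILED: check 'journalctl -u kdump.service'"
            return 1 ;;
        *)
            echo "kdump.service state: ${1:-unknown}"
            return 2 ;;
    esac
}

# Query the real unit only where systemd is available.
if command -v systemctl >/dev/null 2>&1; then
    state=$(systemctl is-active kdump.service 2>/dev/null)
    kdump_state_report "$state" || echo "kdump needs attention" >&2
fi
```

Running this right after `yum update` only reports the state; recovery is still a restart of NM + kdump or a reboot, as described above.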

coiby commented 1 month ago

Thanks for explaining the side-effect of restarting NM to me!

I believe kdump.service gets restarted because kexec-tools (kdump-utils) itself gets updated and the following scriptlet is triggered:

%postun
%systemd_postun_with_restart kdump.service

Can you confirm it? If that's the case, maybe we can add an exception for this case where "Warning: nmcli (1.46.0) and NetworkManager (1.44.0) versions don't match. Restarting NetworkManager is advised" is detected.
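Detecting that warning could look roughly like the sketch below. It is not code from dracut-module-setup.sh: the helper names `nmcli_reports_mismatch` and `nmcli_mismatch_versions` are hypothetical, and it assumes the captured nmcli output contains the warning text exactly as quoted in this issue.

```shell
# Hypothetical helper: succeed when captured nmcli output contains the
# version mismatch warning quoted in this issue.
nmcli_reports_mismatch() {
    printf '%s\n' "$1" | grep -q "versions don't match"
}

# Hypothetical helper: print "<client> <daemon>" parsed from the warning.
nmcli_mismatch_versions() {
    printf '%s\n' "$1" |
        sed -n "s/.*nmcli (\([^)]*\)) and NetworkManager (\([^)]*\)) versions don't match.*/\1 \2/p"
}

# Sample text copied from the log in this issue.
out="Warning: nmcli (1.46.0) and NetworkManager (1.44.0) versions don't match. Restarting NetworkManager is advised."
if nmcli_reports_mismatch "$out"; then
    echo "NetworkManager restart needed (client daemon): $(nmcli_mismatch_versions "$out")" >&2
fi
```

Matching on message text is fragile if the wording changes or is translated, so whatever the exception does ( skip, retry, or print a clearer error ) would need to tolerate the check not firing.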