naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0
151 stars 63 forks source link

Naemon service fails to start after 1.4.2-19.1 RPM upgrade #454

Closed imre-sidn closed 6 months ago

imre-sidn commented 6 months ago

We have a dnf-automatic timer that periodically updates our RPM packages. Last night a number of packages, including Naemon, got updated. Unfortunately the naemon.service systemd unit got restarted as part of this update, and did not come up correctly.

Relevant log excerpts:

/var/log/dnf/dnf.log

2024-02-20T04:34:26+0000 DEBUG Upgraded: gnutls-3.6.16-8.el8_9.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: libmaxminddb-1.2.0-10.el8_9.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: libnaemon-1.4.2-19.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: libssh-0.9.6-13.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: libssh-config-0.9.6-13.el8_9.noarch
2024-02-20T04:34:26+0000 DEBUG Upgraded: naemon-1.4.2-14.1.noarch
2024-02-20T04:34:26+0000 DEBUG Upgraded: naemon-core-1.4.2-19.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: naemon-livestatus-1.4.2-15.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: naemon-thruk-1.4.2-14.1.noarch
2024-02-20T04:34:26+0000 DEBUG Upgraded: naemon-vimvault-1.4.2-4.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: nss-3.90.0-6.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: nss-softokn-3.90.0-6.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: nss-softokn-freebl-3.90.0-6.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: nss-sysinit-3.90.0-6.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: nss-tools-3.90.0-6.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: nss-util-3.90.0-6.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: openssh-8.0p1-19.el8_9.2.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: openssh-clients-8.0p1-19.el8_9.2.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: openssh-server-8.0p1-19.el8_9.2.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: python3-rpm-4.14.3-28.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: python3-unbound-1.19.1-1.el8.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: rpm-4.14.3-28.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: rpm-build-libs-4.14.3-28.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: rpm-libs-4.14.3-28.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: rpm-plugin-selinux-4.14.3-28.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: rpm-plugin-systemd-inhibit-4.14.3-28.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: sos-4.6.1-1.el8.noarch
2024-02-20T04:34:26+0000 DEBUG Upgraded: sudo-1.9.5p2-1.el8_9.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: tcpdump-14:4.9.3-3.el8_9.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: thruk-3.12-37.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: thruk-base-3.12-37.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: thruk-plugin-reporting-3.12-37.1.x86_64
2024-02-20T04:34:26+0000 DEBUG Upgraded: tzdata-2024a-1.el8.noarch
2024-02-20T04:34:26+0000 DEBUG Upgraded: unbound-libs-1.19.1-1.el8.x86_64

/var/log/naemon/naemon.log

[1708403630] Caught 'Terminated', shutting down...
[1708403630] Retention data successfully saved.
[1708403630] Successfully shutdown... (PID=1877)
[1708403630] Event broker module 'NERD' deinitialized successfully.
[1708403630] livestatus: deinitializing
[1708403630] livestatus: Logfile cache: flushing complete cache.
[1708403630] Event broker module '/usr/lib64/naemon/naemon-livestatus/livestatus.so' deinitialized successfully.
[1708403630] Naemon 1.4.2 starting... (PID=3350017)
[1708403630] Local time is Tue Feb 20 04:33:50 UTC 2024
[1708403630] LOG VERSION: 2.0
[1708403630] qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
[1708403630] nerd: Channel hostchecks registered successfully
[1708403630] nerd: Channel servicechecks registered successfully
[1708403630] nerd: Fully initialized and ready to rock!
[1708403630] Error: Module '/usr/lib64/naemon/naemon-livestatus/livestatus.so' is using an incompatible version (v5) of the event broker API (current version: v6). Module will be unloaded.
[1708403630] Event broker module '/usr/lib64/naemon/naemon-livestatus/livestatus.so' deinitialized successfully.
[1708403630] Error: Failed to load module '/usr/lib64/naemon/naemon-livestatus/livestatus.so'.
[1708403630] Error: Module loading failed. Aborting.
[1708403630] Event broker module 'NERD' deinitialized successfully.

Starting the failed service again later today resolved the issue. It looks like the Naemon service is restarted before the naemon-livestatus has been updated, resulting in an API compatibility during startup. I think this can be solved by delaying the Naemon service restart until after the naemon-livestatus package update has completed.

sni commented 6 months ago

It looks like the Naemon service is restarted before the naemon-livestatus has been updated, resulting in an API compatibility during startup. I think this can be solved by delaying the Naemon service restart until after the naemon-livestatus package update has completed.

That's sounds quite likely. Any idea if this is possible with RPMs?

nook24 commented 6 months ago

I don't know how the current packages are build, but the only package that should restart Naemon is naemon-core. At the moment there are probably several packages that are doing restarts of Naemon?

With deb this is pretty straightforward using a activate trigger. So all other packages such as naemon-livestatus can trigger naemon-core to do the restart. If multiple packages trigger, only one restart will be executed.

With rpm nothing is as easy as it should be. My solution to this is to use post-transaction-actions. https://github.com/rpm-software-management/dnf-plugins-core/blob/master/doc/post-transaction-actions.rst

sni commented 6 months ago

the livestatus packages does a condrestart in its %post script: https://github.com/naemon/naemon-livestatus/blob/master/naemon-livestatus.spec#L53 But that's not the issue here, because at that point naemon is stopped already.

The core package does a condrestart in the %post script: https://github.com/naemon/naemon-core/blob/master/naemon-core.spec#L196 But it seems like a good idea to move them to a later stage, ex the %posttrans

nook24 commented 6 months ago

I think condrestart is not helping in in case as naemon is running, but the new Naemon binary can not restart due to the old broker version is still in place 🤔

imre-sidn commented 6 months ago

Hi, thanks for your input on this. I'm looking at building a container image for easy reproducibility of this bug. Do you guys know where I can find the previous (1.4.1) RPMs? I can also build the RPMs myself if an archive is not available.

sni commented 6 months ago

try these: https://labs.consol.de/naemon/release/v1.4.1/

imre-sidn commented 6 months ago

I've created a container image (Containerfile) based on Rocky Linux 8 with libnaemon, naemon-core and naemon-livestatus 1.4.1 installed, and 1.4.2 RPMs on the filesystem. Steps to reproduce this issue with Docker or Podman:

  1. podman run --rm --detach --name naemon-livestatus ghcr.io/imre-sidn/naemon-livestatus:1.4.1
  2. podman exec --interactive --tty naemon-livestatus bash
  3. dnf install -y ./libnaemon-1.4.2.rhel8.x86_64.rpm ./naemon-core-1.4.2.rhel8.x86_64.rpm ./naemon-livestatus-1.4.2.rhel8.x86_64.rpm
  4. tail -n 20 /var/log/naemon/naemon.log
sni commented 6 months ago

should be fine then with the next release. Thanks for fixing this.