liske / needrestart

Restart daemons after library updates.
GNU General Public License v2.0
426 stars 67 forks source link

New to needrestart, need help understanding a situation #258

Closed koitsu closed 1 year ago

koitsu commented 1 year ago

I'm currently running Ubuntu 22.04 LTS (jammy). I have package wsdd installed, and the service runs fine. However, I've noticed that any time there is a super basic update that should not warrant a service restart -- for example, nothing more than a kernel upgrade -- that needrestart starts complaining about the wsdd service. If you want to know what wsdd is: https://github.com/christgau/wsdd

What I'm trying to understand is why any part of the system thinks this userland service -- and only this service -- needs to be restarted simply because of a kernel update. No libraries are changing, nothing relating to Python 3 is changing, no packages are changing, so I'm confused.

Part of me suspects that the wsdd systemd unit is incorrect in some fashion, but I have no idea how/why.

Here's the real-world situation I have seen time and time again:

$ sudo -i
# export DEBIAN_FRONTEND=noninteractive
# apt-get -y update && apt-get -y dist-upgrade
Hit:1 http://us.archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://us.archive.ubuntu.com/ubuntu jammy-updates InRelease [114 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu jammy-backports InRelease [99.8 kB]
Get:4 http://us.archive.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:5 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [734 kB]
Get:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main Translation-en [165 kB]
Get:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [453 kB]
Get:8 http://us.archive.ubuntu.com/ubuntu jammy-updates/restricted Translation-en [69.3 kB]
Get:9 http://us.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [757 kB]
Fetched 2,503 kB in 2s (1,404 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  linux-firmware
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 240 MB of archives.
After this operation, 8,080 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-firmware all 20220329.git681281e4-0ubuntu3.7 [240 MB]
Fetched 240 MB in 7s (33.5 MB/s)
(Reading database ... 111537 files and directories currently installed.)
Preparing to unpack .../linux-firmware_20220329.git681281e4-0ubuntu3.7_all.deb ...
Unpacking linux-firmware (20220329.git681281e4-0ubuntu3.7) over (20220329.git681281e4-0ubuntu3.6) ...
Setting up linux-firmware (20220329.git681281e4-0ubuntu3.7) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-53-generic
update-initramfs: Generating /boot/initrd.img-5.15.0-52-generic
NEEDRESTART-VER: 3.5
NEEDRESTART-KCUR: 5.15.0-53-generic
NEEDRESTART-KEXP: 5.15.0-53-generic
NEEDRESTART-KSTA: 1
NEEDRESTART-UCSTA: 1
NEEDRESTART-UCCUR: 0x002f
NEEDRESTART-UCEXP: 0x002f
NEEDRESTART-SVC: wsdd.service

I ran needrestart -b -vvvv to try and figure out what's going on:

$ sudo needrestart -b -vvvv
[main] eval /etc/needrestart/needrestart.conf
[main] needrestart v3.5
[main] running in root mode
[main] systemd detected
NEEDRESTART-VER: 3.5
[Core] #586 is a NeedRestart::Interp::Python
[Python] #586: source=/usr/bin/networkd-dispatcher
[Core] #4478 is a NeedRestart::Interp::Python
[Python] #4478: source=/usr/sbin/wsdd
[Core] #4478 uses obsolete script file(s):
[Core] #4478  /usr/sbin/wsdd
[main] #4478 is not a child
[main] #4478 exe => /usr/bin/python3.10
[Core] #4478 is a NeedRestart::Interp::Python
[Core] #4478 source is /usr/sbin/wsdd
[main] trying systemctl status
[main] #4478 is wsdd.service
[ucode] using NeedRestart::uCode::Intel
[ucode] using NeedRestart::uCode::AMD
[uCode/Intel] #0 current revision: 0x002f
+ iucode_tool --scan-system
+ grep -oE [^[:space:]]+$
+ sig=0x000306d4
+ [ -r /sys/devices/system/cpu/cpu0/microcode/processor_flags ]
+ cat /sys/devices/system/cpu/cpu0/microcode/processor_flags
+ filter=-s 0x000306d4,0x40
+ test -r /etc/needrestart/iucode.sh
+ . /etc/needrestart/iucode.sh
+ type bsdtar
+ IUCODE_TOOL_EXTRA_OPTIONS=
+ test -r /etc/default/intel-microcode
+ . /etc/default/intel-microcode
+ test  = no
+ [ -r /usr/share/misc/intel-microcode* ]
+ iucode_tool -l -s 0x000306d4,0x40 --ignore-broken -tb /lib/firmware/intel-ucode
+ grep 0x000306d4
[uCode/Intel] #0 available revision: 0x002f
[uCode/AMD] #0 cpu vendor id mismatch
[Kernel] Linux: kernel release 5.15.0-53-generic, kernel version #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022
[Kernel/Linux] /boot/vmlinuz.old => 5.15.0-52-generic (buildd@lcy02-amd64-032) #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 [5.15.0-52-generic]
[Kernel/Linux] /boot/vmlinuz-5.15.0-53-generic => 5.15.0-53-generic (buildd@lcy02-amd64-047) #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 [5.15.0-53-generic]*
[Kernel/Linux] /boot/vmlinuz-5.15.0-52-generic => 5.15.0-52-generic (buildd@lcy02-amd64-032) #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 [5.15.0-52-generic]
[Kernel/Linux] /boot/vmlinuz => 5.15.0-53-generic (buildd@lcy02-amd64-047) #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 [5.15.0-53-generic]*
[Kernel/Linux] Expected linux version: 5.15.0-53-generic
NEEDRESTART-KCUR: 5.15.0-53-generic
NEEDRESTART-KEXP: 5.15.0-53-generic
NEEDRESTART-KSTA: 1
NEEDRESTART-UCSTA: 1
NEEDRESTART-UCCUR: 0x002f
NEEDRESTART-UCEXP: 0x002f
NEEDRESTART-SVC: wsdd.service

Processes mentioned above:

$ ps -auxw | grep 586
root         586  0.0  0.2  35360 18920 ?        Ss   Nov24   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
$ ps -auxw | grep 4478
wsdd        4478  0.0  0.3  38356 26972 ?        Ss   Nov25   0:00 python3 /usr/sbin/wsdd --shortlog --chroot=/run/wsdd -n XXX -w XXXX -v
$ pstree -A -a -g -S -s 586
systemd,1
  `-networkd-dispat,586 /usr/bin/networkd-dispatcher --run-startup-triggers
$ pstree -A -a -g -S -s 4478
systemd,1
  `-python3,4478 /usr/sbin/wsdd --shortlog --chroot=/run/wsdd -n XXX -w XXXX -v

The wsdd systemd service unit (just to make your life easy):

$ cat /lib/systemd/system/wsdd.service
[Unit]
Description=Web Services Dynamic Discovery host daemon
Documentation=man:wsdd(8)
; Start after the network has been configured
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
; The service is put into an empty runtime directory chroot,
; i.e. the runtime directory which usually resides under /run
EnvironmentFile=/etc/default/wsdd
ExecStart=/usr/sbin/wsdd --shortlog --chroot=/run/wsdd $WSDD_PARAMS
DynamicUser=yes
User=wsdd
Group=wsdd
RuntimeDirectory=wsdd
AmbientCapabilities=CAP_SYS_CHROOT

[Install]
WantedBy=multi-user.target
liske commented 1 year ago

Hi,

thanks for submitting all required details:

[Python] #4478: source=/usr/sbin/wsdd
[Core] #4478 uses obsolete script file(s):
[Core] #4478  /usr/sbin/wsdd

This triggers needrestart's interpreter support on wsdd. Can you please check if the file is changed after starting wsdd? Maybe some auto update feature (just guessing)?

koitsu commented 1 year ago

File isn't changed. Some other services show up now when using needrestart because I recently did updates but did not reboot:

$ sudo systemctl status wsdd
● wsdd.service - Web Services Dynamic Discovery host daemon
     Loaded: loaded (/lib/systemd/system/wsdd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-12-01 02:10:42 PST; 5 days ago
       Docs: man:wsdd(8)
   Main PID: 46045 (python3)
      Tasks: 1 (limit: 9269)
     Memory: 16.1M
        CPU: 598ms
     CGroup: /system.slice/wsdd.service
             └─46045 python3 /usr/sbin/wsdd --shortlog --chroot=/run/wsdd -n XXX -w XXXX -v
...
$ sudo stat /usr/sbin/wsdd
  File: /usr/sbin/wsdd
  Size: 63625           Blocks: 128        IO Block: 4096   regular file
Device: 802h/2050d      Inode: 7734518     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-12-05 18:13:55.518057725 -0800
Modify: 2022-02-10 15:34:36.000000000 -0800
Change: 2022-11-19 16:15:13.015997190 -0800
 Birth: 2022-11-19 16:15:13.007997285 -0800
$ sudo systemctl stop wsdd
$ sudo stat /usr/sbin/wsdd
  File: /usr/sbin/wsdd
  Size: 63625           Blocks: 128        IO Block: 4096   regular file
Device: 802h/2050d      Inode: 7734518     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-12-05 18:13:55.518057725 -0800
Modify: 2022-02-10 15:34:36.000000000 -0800
Change: 2022-11-19 16:15:13.015997190 -0800
 Birth: 2022-11-19 16:15:13.007997285 -0800
$ sudo systemctl start wsdd
$ sudo systemctl status wsdd
● wsdd.service - Web Services Dynamic Discovery host daemon
     Loaded: loaded (/lib/systemd/system/wsdd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-12-06 14:58:00 PST; 6min ago
       Docs: man:wsdd(8)
   Main PID: 56331 (python3)
      Tasks: 1 (limit: 9269)
     Memory: 16.0M
        CPU: 241ms
     CGroup: /system.slice/wsdd.service
             └─56331 python3 /usr/sbin/wsdd --shortlog --chroot=/run/wsdd -n XXX -w XXXX -v
...
$ sudo stat /usr/sbin/wsdd
  File: /usr/sbin/wsdd
  Size: 63625           Blocks: 128        IO Block: 4096   regular file
Device: 802h/2050d      Inode: 7734518     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-12-05 18:13:55.518057725 -0800
Modify: 2022-02-10 15:34:36.000000000 -0800
Change: 2022-11-19 16:15:13.015997190 -0800
 Birth: 2022-11-19 16:15:13.007997285 -0800
$ sudo needrestart -b -vvvv
[main] eval /etc/needrestart/needrestart.conf
[main] needrestart v3.5
[main] running in root mode
[main] systemd detected
NEEDRESTART-VER: 3.5
[main] #583 uses deleted /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7200.1
[main] #583 is not a child
[main] #586 uses deleted /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.7200.1
[main] #586 is not a child
[main] #595 uses deleted /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.13
[main] #595 is not a child
[Core] #56331 is a NeedRestart::Interp::Python
[Python] #56331: source=/usr/sbin/wsdd
[Core] #56331 uses obsolete script file(s):
[Core] #56331  /usr/sbin/wsdd
[main] #56331 is not a child
[main] #583 exe => /usr/sbin/irqbalance
[main] trying systemctl status
[main] #583 is irqbalance.service
[main] #586 exe => /usr/bin/python3.10
[Core] #586 is a NeedRestart::Interp::Python
[Core] #586 source is /usr/bin/networkd-dispatcher
[main] trying systemctl status
[main] #586 is networkd-dispatcher.service
[main] #595 exe => /usr/sbin/thermald
[main] trying systemctl status
[main] #595 is thermald.service
[main] #56331 exe => /usr/bin/python3.10
[Core] #56331 is a NeedRestart::Interp::Python
[Core] #56331 source is /usr/sbin/wsdd
[main] trying systemctl status
[main] #56331 is wsdd.service
[ucode] using NeedRestart::uCode::Intel
[ucode] using NeedRestart::uCode::AMD
[uCode/Intel] #0 current revision: 0x002f
+ iucode_tool --scan-system
+ grep -oE [^[:space:]]+$
+ sig=0x000306d4
+ [ -r /sys/devices/system/cpu/cpu0/microcode/processor_flags ]
+ cat /sys/devices/system/cpu/cpu0/microcode/processor_flags
+ filter=-s 0x000306d4,0x40
+ test -r /etc/needrestart/iucode.sh
+ . /etc/needrestart/iucode.sh
+ type bsdtar
+ IUCODE_TOOL_EXTRA_OPTIONS=
+ test -r /etc/default/intel-microcode
+ . /etc/default/intel-microcode
+ test  = no
+ [ -r /usr/share/misc/intel-microcode* ]
+ iucode_tool -l -s 0x000306d4,0x40 --ignore-broken -tb /lib/firmware/intel-ucode
+ grep 0x000306d4
[uCode/Intel] #0 available revision: 0x002f
[uCode/AMD] #0 cpu vendor id mismatch
[Kernel] Linux: kernel release 5.15.0-53-generic, kernel version #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022
[Kernel/Linux] /boot/vmlinuz.old => 5.15.0-53-generic (buildd@lcy02-amd64-047) #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 [5.15.0-53-generic]*
[Kernel/Linux] /boot/vmlinuz-5.15.0-56-generic => 5.15.0-56-generic (buildd@lcy02-amd64-004) #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 [5.15.0-56-generic]
[Kernel/Linux] /boot/vmlinuz-5.15.0-53-generic => 5.15.0-53-generic (buildd@lcy02-amd64-047) #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 [5.15.0-53-generic]*
[Kernel/Linux] /boot/vmlinuz => 5.15.0-56-generic (buildd@lcy02-amd64-004) #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 [5.15.0-56-generic]
[Kernel/Linux] Expected linux version: 5.15.0-56-generic
NEEDRESTART-KCUR: 5.15.0-53-generic
NEEDRESTART-KEXP: 5.15.0-56-generic
NEEDRESTART-KSTA: 3
NEEDRESTART-UCSTA: 1
NEEDRESTART-UCCUR: 0x002f
NEEDRESTART-UCEXP: 0x002f
NEEDRESTART-SVC: irqbalance.service
NEEDRESTART-SVC: networkd-dispatcher.service
NEEDRESTART-SVC: thermald.service
NEEDRESTART-SVC: wsdd.service

Filesystem is not mounted in any particularly strange way either (if it was, it would affect more than just wsdd):

$ mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=3954876k,nr_inodes=988719,mode=755,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=802340k,mode=755,inode64)
/dev/sda2 on / type ext4 (rw,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=20552)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
none on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=802340k,nr_inodes=200585,mode=700,uid=1000,gid=1000,inode64)

I also have things like Ubuntu's "background updater" turned off in numerous ways:

systemctl mask --now apt-daily-upgrade.timer
systemctl reset-failed apt-daily-upgrade.timer
systemctl mask --now apt-daily.timer
systemctl reset-failed apt-daily.timer
cat > /etc/apt/apt.conf <<- 'EOF'
APT::Periodic::Enable "0";
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
Dir::Log::Planner "/dev/null";
EOF
rm -f /var/log/apt/eipp.*

Finally, system's clock is in sync via NTP and so on, so clock drift/etc. wouldn't cause this either.

Any other ideas?

koitsu commented 1 year ago

One thing I noticed, purely as a comparison between wsdd and networkd-dispatcher:

[Core] #56331 is a NeedRestart::Interp::Python
[Python] #56331: source=/usr/sbin/wsdd
[Core] #56331 uses obsolete script file(s):
[Core] #56331  /usr/sbin/wsdd
[main] #56331 is not a child
...
[main] #586 exe => /usr/bin/python3.10
[Core] #586 is a NeedRestart::Interp::Python
[Core] #586 source is /usr/bin/networkd-dispatcher
[main] trying systemctl status
[main] #586 is networkd-dispatcher.service
...

$ head -2 /usr/bin/networkd-dispatcher
#! /usr/bin/python3
# networkd-dispatcher
$ head -2 /usr/sbin/wsdd
#!/usr/bin/env python3

Note there is no exe => line for wsdd. Is using the /usr/bin/env method somehow confusing needrestart?

liske commented 1 year ago

The /proc/$PID/exe link should be the same since env is doing a exec after finding the python3 binary.

The interpreter thing does the following:

Are there any namespaces involved? This might break the startup time detection.

koitsu commented 1 year ago

No namespaces involved here, purely standard out-of-box Ubuntu system. wsdd.service unit is provided above; I don't see any namespaces exclusively used there either.

Further comparing networkd-dispatch and wsdd:

$ sudo systemctl status networkd-dispatcher
● networkd-dispatcher.service - Dispatcher daemon for systemd-networkd
     Loaded: loaded (/lib/systemd/system/networkd-dispatcher.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-11-24 01:59:28 PST; 1 week 5 days ago
   Main PID: 586 (networkd-dispat)
      Tasks: 1 (limit: 9269)
     Memory: 10.1M
        CPU: 123ms
     CGroup: /system.slice/networkd-dispatcher.service
             └─586 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers

Random other informational things, since you mention process start time and so on:

$ sudo ls -l /proc/586/exe
lrwxrwxrwx 1 root root 0 Nov 24 01:59 /proc/586/exe -> /usr/bin/python3.10

$ getconf CLK_TCK
100
$ sudo ls -l /proc/56331/exe
lrwxrwxrwx 1 wsdd wsdd 0 Dec  6 14:58 /proc/56331/exe -> /usr/bin/python3.10
$ sudo cat /proc/56331/stat
56331 (python3) S 1 56331 56331 0 -1 4194560 4802 0 0 0 21 2 0 0 20 0 1 0 108352728 39276544 6696 18446744073709551615 94180287012864 94180289840653 140731523466096 0 0 0 0 16781312 16386 1 0 0 17 3 0 0 0 0 0 94180292196496 94180292480216 94180324515840 140731523468942 140731523469021 140731523469021 140731523469287 0
$ sudo cat /proc/56331/stat | awk '{ print $22 }'
108352728
$ expr 108352728 / 100
1083527
$ sudo systemctl restart wsdd
$ sudo systemctl status wsdd | grep 'Main PID'
   Main PID: 56478 (python3)
$ sudo cat /proc/56478/stat
56478 (python3) S 1 56478 56478 0 -1 4194560 4805 0 0 0 21 3 0 0 20 0 1 0 108498139 39276544 6722 18446744073709551615 94649767342080 94649770169869 140733399790528 0 0 0 0 16781312 16386 1 0 0 17 3 0 0 0 0 0 94649772525712 94649772809432 94649789870080 140733399793294 140733399793373 140733399793373 140733399793639 0
$ sudo cat /proc/56478/stat | awk '{ print $22 }'
108498139
$ expr 108498139 / 100
1084981
koitsu commented 1 year ago

I've done Perl most of my life, so I'll try to reverse-engineer needrestart and figure out what is going on here.

koitsu commented 1 year ago

Using the version in GitHub master branch (version 3.6), which is newer than what Ubuntu 22.04 LTS has (version 3.5), but same result.

Debugging hackery:

diff --git a/perl/lib/NeedRestart.pm b/perl/lib/NeedRestart.pm
index 117bb13..99f8701 100644
--- a/perl/lib/NeedRestart.pm
+++ b/perl/lib/NeedRestart.pm
@@ -178,14 +178,31 @@ sub needrestart_interp_check($$$$$) {
            }

            if(grep {!defined($_) || $_ > $ps->start + $tolerance} values %files) {
                if($debug) {
                    print STDERR "$LOGPREF #$pid uses obsolete script file(s):";
                    print STDERR join("\n$LOGPREF #$pid  ", '', map {(!defined($files{$_}) || $files{$_} > $ps->start ? $_ : ())} keys %files);
                    print STDERR "\n";
+                   print STDERR "=== DEBUG DUMP START ===\n";
+                   use Data::Dumper;
+                   $Data::Dumper::Sortkeys = 1;
+                   print STDERR "debug       = ", $debug, "\n";
+                   print STDERR "pid         = ", $pid, "\n";
+                   print STDERR "bin         = ", $bin, "\n";
+                   print STDERR "tolerance   = ", $tolerance, "\n";
+                   print STDERR "ps->start   = ", $ps->start, "\n";
+                   print STDERR "start+tol   = ", $ps->start + $tolerance, "\n";
+                   print STDERR "blacklist   = ", Dumper(\@{$blacklist});
+                   print STDERR "files       = ", Dumper(\%files);
+                   print STDERR "values files = ", Dumper(values %files);
+                   print STDERR "interp      = ", Dumper(\%{$interp});
+                   print STDERR "InterpCache = ", Dumper(\%InterpCache);
+                   print STDERR "values InterpCache = ", Dumper(values %InterpCache);
+                   print STDERR "ps          = ", Dumper(\%{$ps});
+                   print STDERR "=== DEBUG DUMP END ===\n";
                }

                return 1;
            }
        }
     }

Test run:

$ sudo env perl -I ${HOME}/needrestart/perl/lib ${HOME}/needrestart/needrestart -b -vvvv 2>&1
[main] eval /etc/needrestart/needrestart.conf
[main] needrestart v3.6
...
[Core] #56478 is a NeedRestart::Interp::Python
[Python] #56478: source=/usr/sbin/wsdd
[Core] #56478 uses obsolete script file(s):
[Core] #56478  /usr/sbin/wsdd
=== DEBUG DUMP START ===
debug       = 1
pid         = 56478
bin         = /usr/bin/python3.10
tolerance   = 2
ps->start   = 1670368934
start+tol   = 1670368936
blacklist   = $VAR1 = [
          qr/^\/tmp\//,
          qr/^\/var\//,
          qr/^\/run\//
        ];
files       = $VAR1 = {
          '/usr/sbin/wsdd' => undef
        };
values files = $VAR1 = undef;
interp      = $VAR1 = bless( {
                 'debug' => 1
               }, 'NeedRestart::Interp::Python' );
InterpCache = $VAR1 = {
          'files' => {
                       'NeedRestart::Interp::Python' => {
                                                          '/usr/sbin/wsdd' => {
                                                                                '/usr/sbin/wsdd' => undef
                                                                              }
                                                        }
                     }
        };
values InterpCache = $VAR1 = {
          'NeedRestart::Interp::Python' => {
                                             '/usr/sbin/wsdd' => {
                                                                   '/usr/sbin/wsdd' => undef
                                                                 }
                                           }
        };
ps          = $VAR1 = bless( {
                 'cmajflt' => '0',
                 'cmdline' => [
                                'python3',
                                '/usr/sbin/wsdd',
                                '--shortlog',
                                '--chroot=/run/wsdd',
                                '-n',
                                'XXX',
                                '-w',
                                'XXXX',
                                '-v'
                              ],
                 'cminflt' => '0',
                 'cmndline' => 'python3 /usr/sbin/wsdd --shortlog --chroot=/run/wsdd -n XXX -w XXXX -v',
                 'cstime' => '0',
                 'ctime' => '0',
                 'cutime' => '0',
                 'cwd' => '/run/wsdd',
                 'egid' => 64757,
                 'environ' => [
                                'LANG=en_US.UTF-8',
                                'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin',
                                'LOGNAME=wsdd',
                                'USER=wsdd',
                                'INVOCATION_ID=22b7cebadb424ec99ef5627e469db560',
                                'JOURNAL_STREAM=8:172554',
                                'RUNTIME_DIRECTORY=/run/wsdd',
                                'SYSTEMD_EXEC_PID=56478',
                                'WSDD_PARAMS=-n XXX -w XXXX -v'
                              ],
                 'euid' => 64757,
                 'exec' => '/usr/bin/python3.10',
                 'fgid' => 64757,
                 'flags' => '4194560',
                 'fname' => 'python3',
                 'fuid' => 64757,
                 'gid' => 64757,
                 'majflt' => '0',
                 'minflt' => '4805',
                 'pctcpu' => '  0.01',
                 'pctmem' => '0.34',
                 'pgrp' => 56478,
                 'pid' => 56478,
                 'ppid' => 1,
                 'priority' => 20,
                 'rss' => '27533312',
                 'sess' => 56478,
                 'sgid' => 64757,
                 'size' => '39276544',
                 'start' => '1670368934',
                 'state' => 'sleep',
                 'stime' => '30000',
                 'suid' => 64757,
                 'time' => '240000',
                 'tracer' => 0,
                 'ttydev' => '',
                 'ttynum' => 0,
                 'uid' => 64757,
                 'utime' => '210000',
                 'wchan' => '1'
               }, 'Proc::ProcessTable::Process' );
=== DEBUG DUMP END ===

Analysis: %files hash contains a k/v pair, key of the script pathname, value of undef. It is the value which triggers the false positive (restart service), because if(grep {!defined($_) || $_ > $ps->start + $tolerance} values %files) covers this situation (the !defined($_) is what's causing it).

I'm incredibly confused by what this statement is trying to do. It's iterating over all the values in %files (so, a single undef value). The grep conditional will then match true on !defined($_) (because value undef); the $_ > $ps->start + $tolerance test isn't executed (due to the || operator) -- but even it were, if (undef > 1670368936) is still valid (and non-warning-emitting) Perl code; it proves false.

The one thing I have yet to answer is: should %files hash actually have a valid number instead of undef? Is that the bug here? Or is that normal considering this situation?

Clearly further analysis is needed, but seeing this overall logic path is making me question what the intention is/was.

Commit 2c74dc03 recently touched this code (to add support for tolerance), but does not appear to be at fault. We must go back further using git blame 2c74dc03^ NeedRestart.pm. Lo and behold we find adb008eb:

commit adb008eb7a103ddb1fcfffe4acfdca615cc566c4
Author: Thomas Liske <thomas@fiasko-nw.net>
Date:   Tue Dec 9 23:47:30 2014 +0100

    Handle script files without timestamp as obsolete files.

diff --git a/perl/lib/NeedRestart.pm b/perl/lib/NeedRestart.pm
index ec55ba9..84d88d3 100644
--- a/perl/lib/NeedRestart.pm
+++ b/perl/lib/NeedRestart.pm
@@ -147,10 +147,10 @@ sub needrestart_interp_check($$$) {
            my $ps = nr_ptable_pid($pid);
            my %files = $interp->files($pid);

-           if(grep {$_ > $ps->start} values %files) {
+           if(grep {!defined($_) || $_ > $ps->start} values %files) {
                if($debug) {
                    print STDERR "$LOGPREF #$pid uses obsolete script file(s):";
-                   print STDERR join("\n$LOGPREF #$pid  ", '', map {($files{$_} > $ps->start ? $_ : ())} keys %files);
+                   print STDERR join("\n$LOGPREF #$pid  ", '', map {(!defined($files{$_}) || $files{$_} > $ps->start ? $_ : ())} keys %files);
                    print STDERR "\n";
                }

"Handle script files without timestamp"... and why would this script not have a timestamp? :) What is meant here by "timestamp"? Actual stat(2) file timestamp? And if so, what field? I provided stat(1) output earlier to demonstrate that the file is not changing. But again, I am left wondering why this %files hash has an undef value when, ideally, it should have something.

I tried turning the hashbang line of /usr/sbin/wsdd into #!/usr/bin/python3 just to see if there was any improvement, but as you kind of eluded to, the answer was no.

I suspect I will need to look into what NeedRestart/Interp/Python.pm files() function is doing under the hood.

koitsu commented 1 year ago

OK, I've figured out the cause. It is indeed in NeedRestart/Interp/Python.pm. First, more debugging hackery:

diff --git a/perl/lib/NeedRestart/Interp/Python.pm b/perl/lib/NeedRestart/Interp/Python.pm
index e4be37c..fcea534 100644
--- a/perl/lib/NeedRestart/Interp/Python.pm
+++ b/perl/lib/NeedRestart/Interp/Python.pm
@@ -206,6 +206,21 @@ sub files {
     my ($path) = <$pyread>;
     close($pyread);

+    print STDERR "=== DEBUG NeedRestart/Interp/Python.pm DUMP START ===\n";
+    use Data::Dumper;
+    $Data::Dumper::Sortkeys = 1;
+    print STDERR "pid        = ", $pid, "\n";
+    print STDERR "bin        = ", $bin, "\n";
+    print STDERR "opts       = ", Dumper(\%opts);
+    print STDERR "e          = ", Dumper(\%e);
+    print STDERR "ENV        = ", Dumper(\%ENV);
+    print STDERR "src        = ", $src, "\n";
+    print STDERR "cwd        = ", $cwd, "\n";
+    print STDERR "getcwd()   = ", getcwd(), "\n";
+    print STDERR "pyread     = ", $pyread, "\n";
+    print STDERR "pywrite    = ", $pywrite, "\n";
+    print STDERR "path orig  = ", $path, "\n";
+
     # look for module source files
     my @path;
     if(defined($path)) {
@@ -221,13 +236,24 @@ sub files {
     my %files;
     _scan($self->{debug}, $pid, $src, \%files, \@path);

+    print STDERR "path after = ", Dumper(\@path);
+    print STDERR "files      = ", Dumper(\%files);
+
     my %ret = map {
        my $stat = nr_stat("/proc/$pid/root/$_");
+       print STDERR "_          = ", $_, "\n";
+       print STDERR "stat       = ", $stat, "\n";
+       print STDERR "nr_statpath= /proc/$pid/root/$_\n";
+       print STDERR `ls -l /proc/$pid/root/$_`;
        $_ => ( defined($stat) ? $stat->{ctime} : undef );
     } keys %files;

+    print STDERR "ret        = ", Dumper(\%ret);
+
     chdir($cwd);

+    print STDERR "=== DEBUG NeedRestart/Interp/Python.pm DUMP END ===\n";
+
     $cache->{files}->{(__PACKAGE__)}->{$src} = \%ret;
     return %ret;
 }

Test results:

[Core] #58517 is a NeedRestart::Interp::Python
[Python] #58517: source=/usr/sbin/wsdd
=== DEBUG NeedRestart/Interp/Python.pm DUMP START ===
pid        = 58517
bin        = /usr/bin/python3.10
opts       = $VAR1 = {};
e          = $VAR1 = {
          'INVOCATION_ID' => '1ecfdf69a76b4b9fa7213a3a823912de',
          'JOURNAL_STREAM' => '8:179658',
          'LANG' => 'en_US.UTF-8',
          'LOGNAME' => 'wsdd',
          'PATH' => '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin',
          'RUNTIME_DIRECTORY' => '/run/wsdd',
          'SYSTEMD_EXEC_PID' => '58517',
          'USER' => 'wsdd',
          'WSDD_PARAMS' => '-n XXX -w XXXX -v'
        };
ENV        = $VAR1 = {};
src        = /usr/sbin/wsdd
cwd        = /home/jdc/needrestart/perl/lib
getcwd()   = /home/jdc/needrestart/perl/lib
pyread     = GLOB(0x556f1bd541b0)
pywrite    = GLOB(0x556f1bd54120)
path orig  = ['', '/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '/usr/local/lib/python3.10/dist-packages', '/usr/lib/python3/dist-packages']

path after = $VAR1 = [
          '/proc/58517/root/',
          '/proc/58517/root//usr/lib/python310.zip',
          '/proc/58517/root//usr/lib/python3.10',
          '/proc/58517/root//usr/lib/python3.10/lib-dynload',
          '/proc/58517/root//usr/local/lib/python3.10/dist-packages',
          '/proc/58517/root//usr/lib/python3/dist-packages\']'
        ];
files      = $VAR1 = {
          '/usr/sbin/wsdd' => 1
        };
_          = /usr/sbin/wsdd
stat       = Use of uninitialized value $stat in print at /home/jdc/needrestart/perl/lib/NeedRestart/Interp/Python.pm line 245.

nr_statpath= /proc/58517/root//usr/sbin/wsdd
ls: cannot access '/proc/58517/root//usr/sbin/wsdd': No such file or directory
ret        = $VAR1 = {
          '/usr/sbin/wsdd' => undef
        };

Now we're getting somewhere. It looks like nr_stat() returns undef when handed /proc/58517/root//usr/sbin/wsdd. But why?

$ sudo ls -l /proc/58517
total 0
...
lrwxrwxrwx  1 wsdd wsdd 0 Dec  6 16:56 cwd -> /run/wsdd
-r--------  1 wsdd wsdd 0 Dec  6 16:56 environ
lrwxrwxrwx  1 wsdd wsdd 0 Dec  6 16:56 exe -> /usr/bin/python3.10
...
lrwxrwxrwx  1 wsdd wsdd 0 Dec  6 16:56 root -> /run/wsdd
...
$ sudo ls -ld /run/wsdd
drwxr-xr-x 2 wsdd wsdd 40 Dec  6 16:56 /run/wsdd
$ sudo ls -la /run/wsdd
total 0
drwxr-xr-x  2 wsdd wsdd  40 Dec  6 16:56 .
drwxr-xr-x 28 root root 780 Dec  6 17:05 ..

Now we circle all the way back to my original report, which contained the wsdd systemd unit, ExecStart line:

ExecStart=/usr/sbin/wsdd --shortlog --chroot=/run/wsdd $WSDD_PARAMS

I strongly suspect if the chroot is removed, things will behave. Let's modify the systemd unit, daemon-reload, and find out:

$ sudo systemctl daemon-reload
$ sudo systemctl stop wsdd
$ sudo systemctl start wsdd
$ sudo systemctl status wsdd
● wsdd.service - Web Services Dynamic Discovery host daemon
     Loaded: loaded (/lib/systemd/system/wsdd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-12-06 17:37:12 PST; 4s ago
       Docs: man:wsdd(8)
   Main PID: 59540 (python3)
      Tasks: 1 (limit: 9269)
     Memory: 16.0M
        CPU: 245ms
     CGroup: /system.slice/wsdd.service
             └─59540 python3 /usr/sbin/wsdd --shortlog -n XXX -w XXXX -v
...

And bingo: no longer does needrestart want to try and restart wsdd.

Summary: needrestart does not handle chroot'd processes very well. There are some blind assumptions being made in this code about what /proc/PID/root will contain, and the end result is a false positive for a service restart. I do not know how to detect this situation outside of the chroot itself (if you were inside of a chroot, you can detect it). Parsing command line arguments from /proc/PID/cmdline seems like a bad idea, and relying on /proc/PID/root or /prod/PID/cwd or /proc/PID/mount* information also seems error-prone and wasteful.

You might have dealt with this problem with non-interpreted processes using chroots already, as many daemons behave this way now?

koitsu commented 1 year ago

systemd unit says AmbientCapabilities=CAP_SYS_CHROOT as well. Per capabilities(7) man page:

   CAP_SYS_CHROOT
         * Use chroot(2);
         * change mount namespaces using setns(2).

... The /proc/[pid]/task/TID/status file can be used to view the capability sets of a thread. The /proc/[pid]/status file shows the capability sets of a process's main thread. Before Linux 3.8, nonexistent capabilities were shown as being enabled (1) in these sets. Since Linux 3.8, all nonexistent capabilities (above CAP_LAST_CAP) are shown as disabled (0).

So sounds like parsing /proc/PID/status CapEff field might do the trick:

$ sudo cat /proc/60115/status | grep ^Cap
CapInh: 0000000000040000
CapPrm: 0000000000040000
CapEff: 0000000000040000
CapBnd: 000001ffffffffff
CapAmb: 0000000000040000

When the chroot is in place, bit 18 (19th bit) is set (0x40000 = %1000000000000000000). The bits are defined per the kernel include file linux/capability.h which you can read here: https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h#L228-L232 . Bitwise masks are no problem for Perl, so a simple if (CapEff & 0x40000) { chroot_in_place } else { no_chroot }.

koitsu commented 1 year ago

I did some more research on this, re: detecting chroots. My previous proposal of checking CapEff will not work. Please do not implement it that way. :)

"Classic" chroots -- meaning, ones which do not set any capabilities(7) -- will not have /proc/PID/status CapEff entries that show bit 18 set exclusively. Instead they'll have the same value as what's shown in CapBnd (which tends to be 0x1ffffffffff; per the man page, this is normal/expected). Non-chroot'd processes behave identically to this. CapAmb for processes in "classic" chroots, and non-chroots, will be 0x0.

Here's an example, where I've set up /tmp/foo to contain libs/etc. that /bin/bash and /bin/ls rely on so I can use it as a chroot:

$ sudo chroot /tmp/foo /bin/bash
/ #

While outside of the chroot in another window, I look for the related processes of the chroot and see what their Caps are:

$ ps -auxw | grep chroot
root       62394  0.0  0.0  14108  5784 pts/0    S    23:41   0:00 sudo chroot /tmp/foo /bin/bash
$ pstree -A -p 62394
sudo(62394)---bash(62395)
$ sudo cat /proc/62394/status | grep ^Cap
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
$ sudo cat /proc/62395/status | grep ^Cap
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

That said: checking bit 18 of CapAmb would be better than nothing in situations like this, since it does accurately reflect the situation for services/programs which explicitly set it (like the wsdd systemd unit does). So while it doesn't solve the problem of detecting all chroot situations, it at least should work for cases like the one in this ticket.

A quick review of my system turns up only this process utilising this feature. My grep isn't doing bitmasking to only check bit 18, so it's error-prone, but it's still good enough. Additionally I check what CapAmb is of ALL the processes on my system to see what kind of variance there is. To me, it looks like doing a CapAmb check of bit 18 should be better than nothing.

$ for i in /proc/[0-9]*/status; do sudo grep -H -B4 '^CapAmb:.*40000' $i; done
/proc/60115/status-CapInh:      0000000000040000
/proc/60115/status-CapPrm:      0000000000040000
/proc/60115/status-CapEff:      0000000000040000
/proc/60115/status-CapBnd:      000001ffffffffff
/proc/60115/status:CapAmb:      0000000000040000
$ ps -auxw | grep 60115
wsdd       60115  0.0  0.3  38356 26456 ?        Ss   Dec06   0:00 python3 /usr/sbin/wsdd --shortlog --chroot=/run/wsdd -n XXX -w XXXX -v

$ (for i in /proc/[0-9]*/status; do sudo grep '^CapAmb:' $i; done) | sort | uniq -c
    132 CapAmb: 0000000000000000
      1 CapAmb: 0000000000002000
      1 CapAmb: 0000000000003c00
      1 CapAmb: 0000000000040000
      1 CapAmb: 0000000002000000
liske commented 1 year ago

IMHO taking the assumption that /proc/$PID/root points to the current's process fs view is correct, although the result of readlink might not tell what the process is currently accessing since it is just a pseudo symlink. But traversing through /proc/$PID/root does give access to the process fs view.

Sadly that does not help since the chroot/namespace operation can be done at different points in time:

The chroot/namespace thing is a problem fore a long time, there were several bug reports/issues due to postfix running in a chroot. I'm afraid there is no perfect solution that covers all cases.

koitsu commented 1 year ago

Understood. I've done my best to propose a solution (examining CapAmb for processes which meet this criteria), but for now I will simply use $nrconf{blacklist_rc} = [ qr(^wsdd\.service) ]; and call it a day. We can close this ticket out if you feel it worthy of closing.