ncsa / puppet-profile_update_os

NCSA Common Puppet Profiles - configure functionality for upgrading OS packages

kernel_upgrade.sh script can skip reboots even when it should #28

Open bsper2 opened 4 months ago

bsper2 commented 4 months ago

The kernel_upgrade.sh script has three bugs and one minor edge case, each of which can cause a system to skip a reboot when we'd normally want it to reboot.

Bug 1

PKG_UPDATES_REQ_REBOOT

LASTREBOOTDATE=`last reboot | head -1 | awk '{ print $5  " "  $6  " "  $7 }'`
LASTBOOTDATE=`date -d "$LASTREBOOTDATE" +"%d %b %Y"`
TODAYDATE=`date +"%d %b %Y"`
PKG_UPDATES_REQ_REBOOT=`rpm -qa --last | grep -B1000 "$LASTBOOTDATE" | grep -v "$LASTBOOTDATE" | egrep -i "$PKGS_REQ_REBOOT" |  wc -l`

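The issue is that PKG_UPDATES_REQ_REBOOT will be 0 whenever the node was last rebooted on a day when no packages were installed. Specifically, the problem is this part: rpm -qa --last | grep -B1000 "$LASTBOOTDATE". If no package has an install date matching $LASTBOOTDATE, grep matches nothing and prints nothing, so every package installed since the reboot is dropped from the count. On stateless nodes it would be extremely uncommon for this to be a problem, but on stateful nodes this can definitely happen.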
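One possible way to avoid depending on a date string appearing in the rpm output is to compare install timestamps against the boot time numerically. The following is only a sketch, not the script's current logic: count_pkgs_since is a hypothetical helper name, and it assumes the boot time is taken from the btime field of /proc/stat and install times come from rpm's INSTALLTIME query tag (both are epoch seconds).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kernel_upgrade.sh).
# Reads lines of "<install_epoch> <package>" on stdin and prints how many
# packages were installed after the given epoch. An optional second argument
# is an extended regex (case-insensitive) applied to the package name.
count_pkgs_since() {
    local since_epoch="$1"
    local pattern="${2:-.}"   # default "." matches every package name
    awk -v since="$since_epoch" '$1 > since { print $2 }' \
        | grep -Eic -- "$pattern" || true   # grep -c exits 1 on a count of 0
}

# Possible usage on a live system (assumes Linux /proc and an rpm-based OS):
#   BOOT_EPOCH=$(awk '/^btime/ { print $2 }' /proc/stat)
#   rpm -qa --queryformat '%{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' \
#       | count_pkgs_since "$BOOT_EPOCH" "$PKGS_REQ_REBOOT"
```

Because the comparison is numeric, the count no longer depends on a package having been installed on the same calendar day as the last reboot.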

Bug 2

PKG_UPDATES

PKG_UPDATES=`rpm -qa --last | grep -B1000 "$LASTBOOTDATE" | grep -v "$LASTBOOTDATE" | wc -l`

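PKG_UPDATES has the same issue as PKG_UPDATES_REQ_REBOOT: it relies on the same grep -B1000 "$LASTBOOTDATE" match, so it is also 0 when no packages were installed on the day of the last reboot.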

Bug 3

KERNEL_UPDATES_TODAY

KERNEL_UPDATES_TODAY=`rpm -qa --last | grep -B1000 "$TODAYDATE" | egrep -i 'kernel-[0-9]' | wc -l`

We set KERNEL_UPDATES_TODAY as above, and then later use this condition to see if a reboot is needed:

[[ $((KERNEL_UPDATES_TODAY)) -gt 1 ]]

The issue is that we filter with egrep -i 'kernel-[0-9]'. For a typical kernel update this matches only one package, so KERNEL_UPDATES_TODAY equals 1 and the -gt 1 condition is false. The installed packages look something like this:

kernel-4.18.0-553.5.1.el8_10.x86_64             # Matches regex and is counted
kernel-modules-4.18.0-553.5.1.el8_10.x86_64     # Does not get counted
kernel-core-4.18.0-553.5.1.el8_10.x86_64        # Does not get counted

So we probably want the condition to be something like this instead:

[[ $((KERNEL_UPDATES_TODAY)) -ge 1 ]]
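To illustrate the difference between the two conditions, here is a small sketch that runs the same regex over the three example package names above. The count_kernel_pkgs function and the sample variable are hypothetical names used only for this demonstration; the real script pipes rpm -qa --last output instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count stdin lines matching the script's kernel regex.
count_kernel_pkgs() {
    grep -Eic 'kernel-[0-9]' || true   # grep -c exits 1 on a count of 0
}

# The three package names from a typical kernel update.
sample='kernel-4.18.0-553.5.1.el8_10.x86_64
kernel-modules-4.18.0-553.5.1.el8_10.x86_64
kernel-core-4.18.0-553.5.1.el8_10.x86_64'

KERNEL_UPDATES_TODAY=$(printf '%s\n' "$sample" | count_kernel_pkgs)

# Current condition: only true when two or more lines match.
if [[ $((KERNEL_UPDATES_TODAY)) -gt 1 ]]; then gt_result="reboot"; else gt_result="no reboot"; fi
# Proposed condition: true as soon as one line matches.
if [[ $((KERNEL_UPDATES_TODAY)) -ge 1 ]]; then ge_result="reboot"; else ge_result="no reboot"; fi

echo "count=$KERNEL_UPDATES_TODAY  -gt 1: $gt_result  -ge 1: $ge_result"
```

With this sample, only the first line matches, so the count is 1: the -gt 1 check reports no reboot while the -ge 1 check reports a reboot.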

Edge Case 1

PKG_UPDATES_TODAY

This one is definitely an edge case rather than a bug, but I want to mention it since it caused issues in SVC-24690. As long as we fix the three bugs above, leaving this edge case unchanged is probably not a big deal.

If packages were updated a day or more before the scheduled run of this script, the script won't trigger an automatic reboot when the scheduled run itself updates fewer than 5 packages (that threshold can be changed on the CLI). Again, this part of the script works as intended, so it's probably not a huge deal.