quattor / ncm-cdispd

Node Configuration Manager Configuration Dispatch Daemon
www.quattor.org
Other
3 stars 6 forks source link

Postinstall script restarts service, which re-runs ncm-ncd part-way through an upgrade. #50

Closed jrha closed 7 years ago

jrha commented 7 years ago

ncm-cdispd is restarted by the post-install script during an upgrade, at the point at which the restart happens the system is in an inconsistent state, where some packages have not yet been upgraded — cue scary "bad Perl code" warnings.

Here are sanitised logs for sl6x and sl7x VMs showing the problem.

As mentioned in quattor/release#297.

stdweird commented 7 years ago

to be clear: the sl6 logs shows a different issue as the sl7 logs. i knew about the sl6 one, i didn't about the sl7 (or i always assume the sl7 error was the sl6 one).

the sl7 error is a very strange beast (is it reading a partially updated file? how would that even be possible...), but is probably due to systemd handling the restart via ncm-chkconfig as opposed to the rpm postinstall via chkconfig

jrha commented 7 years ago

Also to be clear, the system eventually gets updated correctly and works, but it's a confusing mess in the middle and it scares me to think what might happen if something got stuck.

stdweird commented 7 years ago

@jrha we deployed 17.3.0 release rpms, and are hit by permament broken ncm-cdispd.

jrha commented 7 years ago

:fearful:

jrha commented 7 years ago

It would be really good to understand what happened, can you tell why ncm-cdispd is broken? It was functioning correctly after the upgrade on the test hosts we used.

stdweird commented 7 years ago

can you send me a md5sum of the /usr/sbin/ncm-cdispd? and rpm -qa|grep perl

stdweird commented 7 years ago

@jrha on a sl7/centos7 system

stdweird commented 7 years ago

@jrha these are actual code bugs, introduced in https://github.com/quattor/ncm-cdispd/pull/48/commits/56471eb366e0a25293157c24f5bba0409e637fdf

a few things went wrong:

  1. it looks like we (at UGent i mean) did not ever test this, opposed to what i claimed. i seem to have never pushed the rpm to the repo...
  2. perlcritic doesn't see errors like this (incl the syntax error)
  3. our unittests can't load scripts, so another test path that should really capture this didn't catch this

i'm really puzzled how this can have worked at RAL.

anyway, PR coming up. either make a 17.3.1 for this or revoke the rpm from the release

jrha commented 7 years ago

The original test VMs have been destroyed, but a recently upgraded VM does indeed have a broken ncm-cdispd. Not sure how I didn't notice. :disappointed: I'm going to delete the RPM from the repo to prevent anyone else breaking their systems and issue a 17.3.1.

stdweird commented 7 years ago

Fixes are

jrha commented 7 years ago

@stdweird was this really fixed by your PR? The postinstall script is still doing confusing things is it not?

stdweird commented 7 years ago

@jrha the EL7 things should be gone. what are you still seeing?

jrha commented 7 years ago

The issue describes the restart of cdispd by the postinstall script, if that is not problematic then the issue title and description should be updated.

stdweird commented 7 years ago

afaik, the restart is not the issue