quattor / ncm-cdispd

Node Configuration Manager Configuration Dispatch Daemon
www.quattor.org
Other
3 stars 6 forks source link

Fix ncm-cdispd forgetting about failed configuration modules #6

Closed jouvin closed 10 years ago

jouvin commented 10 years ago

Fixes #4

jouvin commented 10 years ago

Too quick as usual! I just realized that in moving to CAF::Process, I made something wrong and that ncm-ncd output is no longer logged into ncm-cdispd log file. I'll fix it asap...

jouvin commented 10 years ago

Fixed!

jrha commented 10 years ago

Thanks for the fix, will merge after 14.6 final.

jouvin commented 10 years ago

No please, BEFORE. This is a very serious issue, in fact lasting since some times. We have been hitted by this several times, with machines in an inconsistent state that required manual intervention on a lot of machines...

jrha commented 10 years ago

The deadline has to fall somewhere (reminder: this release is due today), If you had found this bug before the merge freeze then I would happily merge it. A release isn't supposed to be perfect, but a known quantity, the current buggy behaviour is broken, but understood, can you guarantee that your changes don't introduce a new bug?

However, if more than site needs this fixed now I will consider merging it, this will also require that someone who understands the code-base better than me (@piojo or @stdweird) reviews this patch.

jouvin commented 10 years ago

In principle, I completely agree with you. But the release date is just our decision and discovery of a serious bug may need that we reconsider it... I'm really asking for an exception here. I agree that we need some reviews of the change before merging it... I cannot guarantee anything except that I tested it a lot on a real system with the different combination of ncm-ncd error or success. And I tried to described as much as possible my analysis of the problem to help understand the fix (which is in fact pretty simple).

jrha commented 10 years ago

Ok, 14.6 is delayed. If we can't release by Friday (2014-07-04) then it's aborted and we work on 14.7.

piojo-zz commented 10 years ago

I'll review this tonight.

lfmunozmejias@gmail.com

jrha commented 10 years ago

Thanks.

jouvin commented 10 years ago

@piojo Thanks for your comments. I have seen that there were some other inconsistencies but as we were already in the RC process I tried to do the very minimal changes required (and usage of CAF::Process instead of backquotes). I'll implement your suggestions.

jouvin commented 10 years ago

I just committed the changes suggested by @piojo. I had no time to test them yet, I plan to do it at the end of the morning. Others can give it a try, as the changes are really cosmetic compared to the last version tested.

jouvin commented 10 years ago

With the removed check about ncm-ncd existence, CAF::Process exits with an uncaught exception:

Uncaught exception!!! Calling stack is:
  LC::Exception::throw_error called at /usr/lib/perl/LC/Process.pm line 796
  LC::Process::execute called at /usr/lib/perl/LC/Process.pm line 823
  LC::Process::output called at /usr/lib/perl/CAF/Process.pm line 204
  CAF::Process::output called at /usr/sbin/ncm-cdispd line 1001
  main::launch_ncd called at /usr/sbin/ncm-cdispd line 561
  main::init_components called at /usr/sbin/ncm-cdispd line 1070
*** exec(/usr/sbin/ncm-ncd, --configure, profile, network, vomsclient, aiiserver, altlogrotate, dirperm, modprobe, dpmlfc, spma, cdp, ldconf, xrootd, iptables, ntpd, sudo, grub, ccm, hostsaccess, sysconfig, named, accounts, filecopy, mkgridmap, gip2, mysql, cron, etcservices, nrpe, chkconfig, gridmapdir, useraccess, lcgbdii, --state, /var/run/quattor-components): No such file or directory

Is it expected?

stdweird commented 10 years ago

@jouvin how did you get the stacktrace. was this on a system with ncm-ncd missing?

jouvin commented 10 years ago

For my test, I just renamed /usr/sbin/ncm-ncd on a system where everything is installed. So just the command is missing, every else is there.

stdweird commented 10 years ago

@jouvin what do expect to happen with missing executable? the presence should be forced via the packager (either shipped with the package or with a dependency set). or do you anticipate different location for the executable?

jouvin commented 10 years ago

From @piojo's comment, I thought that it was handle in some way by CAF::Process. Sorry for my misunderstanding. I'll restore the test for existence, just not exiting in case it is missing.

jouvin commented 10 years ago

Sorry for closing this by mistake... reopened!

jouvin commented 10 years ago

The last content of this PR has been EXTENSIVELY tested at GRIF, with I think all possible combinations of error conditions at ncm-cdispd startup or during the life of the daemon. I did a significant cleanup in error messages and documentation but refrained to do a full review of the code. This may be for another release/PR. Please test and comment!

jouvin commented 10 years ago

I know the identation is not compliant with our standard, this is not my fault! This script has been written mainly with a 2 space identation... Will not prevent a merge I hope...

jouvin commented 10 years ago

Apart from fixing identation, I also fixed a documented race condition that could happen if a component was present in the config and failing. If the next profile had the component removed, the component was staying on the list of component to run because of its failure at the previous run. The only case where it can still happen is:

Tested and retested ad nauseam at GRIF...

jrha commented 10 years ago

Thanks! I'll leave this for wiser gurus to merge when they are happy.

piojo-zz commented 10 years ago

Small remarks inlined in the commits. It's almost ready to merge.

jouvin commented 10 years ago

Thanks. I'm trying to do it early this morning.

jouvin commented 10 years ago

I think I addressed all @piojo comments (except the request to merge lines that I didn't identify but I think this is very minor and can be let for further cleanups!). Tested at GRIF (including a SL5 machine after last @piojo remarks). As for me, ready for merging. If a RC3 is out this afternoon, I may give it a try.

jrha commented 10 years ago

I presume @piojo meant lines 680:681, so that

         }
         else {

becomes

         } else {
piojo-zz commented 10 years ago

Yes, that what I meant.

Also, this is good to merge now. Heading to the ncm-grub PR...

piojo-zz commented 10 years ago

The ncm-grub PR was merged. Merging this one too.

jouvin commented 10 years ago

Ha! Missed it! This is the result of Eclipse reformatting. I fixed most of them but missed this one. Anyway, I'll open a new issue for 14.8 with additional code cleanups that may be useful.