Closed jouvin closed 10 years ago
Too quick as usual! I just realized that in moving to CAF::Process, I made something wrong and that ncm-ncd output is no longer logged into ncm-cdispd
log file. I'll fix it asap...
Fixed!
Thanks for the fix, will merge after 14.6 final.
No please, BEFORE. This is a very serious issue, in fact lasting since some times. We have been hitted by this several times, with machines in an inconsistent state that required manual intervention on a lot of machines...
The deadline has to fall somewhere (reminder: this release is due today), If you had found this bug before the merge freeze then I would happily merge it. A release isn't supposed to be perfect, but a known quantity, the current buggy behaviour is broken, but understood, can you guarantee that your changes don't introduce a new bug?
However, if more than site needs this fixed now I will consider merging it, this will also require that someone who understands the code-base better than me (@piojo or @stdweird) reviews this patch.
In principle, I completely agree with you. But the release date is just our decision and discovery of a serious bug may need that we reconsider it... I'm really asking for an exception here. I agree that we need some reviews of the change before merging it...
I cannot guarantee anything except that I tested it a lot on a real system with the different combination of ncm-ncd
error or success. And I tried to described as much as possible my analysis of the problem to help understand the fix (which is in fact pretty simple).
Ok, 14.6 is delayed. If we can't release by Friday (2014-07-04) then it's aborted and we work on 14.7.
I'll review this tonight.
lfmunozmejias@gmail.com
Thanks.
@piojo Thanks for your comments. I have seen that there were some other inconsistencies but as we were already in the RC process I tried to do the very minimal changes required (and usage of CAF::Process instead of backquotes). I'll implement your suggestions.
I just committed the changes suggested by @piojo. I had no time to test them yet, I plan to do it at the end of the morning. Others can give it a try, as the changes are really cosmetic compared to the last version tested.
With the removed check about ncm-ncd
existence, CAF::Process exits with an uncaught exception:
Uncaught exception!!! Calling stack is:
LC::Exception::throw_error called at /usr/lib/perl/LC/Process.pm line 796
LC::Process::execute called at /usr/lib/perl/LC/Process.pm line 823
LC::Process::output called at /usr/lib/perl/CAF/Process.pm line 204
CAF::Process::output called at /usr/sbin/ncm-cdispd line 1001
main::launch_ncd called at /usr/sbin/ncm-cdispd line 561
main::init_components called at /usr/sbin/ncm-cdispd line 1070
*** exec(/usr/sbin/ncm-ncd, --configure, profile, network, vomsclient, aiiserver, altlogrotate, dirperm, modprobe, dpmlfc, spma, cdp, ldconf, xrootd, iptables, ntpd, sudo, grub, ccm, hostsaccess, sysconfig, named, accounts, filecopy, mkgridmap, gip2, mysql, cron, etcservices, nrpe, chkconfig, gridmapdir, useraccess, lcgbdii, --state, /var/run/quattor-components): No such file or directory
Is it expected?
@jouvin how did you get the stacktrace. was this on a system with ncm-ncd missing?
For my test, I just renamed /usr/sbin/ncm-ncd
on a system where everything is installed. So just the command is missing, every else is there.
@jouvin what do expect to happen with missing executable? the presence should be forced via the packager (either shipped with the package or with a dependency set). or do you anticipate different location for the executable?
From @piojo's comment, I thought that it was handle in some way by CAF::Process. Sorry for my misunderstanding. I'll restore the test for existence, just not exiting in case it is missing.
Sorry for closing this by mistake... reopened!
The last content of this PR has been EXTENSIVELY tested at GRIF, with I think all possible combinations of error conditions at ncm-cdispd
startup or during the life of the daemon. I did a significant cleanup in error messages and documentation but refrained to do a full review of the code. This may be for another release/PR.
Please test and comment!
I know the identation is not compliant with our standard, this is not my fault! This script has been written mainly with a 2 space identation... Will not prevent a merge I hope...
Apart from fixing identation, I also fixed a documented race condition that could happen if a component was present in the config and failing. If the next profile had the component removed, the component was staying on the list of component to run because of its failure at the previous run. The only case where it can still happen is:
ncm-ncd
should ignore it and this is probably rare enough to avoid processing this in ncm-dispd
.Tested and retested ad nauseam at GRIF...
Thanks! I'll leave this for wiser gurus to merge when they are happy.
Small remarks inlined in the commits. It's almost ready to merge.
Thanks. I'm trying to do it early this morning.
I think I addressed all @piojo comments (except the request to merge lines that I didn't identify but I think this is very minor and can be let for further cleanups!). Tested at GRIF (including a SL5 machine after last @piojo remarks). As for me, ready for merging. If a RC3 is out this afternoon, I may give it a try.
I presume @piojo meant lines 680:681, so that
}
else {
becomes
} else {
Yes, that what I meant.
Also, this is good to merge now. Heading to the ncm-grub PR...
The ncm-grub PR was merged. Merging this one too.
Ha! Missed it! This is the result of Eclipse reformatting. I fixed most of them but missed this one. Anyway, I'll open a new issue for 14.8 with additional code cleanups that may be useful.
CAF::Process
to execute ncm-ncdFixes #4