Closed cadejager closed 8 years ago
The checkjob_getNodeList script is where the jobid is found. Sometimes I get the following message:
$ $PVINSTALL/PAV/scripts/checkjob_getNodeList $SLURM_JOBID
ERROR: server rejected request - redirectport=41560
INFO: connection failed
2016-09-02T10:56:32.664-0600 40633 ERROR MClient.c:MCSendToServer:2531 0 communication error cluster-masater:42559 (redirectport=41560)
checkjob_getNodeList is failing when it calls checkjob. My current plan is to make checkjob_getNodeList retry the checkjob call a few times if it fails. However, I am going to wait to do this until we confirm that we cannot solve the problem just by cleaning up our Moab/Slurm configuration.
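For reference, a minimal sketch of the retry idea in Python. This is not the actual checkjob_getNodeList code; the helper name `run_with_retries`, the attempt count, and the delay are all hypothetical, and it assumes checkjob exits with a non-zero status when the server rejects the request.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=2.0):
    """Run cmd and retry when it exits with a non-zero status.

    Returns the command's stdout on success. Raises RuntimeError if
    every attempt fails. (Hypothetical helper; assumes a failed call
    is signalled by a non-zero exit status.)
    """
    last_error = ""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # Keep the most recent error text for the final message.
        last_error = result.stderr.strip() or result.stdout.strip()
        if attempt < attempts:
            # Give the server a moment before asking again.
            time.sleep(delay)
    raise RuntimeError(
        f"{cmd[0]} failed after {attempts} attempts: {last_error}"
    )
```

Usage would look something like `run_with_retries(["checkjob", job_id])`, where `job_id` comes from `$SLURM_JOBID`. A fixed short delay is probably enough here since the failure looks intermittent, but that is a guess until the configuration question is settled.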
This has been fixed by commit ee71e8243547e189debeeddefe0ec646881ab063
Sometimes tests run but get_results reports them with 0 nodes. get_results calculates the node count by reading the nodes line from the log file in the results directory. That line is blank for the tests reported as 0x36.
hpl-intel.0x36() - total_runs:30, passed:15, failed:0, undefined:15, incomplete:0, [753.97 secs]
hpl-intel.1x36() - total_runs:668, passed:533, failed:0, undefined:135, incomplete:0, [1202.50 secs]
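To spot affected entries quickly, here is a rough Python sketch that scans get_results summary text for entries reporting 0 nodes. The regular expression is based only on the output format shown above, and the function name `zero_node_entries` is hypothetical. It assumes the first number in `NxM` is the node count; the meaning of the second number is not needed here.

```python
import re

# Pattern derived from the summary output above, e.g.
# "hpl-intel.0x36() - total_runs:30, passed:15, failed:0, undefined:15, ..."
SUMMARY_RE = re.compile(
    r"(?P<test>[\w-]+)\.(?P<nodes>\d+)x(?P<second>\d+)\(\)\s*-\s*"
    r"total_runs:(?P<total>\d+),\s*passed:(?P<passed>\d+),\s*"
    r"failed:(?P<failed>\d+),\s*undefined:(?P<undefined>\d+)"
)


def zero_node_entries(text):
    """Return (test, total_runs, undefined) for entries reporting 0 nodes."""
    return [
        (m["test"], int(m["total"]), int(m["undefined"]))
        for m in SUMMARY_RE.finditer(text)
        if int(m["nodes"]) == 0
    ]
```

Because it uses `finditer`, it works whether the entries are on separate lines or run together on one line.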