lanl / Pavilion

HPC testing harness
BSD 3-Clause "New" or "Revised" License

Sometimes tests show up with 0 nodes in get_results #14

Closed cadejager closed 8 years ago

cadejager commented 8 years ago

Sometimes tests run but show up with 0 nodes in the get_results output. The node count is calculated by reading the nodes line from the log file in the results directory. In the tests reported as 0x36, that line is blank.

hpl-intel.0x36() - total_runs:30, passed:15, failed:0, undefined:15, incomplete:0, [753.97 secs]
hpl-intel.1x36() - total_runs:668, passed:533, failed:0, undefined:135, incomplete:0, [1202.50 secs]
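As a rough illustration of why a blank nodes line ends up as 0 nodes, here is a minimal Python sketch. The `<nodes>` tag, the whitespace-separated node names, and the `count_nodes` helper are all assumptions made for the example; the actual log format and the get_results code may differ.

```python
def count_nodes(log_lines, tag="<nodes>"):
    """Count node names on the first line that starts with `tag`.

    Assumption: the log has a line like "<nodes> node1 node2" listing
    whitespace-separated node names. The tag name is hypothetical.
    """
    for line in log_lines:
        if line.startswith(tag):
            # A blank nodes line splits into an empty list, giving 0.
            return len(line[len(tag):].split())
    # No nodes line at all is also reported as 0 here.
    return 0
```

With this logic, a line of `<nodes> n1 n2` counts as 2 nodes, while a bare `<nodes>` line counts as 0, which would match the 0x36 entries above.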

cadejager commented 8 years ago

The checkjob_getNodeList script is where the jobid is found. Sometimes running it produces the following errors:

$ $PVINSTALL/PAV/scripts/checkjob_getNodeList $SLURM_JOBID
ERROR:    server rejected request - redirectport=41560
INFO:     connection failed
2016-09-02T10:56:32.664-0600    40633   ERROR   MClient.c:MCSendToServer:2531   0           communication error cluster-masater:42559 (redirectport=41560)

cadejager commented 8 years ago

checkjob_getNodeList is failing when it calls checkjob. My current plan is to make checkjob_getNodeList retry the checkjob call a few times if it fails. However, I am going to hold off on this until we confirm that we cannot solve the problem just by cleaning up our Moab/Slurm configuration.
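The retry idea described above could look roughly like the following Python sketch. It is not the code from the eventual fix: the `run_with_retries` helper, the attempt count, and the delay are hypothetical, and it assumes a failed call shows up as a non-zero exit status or empty output.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=2.0, runner=subprocess.run):
    """Run `cmd`, retrying when it fails.

    Assumption: failure means a non-zero exit status or blank stdout.
    `runner` is injectable so the logic can be exercised without a
    real scheduler. Returns stdout on success, None otherwise.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE, universal_newlines=True)
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
        if attempt < attempts:
            # Give a transient server/communication error time to clear.
            time.sleep(delay)
    return None
```

A caller might use it as `run_with_retries(["checkjob", job_id])` and treat a `None` result as "node list unavailable" instead of silently recording 0 nodes.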

cadejager commented 8 years ago

This has been fixed by commit ee71e8243547e189debeeddefe0ec646881ab063