mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

PBS Batch jobs don't clean up properly (Illegal job identifier) (PBSPro_13.1.0.160576) #186

Closed · strazto closed this issue 4 years ago

strazto commented 4 years ago

Running PBS Pro 13.1.0.16056, having modified the submission templates as per #184 and my PR #185, I've noticed that the worker nodes don't clean up neatly when the master node terminates.

When submitting via drake, the workflow is supposed to stop upon failure of a target, and the workers should be terminated.

When a target fails, I subsequently get the following error:

qdel: illegally formed job identifier: cmq7082

This corresponds to the job name for the job array, given by the socket of (I assume) the first worker in the array (or maybe the master).

When I examine the output of qstat -u mstr3336 -x, I see the following:

pbsserver:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3965541.pbsserv mstr3336 small    run_make_h  60985   1   1   16gb 23:59 F 00:44
3965544[].pbsse mstr3336 small    cmq7082             --    1   1    4gb 23:59 F   --

We see that the job ID of our batch array is 3965544[], and the job name was indeed given by our submission script.
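As a point of reference (illustrative only, using the IDs from the qstat output above), qdel accepts the array's job identifier but not its name:

```r
# Illustrative only, using the IDs from the qstat output above.
# The [] is quoted so the shell doesn't try to glob it.
system('qdel "3965544[]"')   # full array identifier: accepted by PBS
system("qdel cmq7082")       # job name only: "illegally formed job identifier"
```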

Referring to the SGE child class:

https://github.com/mschubert/clustermq/blob/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6/R/qsys_sge.r#L26-L38

We see that the finalize function calls qdel on job_id, which seems okay, but looking closer at the submit_jobs implementation:

https://github.com/mschubert/clustermq/blob/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6/R/qsys_sge.r#L14-L17

job_id is simply given by job_name.

(job_name is inherited from the following:

https://github.com/mschubert/clustermq/blob/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6/R/qsys.r#L221-L237

)
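Put together, the flow effectively amounts to the following (a simplified, hypothetical paraphrase of the linked lines, not the verbatim clustermq code):

```r
# Simplified paraphrase of the linked code (hypothetical names, not verbatim):
job_name <- "cmq7082"        # generated name, passed to the template as job_name
job_id   <- job_name         # ...and also stored as the "job id";
                             # the identifier that qsub prints is never captured

# Later, finalize() builds the cleanup call from that stored value:
paste("qdel", job_id)
#> [1] "qdel cmq7082"         # which PBS rejects as an illegal job identifier
```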

Uh oh! This does not conform to the PBS spec:

PBS Professional 18.2 User’s Guide UG-13

Excerpt from the PBS Guide:

> ## Submitting a PBS Job (Chapter 2)
>
> ### 2.1.3 The Job Identifier
>
> After you submit a job, PBS returns a job identifier. Format for a job:
>
> `<sequence number>.<server name>`
>
> Format for a job array:
>
> `<sequence number>[].<server name>.<domain name>`
>
> You'll need the job identifier for any actions involving the job, such as checking job status, modifying the job, tracking the job, or deleting the job.

Additionally, the environment variable PBS_JOBID is exposed for the .pbs script.
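For example (illustrative; the value shown is only a guess based on the qstat output above), a worker started by the .pbs script can read its own identifier directly:

```r
# Inside a job launched by the .pbs script, PBS exposes the full identifier:
Sys.getenv("PBS_JOBID")
#> [1] "3965544[1].pbsserver"   # an array sub-job would look roughly like this
```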

So it's clear that either:

  1. the return value of qsub for the batch job is needed, or
  2. PBS_JOBID somehow needs to be sent back to the master.

My intuition tells me that getting the return value of qsub is the simpler option, though given the following:

https://github.com/mschubert/clustermq/blob/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6/R/qsys_sge.r#L19-L22

By default, the result of system(...) is just the command's exit status.

After checking the documentation for system(), we can see that by setting intern = TRUE we get the command's output back as a character vector, and with a little extra work (reading its "status" attribute) we can also retrieve the exit status, so we are able to access both.
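A minimal sketch of that approach (not clustermq's actual submission code; the qsub command line and script name are placeholders):

```r
# Capture both qsub's printed job identifier and its exit status in one call.
out <- system("qsub job.pbs", intern = TRUE)

# With intern = TRUE, the exit status is attached as the "status" attribute;
# it is NULL when the command exited with status 0.
status <- attr(out, "status")
if (!is.null(status) && status != 0)
    stop("qsub failed with exit status ", status)

job_id <- out[1]   # e.g. "3965544[].pbsserver" -- the identifier qdel expects
```

finalize() could then call qdel on this captured identifier rather than on the job name.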

I'll experiment with this and put in a PR if all goes well.

strazto commented 4 years ago

@mschubert, would you be able to review my PR regarding this?

mschubert commented 4 years ago

For completeness, the rest of the discussion is in https://github.com/mschubert/clustermq/pull/187