Running PBS Pro 13.1.0.16056, having modified the submission templates as per #184 and my PR #185, I've noticed that the worker nodes don't clean up neatly when the master node terminates.
When submitting using `drake`, the workflow is supposed to stop upon failure of a target, and the workers are supposed to be terminated. Instead, when a target fails, I get the following error:

```
qdel: illegally formed job identifier: cmq7082
```

This corresponds to the job name of the job array, given by the socket of (I assume) the first worker in the array (or maybe the master).
When I examine the output of `qstat -u mstr3336 -x`, I see the following:

```
pbsserver:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3965541.pbsserv mstr3336 small    run_make_h  60985   1   1   16gb 23:59 F 00:44
3965544[].pbsse mstr3336 small    cmq7082        --   1   1    4gb 23:59 F    --
```

We see that the job ID of our batch array is `3965544[]`, and the job name was indeed given by our submission script.
Excerpt from the PBS Guide:

> ## Chapter 2: Submitting a PBS Job
>
> ### 2.1.3 The Job Identifier
>
> After you submit a job, PBS returns a job identifier. Format for a job:
>
> `<sequence number>.<server name>`
>
> Format for a job array:
>
> `<sequence number>[].<server name>`
>
> You’ll need the job identifier for any actions involving the job, such as checking job status, modifying the job, tracking the job, or deleting the job.
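As a quick illustration (the regex below is my own paraphrase of the format quoted above, not an official grammar), a bare job name like `cmq7082` does not match the identifier format that `qdel` expects:

```r
# Rough check of the PBS job identifier format quoted above:
# "<sequence number>.<server name>" for a job,
# "<sequence number>[].<server name>" for a job array.
is_pbs_job_id = function(x) grepl("^[0-9]+(\\[[0-9]*\\])?\\..+$", x)

is_pbs_job_id("3965544[].pbsserver")  # TRUE  -- what qdel expects
is_pbs_job_id("cmq7082")              # FALSE -- hence the qdel error
```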
Additionally, the environment variable `PBS_JOBID` is exposed to the `.pbs` script.

So it's clear that either:

- the return value of `qsub` for the batch job is needed, or
- `PBS_JOBID` somehow needs to be sent back to the master.
My intuition tells me that getting the return value of `qsub` is the simpler option, though there is a wrinkle: the result of `system(...)` is just the command's exit status, not its output. After checking the man page for `system`, we can see that by setting `intern = TRUE`, and then doing a little extra work to retrieve the exit status, we are able to access both.
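To illustrate (using `echo` as a stand-in for `qsub`, since I can only test the real thing on the cluster): with `intern = TRUE`, `system()` returns the command's stdout, and a non-zero exit status would be attached as the `"status"` attribute along with a warning:

```r
# echo stands in for qsub here; on PBS, qsub prints the job identifier
# (e.g. "3965544[].pbsserver") that qdel later needs.
out = system("echo '3965544[].pbsserver'", intern = TRUE)
out                  # "3965544[].pbsserver" -- the captured stdout
attr(out, "status")  # NULL, i.e. the command exited with status 0
```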
I'll experiment with this, and then put in a PR if all goes well.
Referring to the SGE child class:

https://github.com/mschubert/clustermq/blob/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6/R/qsys_sge.r#L26-L38

We see that the `finalize` function calls `qdel` on `job_id`, which seems okay, but looking closer at the `submit_jobs` implementation:

https://github.com/mschubert/clustermq/blob/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6/R/qsys_sge.r#L14-L17

`job_id` is simply given by `job_name`. (`job_name` is inherited from the following:

https://github.com/mschubert/clustermq/blob/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6/R/qsys.r#L221-L237

)
Uh oh! This is not in concordance with the PBS specs (PBS Professional 18.2 User's Guide, UG-13, excerpted above).
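A minimal sketch of what I have in mind (the function and argument names are mine, not clustermq's, and `echo` again stands in for `qsub`): capture the scheduler's stdout and keep that as `job_id`, instead of reusing `job_name`:

```r
# Hypothetical helper: run the submission command, fail loudly on a
# non-zero exit status, and return the job identifier qsub printed.
submit_and_get_id = function(cmd) {
    out = system(cmd, intern = TRUE)
    status = attr(out, "status")
    if (!is.null(status))
        stop("submission failed with status ", status)
    tail(out, 1)  # the last stdout line is the job identifier
}

# In real use cmd would be "qsub <template>"; echo stands in here:
submit_and_get_id("echo '3965544[].pbsserver'")  # "3965544[].pbsserver"
```

This identifier, not the job name, is what `finalize` should then hand to `qdel`.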
For reference, the `system` call used for submission:

https://github.com/mschubert/clustermq/blob/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6/R/qsys_sge.r#L19-L22