strazto closed this 4 years ago
I find the SGE documentation a bit ambiguous, because scancel is stated to require a job identifier, which may or may not be covered by JOB_NAME.
Pinging @wlandau, if I remember correctly you are using SGE.
Does forced job cleanup work for you in the current implementation? (e.g., are jobs cancelled with Ctrl+C?)
Does forced job cleanup work for you in the current implementation? (e.g., are jobs cancelled with Ctrl+C?)
Unfortunately not. Script:
# issue187.R
f <- function(x) Sys.sleep(120)
options(clustermq.scheduler = "sge", clustermq.template = "sge_cmq.tmpl")
library(clustermq)
Q(f, x = seq_len(4), n_jobs = 4)
Template file:
# sge_cmq.tmpl
# From https://github.com/mschubert/clustermq/wiki/SGE
#$ -N {{ job_name }} # job name
#$ -t 1-{{ n_jobs }} # submit jobs as array
#$ -j y # combine stdout/error in one file
#$ -o {{ log_file | /dev/null }} # output file
#$ -cwd # use pwd as work dir
#$ -V # use environment variables
#$ -pe smp 1 # request 1 core per job
#$ -l h_vmem=16G
#ulimit -v $(( 1024 * {{ memory | 4096 }} ))
ulimit -v unlimited
module load R
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
I ran the script in the terminal, let the jobs initialize, and then killed the master process with CTRL-C. In qstat, I saw the jobs continue running for a full two minutes before I manually killed them with qdel.
$ Rscript issue187.R
$ Submitting 4 worker jobs (ID: 7610) ...
Running 4 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
[----------------------------------------------------] 0% (4/4 wrk) eta: ?s^C
Error in rzmq::poll.socket(list(private$socket), list("read"), timeout = msec) :
The operation was interrupted by delivery of a signal before any events were available.
Calls: Q -> Q_rows -> master -> <Anonymous> -> <Anonymous>
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `qdel Your job-array 61722135.1-4:1 ("cmq7610") has been submitted &'
Execution halted
Does forced job cleanup work for you in the current implementation? (e.g., are jobs cancelled with Ctrl+C?)
By "current implementation" you mean that of my PR, right?
Assuming @wlandau is using my fork, I see from their output that the STDOUT returned by qsub under SGE is an unstructured, human-readable message.
This suggests that extracting job_id from the qsub STDOUT is more specific to PBS, so I should move that implementation out of the SGE superclass.
SGE should probably have the same, but your regex is not catching its format (I know that LSF does too). I just don't know how stable this human-readable output is with version changes, so I prefer a documented API if possible.
As far as I understand, @wlandau tested your PR (because the error message looks very much like it). @wlandau, did Ctrl+C work with the current master? (to be specific: did the PR break it?)
SGE should probably have the same, but your regex is not catching its format
There's actually no regex; it just captures STDOUT in its entirety, which is the specified format of PBS (Pro 13) qsub.
I just don't know how stable this human-readable output is with version changes, so I prefer a documented API if possible.
It bothers me that the developers chose to use such a useful tool for IPC as STDOUT to encode the message in an unstructured format, when something as simple as returning the job id would give all the information a user needs in context, but oh well.
In the case of PBS, the documented API is very simple and user-friendly: it is simply the job's ID.
Perhaps if we continue to capture STDOUT unmodified as I have, we can provide a method get_job_id(qsub_stdout) and subclass SGE for versions with different APIs?
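For illustration only, here is a minimal sketch of that idea, not clustermq's actual code; the SGE message format is copied from the output earlier in this thread, and the bare-job-id STDOUT for PBS is assumed from the discussion above:
get_job_id <- function(qsub_stdout, scheduler = c("pbs", "sge")) {
  scheduler <- match.arg(scheduler)
  if (scheduler == "pbs") {
    # PBS (Pro 13): qsub prints just the job id, e.g. "61722135.pbsserver" (hypothetical server name)
    trimws(qsub_stdout)
  } else {
    # SGE: e.g. 'Your job-array 61722135.1-4:1 ("cmq7610") has been submitted'
    sub(".*[Yy]our job(-array)? ([0-9]+)[^ ]* .*", "\\2", qsub_stdout)
  }
}
get_job_id('Your job-array 61722135.1-4:1 ("cmq7610") has been submitted', "sge")
# [1] "61722135"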
Confirmed: using https://github.com/mschubert/clustermq/commit/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6, CTRL-C on the master process from https://github.com/mschubert/clustermq/pull/187#issuecomment-578577370 kills the SGE workers.
Thanks for the information, @wlandau - it looks like I'll be moving the implementation out of the SGE class to PBS in that case.
Torque and PBS are a little more similar, I think, so I'd be curious to know whether this applies to Torque as well.
@mstr3336 Yes, it looks like it's better to move your solution to the derived PBS class. It would also be good to make sure Torque works, but I'm not aware of any user that has contacted me about it.
@wlandau, would you be able to test the SGE implementation?
I've added a private field, qsub_stdout, to SGE, and added a private method set_job_id() to SGE that uses private$job_name to set private$job_id.
In the PBS subclass, set_job_id() is overridden to use qsub_stdout to set the job_id.
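To picture the arrangement, a simplified R6 sketch follows; the class, field, and method names are taken from this thread rather than the actual clustermq source, which is organised differently:
library(R6)

SGE <- R6Class("SGE",
  private = list(
    job_name = NULL,
    job_id = NULL,
    qsub_stdout = NULL,
    # On SGE the job name is assumed to be enough to address the array,
    # so it doubles as the id
    set_job_id = function() {
      private$job_id <- private$job_name
    }
  )
)

PBS <- R6Class("PBS", inherit = SGE,
  private = list(
    # On PBS the real id comes from qsub's STDOUT instead
    set_job_id = function() {
      private$job_id <- trimws(private$qsub_stdout)
    }
  )
)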
Re https://github.com/mschubert/clustermq/pull/187#issuecomment-580092526, 21ae2fa946ad99d89f51008e86728af6e3637974 works for me now. The workers now terminate when I interrupt the master.
@mschubert, I've done everything except drop qsub_stdout as a private field.
I also fixed a logical error in the status check of the qsub call.
Also, I made Torque inherit from PBS, but I'm on the fence about the correctness of this choice, so if you think it would be better form to implement the same set_job_id() as PBS separately for Torque, I'm happy to.
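In terms of the sketch above, the question is whether the following is enough, or whether Torque needs its own override (assumed class names again):
# Torque reuses PBS's set_job_id() unchanged; if its qsub STDOUT format
# differs, it would override set_job_id() the same way PBS overrides SGE's.
TORQUE <- R6Class("TORQUE", inherit = PBS)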
Aims to resolve #186
I've performed the necessary changes to extract the actual job id of the submitted worker / worker array for PBS systems, and I've made the assumption (with a small amount of research) that they also apply to SGE, and implemented it for the SGE superclass.
That said, I don't have an SGE scheduler, and the man page wasn't explicit about what exactly the STDOUT return of qsub is for SGE, but in any case, the man page DOES suggest that for SGE systems, JOB_NAME and JOB_ID are distinct values.
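As a rough illustration of that extraction at submit time (the variable names and the system2() call are assumptions for this sketch, not the actual clustermq submission code; get_job_id() is the sketch shown earlier in this thread):
# filled_template_file is a hypothetical path to the rendered submit script
stdout_lines <- system2("qsub", stdin = filled_template_file, stdout = TRUE)
job_id <- get_job_id(paste(stdout_lines, collapse = " "), scheduler = "pbs")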