mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Now qsys for PBS/Torque schedulers get the actual job_id #187

Closed strazto closed 4 years ago

strazto commented 4 years ago

Aims to resolve #186

I've made the necessary changes to extract the actual job ID of the submitted worker / worker array on PBS systems, and I've assumed (after a small amount of research) that the same applies to SGE, so I've implemented it in the SGE superclass.

That said, I don't have access to an SGE scheduler, and the man page wasn't explicit about what exactly qsub prints to stdout on SGE. In any case, the man page DOES suggest that on SGE systems, JOB_NAME and JOB_ID are distinct values.
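To make the difference concrete, here is an illustrative-only sketch of parsing the two stdout formats discussed in this thread. The function name and behaviour are assumptions for discussion, not clustermq code:

```r
# Illustrative sketch: qsub's stdout differs by scheduler.
parse_qsub_stdout <- function(out, scheduler = c("pbs", "sge")) {
  scheduler <- match.arg(scheduler)
  if (scheduler == "pbs") {
    # PBS Pro: stdout *is* the job id, e.g. "1234.pbsserver"
    return(trimws(out))
  }
  # SGE: stdout is a human-readable message, e.g.
  #   Your job-array 61722135.1-4:1 ("cmq7610") has been submitted
  # so pull out the leading numeric job id
  m <- regmatches(out, regexpr("[0-9]+", out))
  if (length(m) == 0)
    stop("could not parse qsub output: ", out)
  m
}
```

Whether the extracted SGE id should include the array range (`.1-4:1`) would depend on what qdel accepts, which is exactly the ambiguity discussed below.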

mschubert commented 4 years ago

I find the SGE documentation a bit ambiguous, because qdel is stated to require a job identifier, which may or may not be covered by JOB_NAME.

Pinging @wlandau, if I remember correctly you are using SGE.

Does forced job cleanup work for you in the current implementation? (e.g., are jobs cancelled with Ctrl+C?)

wlandau-lilly commented 4 years ago

Does forced job cleanup work for you in the current implementation? (e.g., are jobs cancelled with Ctrl+C?)

Unfortunately not. Script:

# issue187.R
f <- function(x) Sys.sleep(120)
options(clustermq.scheduler = "sge", clustermq.template = "sge_cmq.tmpl")
library(clustermq)
Q(f, x = seq_len(4), n_jobs = 4)

Template file:

# sge_cmq.tmpl
# From https://github.com/mschubert/clustermq/wiki/SGE
#$ -N {{ job_name }}               # job name
#$ -t 1-{{ n_jobs }}               # submit jobs as array
#$ -j y                            # combine stdout/error in one file
#$ -o {{ log_file | /dev/null }}   # output file
#$ -cwd                            # use pwd as work dir
#$ -V                              # use environment variable
#$ -pe smp 1                       # request 1 core per job
#$ -l h_vmem=16G
#ulimit -v $(( 1024 * {{ memory | 4096 }} ))
ulimit -v unlimited
module load R
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

I ran the script in the terminal, let the jobs initialize, and then killed the master process with CTRL-C. In qstat, I saw the jobs continue running for a full two minutes before I manually killed them with qdel.

$ Rscript issue187.R
Submitting 4 worker jobs (ID: 7610) ...
Running 4 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
[----------------------------------------------------]   0% (4/4 wrk) eta:  ?s^C
Error in rzmq::poll.socket(list(private$socket), list("read"), timeout = msec) : 
  The operation was interrupted by delivery of a signal before any events were available.
Calls: Q -> Q_rows -> master -> <Anonymous> -> <Anonymous>
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `qdel Your job-array 61722135.1-4:1 ("cmq7610") has been submitted &'
Execution halted
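The qdel failure above can be reproduced in isolation: the full SGE submit message was captured as the "job id", so the cleanup command expands to a shell line with unquoted parentheses, which sh rejects before qdel is ever invoked. A minimal reproduction (illustrative only):

```shell
# The "job id" captured from SGE's qsub is the whole submit message:
job_id='Your job-array 61722135.1-4:1 ("cmq7610") has been submitted'
# Interpolating it into the cleanup call yields a line with unquoted
# parentheses, so sh reports a syntax error before running qdel:
sh -c "qdel $job_id &" 2>&1 || true
```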
strazto commented 4 years ago

Does forced job cleanup work for you in the current implementation? (e.g., are jobs cancelled with Ctrl+C?)

By "current implementation" you mean that of my PR, right?

Assuming @wlandau is using my fork, I see from their output that the stdout returned by qsub under SGE is an unstructured, human-readable message.

This suggests that extracting job_id from qsub's stdout is more specific to PBS, so I should move that implementation out of the SGE superclass.

mschubert commented 4 years ago

SGE should probably have the same, but your regex is not catching its format (I know that LSF does too). I just don't know how stable this human-readable output is across version changes, so I prefer a documented API if possible.

As far as I understand, @wlandau tested your PR (because the error message looks very much like it). @wlandau, did Ctrl+C work with the current master? (to be specific: did the PR break it?)

strazto commented 4 years ago

SGE should probably have the same, but your regex is not catching its format

There's actually no regex; it just captures stdout in its entirety, which is the specified format of PBS (Pro 13) qsub.

I just don't know how stable this human-readable output is with version changes, so I prefer a documented API if possible.

It bothers me that the developers chose to use such a useful IPC channel as stdout to encode the message in an unstructured format, when something as simple as returning the job ID would give all the information a user needs in context, but oh well.

In the case of PBS, the documented API is very simple and user-friendly: the stdout is simply the job's ID.

Perhaps if we continue to capture stdout unmodified, as I have, we can provide a method get_job_id(qsub_stdout) and subclass SGE for versions with different APIs?

wlandau commented 4 years ago

Confirmed: using https://github.com/mschubert/clustermq/commit/e7c68edc2e3b0ac390c6b5eb48032ecdba34ebe6, CTRL-C on the master process from https://github.com/mschubert/clustermq/pull/187#issuecomment-578577370 kills the SGE workers.

strazto commented 4 years ago

Thanks for the information, @wlandau. It looks like I'll be moving the implementation out of the SGE class and into PBS, in that case.

strazto commented 4 years ago

Torque and PBS are a little more similar, I think, so I'd be curious to know whether this applies to Torque as well.

mschubert commented 4 years ago

@mstr3336 Yes, it looks like it's better to move your solution to the derived PBS class. It would also be good to make sure Torque works, but I'm not aware of any user that has contacted me about it.

strazto commented 4 years ago

@wlandau, would you be able to test the SGE implementation? I've added a private field, qsub_stdout, to SGE, and a private method, set_job_id(), that uses private$job_name to set private$job_id.

In the PBS subclass, set_job_id() is overridden to use qsub_stdout to set the job_id.
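The layout described in the two comments above can be sketched with R6 (clustermq's qsys classes are R6-based). This is an illustrative sketch only; field and method names follow the comments, not clustermq's actual source:

```r
library(R6)

# Sketch: the SGE base class keeps qsub's raw stdout and exposes a
# single overridable hook, set_job_id(); scheduler subclasses only
# reimplement that hook.
SGE <- R6Class("SGE",
  public = list(
    initialize = function(job_name, qsub_stdout) {
      private$job_name <- job_name
      private$qsub_stdout <- qsub_stdout
      private$set_job_id()
    },
    job_id = function() private$job_id_
  ),
  private = list(
    job_name = NULL, qsub_stdout = NULL, job_id_ = NULL,
    # SGE keeps the previous behaviour: the job name doubles as the id
    set_job_id = function() private$job_id_ <- private$job_name
  )
)

PBS <- R6Class("PBS", inherit = SGE,
  private = list(
    # PBS qsub prints the real job id on stdout
    set_job_id = function() private$job_id_ <- trimws(private$qsub_stdout)
  )
)
```

Because set_job_id() is resolved through R6 inheritance, the PBS override is picked up even though it is called from the shared initialize().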

wlandau commented 4 years ago

Re https://github.com/mschubert/clustermq/pull/187#issuecomment-580092526, 21ae2fa946ad99d89f51008e86728af6e3637974 works for me now. The workers now terminate when I interrupt the master.

strazto commented 4 years ago

@mschubert, I've done everything except drop qsub_stdout as a private field.

I also fixed a logical error in the status check of the qsub call.

Also, I made Torque inherit from PBS, but I'm on the fence about the correctness of this choice, so if you think it would be better form to implement the same set_job_id() as PBS separately for Torque, I'm happy to.