mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Control job name (as passed to scheduler) #121

Closed HenrikBengtsson closed 7 years ago

HenrikBengtsson commented 7 years ago

Using TORQUE / PBS as an example, one can use:

#PBS -N <%= job.hash %>

in the template file to set the name of the job to the batchtools job hash. This makes the job hash show up as the name in the qstat output.

I'd like to use a custom, more informative job name that I can control via R and that is not hard-coded in the template. In BatchJobs you could set a job prefix, and BatchJobs then appended the job ID index. With batchtools, I wish to do something similar. I'm trying to find a way to use a string other than the job.hash for this, but I cannot figure out how to do it without it becoming a hack, e.g. by somehow passing it via resources.

Any suggestions?

mllg commented 7 years ago

Do you want to set the same prefix for all jobs of a registry or do you need more flexibility and want to have individual job names?

I would rather not restore the old naming scheme ([reg.id]_[job.id]) because this can lead to duplicated job names (start a big chunk with first job id = 1, after the first job terminated submit job with id = 1 again -> clash). On the other hand, I'm not sure if any scheduler would complain about this.

I could instead extend the JobCollection to have a slot job.name which you could set via submitJobs, with a fallback to the job hash.

HenrikBengtsson commented 7 years ago

Do you want to set the same prefix for all jobs of a registry or do you need more flexibility and want to have individual job names?

For my case (future.batchtools), for now I'd be happy to set it per registry, and then similarly to BatchJobs have a suffix index appended, e.g. <job.name> := <reg.name>-<job.id>.

However, ideally it could be set per job, e.g. <job.name>. For instance, this would allow you to use job names such as sample_a-chr01, sample_a-chr02, and sample_b-chr21. In the future.batchtools world, this would be controlled as:

x %<-% { process_data(x, y) } %label% sprintf("%s-chr%02d", sample, chr)
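
The two naming schemes described above can be sketched in base R. This is only an illustration of how such names could be generated: reg_name, job_ids, and the jobs table are hypothetical values made up for this example, not something batchtools or future.batchtools provides.

```r
# Hypothetical inputs, chosen only for illustration.
reg_name <- "myreg"
job_ids <- 1:3

# Per-registry scheme: <job.name> := <reg.name>-<job.id>
per_registry <- sprintf("%s-%d", reg_name, job_ids)
print(per_registry)

# Per-job scheme: one informative name per (sample, chromosome) pair.
jobs <- data.frame(
  sample = c("sample_a", "sample_a", "sample_b"),
  chr = c(1L, 2L, 21L),
  stringsAsFactors = FALSE
)
per_job <- sprintf("%s-chr%02d", jobs$sample, jobs$chr)
print(per_job)

# Duplicated names were raised as a possible concern in this thread,
# so it can be worth checking for them before submitting.
no_duplicates <- anyDuplicated(per_job) == 0L
print(no_duplicates)
```

The duplicate check is optional; as noted in the discussion, TORQUE identifies jobs by PBS_JOBID and accepts repeated names.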

I would rather not restore the old naming scheme ([reg.id]_[job.id]) because this can lead to duplicated job names (start a big chunk with first job id = 1, after the first job terminated submit job with id = 1 again -> clash). On the other hand, I'm not sure if any scheduler would complain about this.

I actually never thought about that as a potential problem, but true, there could be a scheduler out there that doesn't like duplicated job names. I'm mostly familiar with TORQUE, which has no problem with duplicated PBS_JOBNAME values; it only cares about the PBS_JOBID.

mllg commented 7 years ago

Does #124 work for you? Do you miss anything?

HenrikBengtsson commented 7 years ago

This looks great - being able to call setJobNames(ids, names) prior to submitJobs() to control the new job.name seems to provide maximum flexibility. I've confirmed that this also works in the future.batchtools framework, with and without defaults:

> a %<-% { process_data(x, y) }
> b %<-% { process_data(x, y) } %label% "biom-chr13"

which adds the following jobs to the queue:

$ qstat -u $USER
Job ID  Job name         PID      NDS    TSK    RAM      Time S     Since   Nodes/cores
------- ---------------- ------ ----- ------ ------ --------- - ---------   -----------
901722  jobd085ee8bfeae2    --      1      1    --   99:23:59 Q       --     -- 
901723  biom-chr13          --      1      1    --   99:23:59 Q       --     -- 

Thanks!

One comment though: is there actually a need to limit the allowed character set / pattern, e.g.

> setJobNames(ids = 1L, names = "a*b")
Assertion on 'names' failed: Must comply to pattern '^[[:alnum:]_.-]+$'.
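
To see which names the pattern quoted in that assertion message accepts, it can be tried directly with base R's grepl(). The helpers is_restricted_name() and sanitize_name() below are hypothetical and written only for this sketch; they are not batchtools functions.

```r
# Pattern copied from the assertion message above.
pattern <- "^[[:alnum:]_.-]+$"

# Hypothetical helper: TRUE for names the pattern would accept.
is_restricted_name <- function(x) grepl(pattern, x)

# Hypothetical helper: replace every disallowed character with "_".
# Only relevant if a scheduler turns out to be stricter than TORQUE.
sanitize_name <- function(x) gsub("[^[:alnum:]_.-]", "_", x)

candidates <- c("biom-chr13", "sample_a.chr01", "a*b", "c=d", "")
accepted <- is_restricted_name(candidates)
print(data.frame(name = candidates, accepted = accepted))

print(sanitize_name("a*b"))
```

Letters, digits, underscore, dot, and hyphen pass; anything else, including the empty string, is rejected.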

I guess this stems from the "old days" when job names had to be valid file names. FYI, at least TORQUE / PBS seems pretty liberal about what the job name can be, e.g.

Job ID  Job name         PID      NDS    TSK    RAM      Time S     Since   Nodes/cores
------- ---------------- ------ ----- ------ ------ --------- - ---------   -----------
901725  a*b,c=d,e%f,g\nh    --      1      1    --   99:23:59 Q       --     -- 

mllg commented 7 years ago

I've removed the check for job names and merged into master. The updated version will be uploaded to CRAN next week.
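
For reference, the workflow discussed in this thread might look roughly like the sketch below. It assumes that batchtools is installed and that the scheduler template uses job.name (for example in the #PBS -N line) instead of job.hash; the toy function and the names are made up for illustration. The batchtools part is skipped when the package is not available, and submitJobs() is left commented out because it needs a configured cluster backend.

```r
# Names to assign, one per job (hypothetical values).
job_names <- sprintf("sample_a-chr%02d", 1:3)

if (requireNamespace("batchtools", quietly = TRUE)) {
  # Temporary registry (file.dir = NA) that does not become the default.
  reg <- batchtools::makeRegistry(file.dir = NA, make.default = FALSE)

  # Define three toy jobs; the function body is a placeholder.
  ids <- batchtools::batchMap(function(chr) chr, chr = 1:3, reg = reg)

  # Set the names before submitting, as suggested in the thread.
  batchtools::setJobNames(ids, names = job_names, reg = reg)
  print(batchtools::getJobNames(ids, reg = reg))

  # batchtools::submitJobs(ids, reg = reg)  # requires a cluster setup
}
```

Jobs without an explicitly set name fall back to the job hash, as described earlier in the thread.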