mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

SLURM jobs -- direct them to a specific folder? #192

Closed: jgrn307 closed this issue 6 years ago

jgrn307 commented 6 years ago

Ok, so the array jobs are now working great, BUT I now have a new issue. It looks like batchtools drops the job files into a temporary directory along with an .Rdata file, rather than into the directory specified by the registry (or some user-specified directory). The issue is, if I quit out of the R session that was used to submit the jobs, that directory is deleted on exit, and any jobs that aren't already running fail (the .Rdata file is gone). Can you perhaps drop the job files into e.g. a subfolder of the registry, so I don't have to keep the R "submitter" session idling until the jobs are all submitted? Cheers!
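For illustration: R removes its per-session temp directory on a clean exit, which is exactly what wipes the staged files here. A minimal sketch (the file name is hypothetical):

p = file.path(tempdir(), "job.Rdata")  # staged in the session's RtmpXXXX directory
file.create(p)
file.exists(p)  # TRUE while this session is alive
q("no")         # on a clean exit, tempdir() and everything in it are removed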

mllg commented 6 years ago

The RData files are always stored in the subdirectory jobs of the registry's file.dir. The slave reads this file, then deletes it and starts computation. If you are working with temporary registries (file.dir set to NA), you can set a custom temp directory in your config file, e.g.

temp.dir = "~/temp"
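For reference, batchtools sources its configuration from an R script (batchtools.conf.R in the working directory, or ~/.batchtools.conf.R). A fuller sketch of such a file, with an assumed Slurm template name:

cluster.functions = makeClusterFunctionsSlurm(template = "slurm.tmpl")  # template name is an assumption
temp.dir = "~/temp"  # used for temporary registries (file.dir = NA)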

The job description files (the result of the brewing), on the other hand, are indeed stored in the R session's temporary directory. However, submitJobs() creates each one right before calling sbatch as it iterates over the jobs, and afterwards they are usually not needed anymore.

So, in principle, it should be safe to close the R session on the master and log out. Can you please double-check that the jobs subdirectory is not used for this? Or do array jobs need the brewed job description files until all jobs of the array have terminated?
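For concreteness, a sketch of the chunked array-job submission path being discussed (registry path, worker function, and chunk size are illustrative):

library(batchtools)
reg = makeRegistry(file.dir = "registry")        # persistent registry on shared storage
batchMap(function(x) x^2, x = 1:100, reg = reg)  # toy worker
ids = findJobs(reg = reg)
ids$chunk = chunk(ids$job.id, chunk.size = 10)   # 10 jobs per chunk
# with chunks.as.arrayjobs = TRUE, each chunk is submitted as one Slurm array
# job, i.e. one brewed job description file per chunk
submitJobs(ids, resources = list(chunks.as.arrayjobs = TRUE), reg = reg)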

jgrn307 commented 6 years ago

It looks like it's creating an Rtmp* folder within the same directory as the registry, and dropping the .job file there rather than in the jobs directory.

For instance, I just submitted a job and got the following two folders:

drwxr-sr-x 8 jgreenberg gears-readonly 4096 May 24 13:50 registry_20180524_204853
drwx--S--- 2 jgreenberg gears-readonly 4096 May 24 13:50 RtmpJImxfi

ls -l $TMPDIR/RtmpJImxfi
total 448
-rw-rw-r-- 1 jgreenberg gears-readonly 450626 May 24 13:49 file29cfa3278bb62.Rdata
-rw-rw-r-- 1 jgreenberg gears-readonly    720 May 24 13:50 jobbb64862d6514b05aefccf0a6443e48ce.job

What is that .Rdata file inside the Rtmp* folder (the one holding the SLURM job file)? That is what gets deleted when I quit out of R (as R does its temporary-file cleanup).

mllg commented 6 years ago

That's odd. Can you please execute the following shell commands and report the output?

echo $TMPDIR
echo $TMP
echo $TEMP
Rscript -e "tempdir()"
Rscript --vanilla -e "tempdir()"
Rscript -e "fs::path_temp()"
Rscript --vanilla -e "fs::path_temp()"

Additionally, in R:

batchtools::makeRegistry(file.dir = NA)$temp.dir

jgrn307 commented 6 years ago

Output below:

Singularity gearslaboratory-gears-singularity-master-gears-general.simg:~> echo $TMPDIR
/data/gpfs/assoc/gears/scratch/jgreenberg
Singularity gearslaboratory-gears-singularity-master-gears-general.simg:~> echo $TMP
TMP
Singularity gearslaboratory-gears-singularity-master-gears-general.simg:~> echo $TEMP
TEMP
Singularity gearslaboratory-gears-singularity-master-gears-general.simg:~> Rscript -e "tempdir()"
[1] "/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpKiwHlI"
Singularity gearslaboratory-gears-singularity-master-gears-general.simg:~> Rscript --vanilla -e "tempdir()"
[1] "/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpxjhVNP"
Singularity gearslaboratory-gears-singularity-master-gears-general.simg:~> Rscript -e "fs::path_temp()"
/data/gpfs/assoc/gears/scratch/jgreenberg/Rtmp4igC4O
Singularity gearslaboratory-gears-singularity-master-gears-general.simg:~> Rscript --vanilla -e "fs::path_temp()"
/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpEIHK4T
> batchtools::makeRegistry(file.dir = NA)$temp.dir
No readable configuration file found
Created registry in '/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpnfJShr/registry7ac811a48bc0' using cluster functions 'Interactive'
/data/gpfs/assoc/gears/scratch/jgreenberg/RtmpnfJShr

jgrn307 commented 6 years ago

Re: your earlier question about the job files needing to be around -- I am not 100% sure, but I suspect they may need to persist until the array job is fully running; I imagine the queue manager doesn't want to keep large array job files in memory. I can do some tests if you want, but I think the file deletion may be an issue -- also, what is that file*.Rdata? Regardless of the job file, if that one is needed, deleting it would definitely cause issues. Job dependencies (the other recent request) may also need the files to persist -- not sure, though.

I think it would be "safer" to simply store all the job files and related files where the registry is located. If someone sets file.dir to NA, then sure, go ahead and drop them in the tempdir. But if a formal registry is set, I think it would be helpful to have a folder for the jobs (if you don't have one already) -- you could even have a folder called "slurm" for SLURM jobs, and similar folders for other queue managers. Let the user clean up after themselves in this case; it is sometimes helpful to see the final job files for troubleshooting.

jgrn307 commented 6 years ago

From my sysadmin: "It just stores the environment, and path to the script in slurm". So yeah, it does need the job files after submission.

HenrikBengtsson commented 6 years ago

There's a Linux container (Singularity-based) involved here. It could be that it is configured to do mount-point remapping. If so, I doubt this is a batchtools issue. One can remap differently when calling singularity.

My $0.02

jgrn307 commented 6 years ago

I don't think the Singularity container is the issue -- from R's perspective, it sees mostly the same directories a "native" R would: same environment variables and folder mappings (I have all the relevant folders mapped into the container). The Singularity build of R is using the correct temporary folder (a real folder on our system, not one inside the container). The problem is rather that the job files shouldn't be forced into a temporary folder at all, since they get deleted when I quit out of the "main" R, and any jobs still queued will then fail because they can't find the job script (or that .Rdata file).

A possible workflow would be an Rscript that creates/submits the SLURM jobs but doesn't idle while they run, with a second job aggregating the outputs afterwards. Right now, it looks like I have to keep R running for a long time (until all my jobs complete -- and I'm running ~20,000 4-hour jobs for my process) -- so if that one R session fails for some reason, the rest of the jobs will fail.

Is there a reason not to let the user select the directory for the job files (or is there already a way to do that)? It seems an easy fix would be to simply drop the job files into the registry folder to keep everything neat and persistent, unless the user sets file.dir = NA.
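The submit-then-exit workflow described above maps naturally onto a persistent registry. A sketch in two scripts (process_tile and tiles are hypothetical stand-ins):

# submit.R -- run once, then exit
library(batchtools)
reg = makeRegistry(file.dir = "registry")  # persistent, survives the session
batchMap(process_tile, tile = tiles, reg = reg)
submitJobs(reg = reg)

# aggregate.R -- run later in a fresh session (or as a dependent job)
library(batchtools)
reg = loadRegistry("registry")
waitForJobs(reg = reg)                     # or inspect getStatus(reg) first
results = reduceResultsList(reg = reg)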

mllg commented 6 years ago

For the sake of simplicity, as of commit 7f32f84b681b6ae805dac73fb5d1fb565882ffb7 the job files are now always created in the jobs subdirectory. Let me know if this solves your issues.

The Rdata file in your temp directory is not created by batchtools, though; the package only writes files with the .rds extension.
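A quick way to verify after updating, assuming a registry object reg with submitted jobs:

list.files(file.path(reg$file.dir, "jobs"))
# should now list the brewed *.job files alongside the job payloads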

jgrn307 commented 6 years ago

That did it!