mathieulagrange / doce

Doce is a Python library designed to help you handle the complexity of computational experiments.
Apache License 2.0

Integrating Doce with Jean-zay #200

Open pagrumiaux opened 1 year ago

pagrumiaux commented 1 year ago

For now, I launch jobs on Jean-zay with sbatch script.slurm, where script.slurm contains the code found here: http://www.idris.fr/jean-zay/gpu/jean-zay-gpu-exec_mono_batch.html

The last line of this script runs my Python training code with its parameters, for example: python -u train.py --name exp_name --steps 1200 --batch 16

To make use of Doce, one .slurm script should be run per set of settings, since each set of settings corresponds to one job. So, to run the training for multiple settings, Doce should create one .slurm file for each set of settings, whose last line runs the main Doce script for that setting only: python main.py -c -s setting1=...+setting2=... Note that other lines of the .slurm script may also need to be customized per set of settings, e.g., #SBATCH --output=gpu_mono%j.out and #SBATCH --error=gpu_mono%j.out, which name the files that receive the outputs and errors of the job. It could be useful to name these files after the corresponding set of settings.
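To make the idea concrete, here is a minimal sketch of generating one .slurm file per set of settings. It is not Doce's API: the template contents, the function names (enumerate_settings, selector, safe_name, write_slurm_scripts) and the factor=value "+"-joined selector format are assumptions made for illustration, based only on the command line quoted above.

```python
import itertools
import re
from pathlib import Path
from string import Template

# Hypothetical template: only the #SBATCH output/error lines and the final
# command are taken from the discussion; a real script needs more directives.
SLURM_TEMPLATE = Template("""\
#!/bin/bash
#SBATCH --job-name=${name}
#SBATCH --output=${name}_%j.out
#SBATCH --error=${name}_%j.out
python main.py -c -s ${selector}
""")


def enumerate_settings(factors):
    """Yield one dict per combination of factor values."""
    names = list(factors)
    for values in itertools.product(*(factors[n] for n in names)):
        yield dict(zip(names, values))


def selector(setting):
    """Build the selector string (assumed format: a=1+b=2)."""
    return "+".join(f"{k}={v}" for k, v in setting.items())


def safe_name(setting):
    """Build a file-system friendly name from a setting."""
    raw = "_".join(f"{k}-{v}" for k, v in setting.items())
    return re.sub(r"[^A-Za-z0-9_.-]", "_", raw)


def write_slurm_scripts(factors, out_dir):
    """Write one .slurm file per setting and return the list of paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for setting in enumerate_settings(factors):
        name = safe_name(setting)
        path = out_dir / f"{name}.slurm"
        path.write_text(
            SLURM_TEMPLATE.substitute(name=name, selector=selector(setting))
        )
        paths.append(path)
    return paths
```

Each generated file could then be submitted with sbatch, and its output and error files would carry the name of the setting followed by the job id.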

These output files contain both Python errors and runtime errors, so you might want to adapt Doce's error catching to detect whether a job finished without errors.
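A rough sketch of such a check, scanning the output files for failure markers. The two patterns below (a Python traceback header and a slurmstepd error line) are assumptions about what a failed job leaves in its log, and job_failed and failed_logs are hypothetical helper names, not part of Doce.

```python
import re
from pathlib import Path

# Assumed failure markers; extend this list to match your own logs.
ERROR_PATTERNS = [
    re.compile(r"^Traceback \(most recent call last\):", re.MULTILINE),
    re.compile(r"^slurmstepd: error:", re.MULTILINE),
]


def job_failed(text):
    """Return True if the log text contains any known failure marker."""
    return any(p.search(text) for p in ERROR_PATTERNS)


def failed_logs(log_dir, pattern="*.out"):
    """Return the sorted paths of log files that show a failure."""
    return sorted(
        p for p in Path(log_dir).glob(pattern)
        if job_failed(p.read_text(errors="replace"))
    )
```

A log without any marker is only assumed to be clean: a job killed before writing anything would also pass this check, so combining it with the job's exit state would be safer.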

Future work: it would be nice to add a routine that splits a training run into multiple jobs when the total run time exceeds 20 hours (the Jean-zay limit). I follow this approach for sequential jobs: http://www.idris.fr/jean-zay/cpu/jean-zay-cpu-exec_cascade.html
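One possible way to chain the jobs, sketched with Slurm's --dependency=afterok option so that each job starts only after the previous one succeeded. This is an illustration, not necessarily the method of the linked page: n_jobs and submit_chain are hypothetical names, and the sketch assumes the training script saves checkpoints and resumes from the latest one when restarted.

```python
import math
import subprocess


def n_jobs(total_hours, limit_hours=20):
    """Number of jobs needed to cover total_hours under the time limit."""
    return max(1, math.ceil(total_hours / limit_hours))


def _sbatch(cmd):
    """Run sbatch and return the job id printed by --parsable."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    # --parsable prints "jobid" or "jobid;cluster"
    return out.strip().split(";")[0]


def submit_chain(script, total_hours, limit_hours=20, submit=_sbatch):
    """Submit the same script several times, each job waiting for the
    previous one to finish successfully. Returns the list of job ids."""
    job_ids = []
    for _ in range(n_jobs(total_hours, limit_hours)):
        cmd = ["sbatch", "--parsable"]
        if job_ids:
            cmd.append(f"--dependency=afterok:{job_ids[-1]}")
        cmd.append(script)
        job_ids.append(submit(cmd))
    return job_ids
```

The submit argument makes it possible to dry-run the chain, e.g. by passing a function that prints the command and returns a fake job id.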

mathieulagrange commented 1 year ago

The new version accepts -j filename.

The file should be a Slurm template. An example for Jean Zay is attached: jz.slurm.txt

mathieulagrange commented 1 year ago

A short documentation is available using -h.