For now, I launch jobs on Jean Zay using
sbatch script.slurm
with script.slurm containing the code found here: http://www.idris.fr/jean-zay/gpu/jean-zay-gpu-exec_mono_batch.html
The last line of this script runs my training Python code with its parameters, such as:
python -u train.py --name exp_name --steps 1200 --batch 16
To make use of Doce, a specific .slurm script should be run for each specific set of settings, which corresponds to one particular job. So, in order to run the training for multiple settings, Doce should create one .slurm file per set of settings, in which the last line runs the main Doce script for that single setting:
python main.py -c -s setting1=...+setting2=...
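A minimal sketch of that generation step is given below. The SBATCH header is only a placeholder loosely inspired by the mono-GPU template linked above (it omits module loading and other site-specific lines), and the settings grid, file naming, and output directory are assumptions for illustration, not Doce's actual API:

```python
import itertools
from pathlib import Path

# Hypothetical settings grid; in practice this would come from the Doce experiment plan.
settings_grid = {"setting1": ["a", "b"], "setting2": [1, 2]}

# Placeholder header: not the exact IDRIS template, just the shape of it.
SLURM_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={name}
#SBATCH --output={name}_%j.out
#SBATCH --error={name}_%j.err
#SBATCH --gres=gpu:1
#SBATCH --time=20:00:00

python main.py -c -s {selector}
"""

out_dir = Path("slurm_jobs")
out_dir.mkdir(exist_ok=True)
for values in itertools.product(*settings_grid.values()):
    # Build the Doce selector, e.g. setting1=a+setting2=1, and a matching file name.
    selector = "+".join(f"{k}={v}" for k, v in zip(settings_grid.keys(), values))
    name = selector.replace("+", "_").replace("=", "-")
    script_path = out_dir / f"{name}.slurm"
    script_path.write_text(SLURM_TEMPLATE.format(name=name, selector=selector))
    # Each generated file can then be submitted with: sbatch slurm_jobs/<name>.slurm
```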
Note that other lines of the .slurm script might be customized for a given set of settings, e.g.
#SBATCH --output=gpu_mono%j.out
and
#SBATCH --error=gpu_mono%j.out
which receive the standard output and the errors of the job. It could be interesting to name them after the corresponding set of settings. These output files contain both Python and runtime errors, so you might want to adapt Doce's error catching to know whether the job finished without any error or not.
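As a rough sketch of such a check (the file naming scheme and the error markers below are assumptions to adapt to what actually appears in the Jean Zay logs, not part of Doce), one could simply scan the job's .out/.err files:

```python
from pathlib import Path

# Markers that usually indicate a failed job; this list is an assumption.
ERROR_MARKERS = ("Traceback (most recent call last)", "srun: error", "CANCELLED", "out of memory")

def job_failed(log_dir, setting_name):
    """Return True if the log files of a given setting contain an error marker.

    Assumes the files were named after the setting, e.g. <setting_name>.out / .err.
    """
    for suffix in (".out", ".err"):
        log_file = Path(log_dir) / f"{setting_name}{suffix}"
        if not log_file.exists():
            return True  # missing log: treat the job as not finished cleanly
        text = log_file.read_text(errors="ignore")
        if any(marker in text for marker in ERROR_MARKERS):
            return True
    return False
```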
Further work: it could be nice to add a routine that splits a training into multiple jobs when the total script time exceeds 20 hours (the Jean Zay limit). I follow this way of doing sequential jobs: http://www.idris.fr/jean-zay/cpu/jean-zay-cpu-exec_cascade.html
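One possible sketch of such a routine is below. It is not the exact cascade recipe from the IDRIS page; it simply chains submissions with SLURM's --dependency=afterok flag and assumes the training script resumes from its last checkpoint on each run:

```python
import subprocess

def submit_chained_jobs(slurm_script, n_chunks):
    """Submit n_chunks jobs where each job starts only after the previous one succeeds.

    Assumes slurm_script resumes training from the last checkpoint on each run.
    """
    previous_job_id = None
    for _ in range(n_chunks):
        cmd = ["sbatch", "--parsable"]
        if previous_job_id is not None:
            # Only start this chunk once the previous chunk finished without error.
            cmd.append(f"--dependency=afterok:{previous_job_id}")
        cmd.append(slurm_script)
        # --parsable makes sbatch print only the job id, which we reuse for chaining.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        previous_job_id = result.stdout.strip().split(";")[0]
    return previous_job_id
```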