mila-iqia / mila-docs

Mila technical documentation
https://docs.mila.quebec

Clearly define the difference between srun and salloc #54

Open satyaog opened 3 years ago

satyaog commented 3 years ago

(Screenshot: Screen Shot 2021-08-07 at 5 11 56 PM)

This is confusing: salloc was just defined previously, so saying that it "can also be used" is unclear. Perhaps it'd be better to clearly define the difference between srun and salloc? https://stackoverflow.com/questions/22152400/slurm-what-is-the-difference-for-code-executing-under-salloc-vs-srun

Originally posted by @tesfaldet in https://github.com/mila-iqia/mila-docs/issues/46#issuecomment-894707801

fosterrath-mila commented 2 years ago

This is partially explained by recent PRs to the theoretical section. Maybe this requires more explanatory examples and reference to further docs.

ahmam commented 2 years ago

salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation.
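A minimal sketch of that workflow, for illustration only. The resource values (2 tasks, 10 minutes) are assumptions rather than Mila recommendations, and the `DRY_RUN` guard is a hypothetical convenience: by default the snippet only prints the command line, so it can be run harmlessly without a cluster.

```shell
#!/usr/bin/env bash
# Dry run by default: salloc and srun are replaced by functions that only
# print the command line. Set DRY_RUN=0 to use the real Slurm commands.
if [ "${DRY_RUN:-1}" = "1" ]; then
    salloc() { echo "salloc $*"; }
    srun()   { echo "srun $*"; }
fi

# salloc requests the allocation (illustrative values) and then runs the
# command that follows it inside that allocation. Here the command is srun,
# which launches `hostname` as a job step on the allocated resources.
salloc --ntasks=2 --time=00:10:00 srun hostname
```

When salloc is given no trailing command, it typically starts a shell inside the allocation instead (the exact behaviour depends on the site's Slurm configuration), and srun commands can then be typed into that shell.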

ahmam commented 2 years ago

@satyaog I will add this explanation. If it is clear, I will open a merge request. Difference between salloc and srun:

srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (amount of memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation. Furthermore, srun can also be invoked outside of a job allocation. In that case, srun requests resources and, when those resources are granted, launches tasks across those resources as a single job and job step.

salloc, by contrast, is only used to allocate resources for a job in real time. Typically it is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
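For the docs, the case where srun is invoked outside of a job allocation could be illustrated with something like the sketch below. The resource values are assumptions chosen for illustration, and the `DRY_RUN` guard is hypothetical: by default the command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Dry run by default: srun is replaced by a function that only prints the
# command line. Set DRY_RUN=0 to use the real Slurm command.
if [ "${DRY_RUN:-1}" = "1" ]; then
    srun() { echo "srun $*"; }
fi

# Run outside any allocation (for example from a login node): srun itself
# requests the resources, waits until they are granted, and then runs
# `hostname` as a single job with a single job step.
srun --ntasks=1 --cpus-per-task=2 --mem=4G --time=00:10:00 hostname
```

No salloc is needed here because srun creates the allocation on its own and releases it when the command finishes.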

satyaog commented 2 years ago

@ahmam Thanks, this looks good to me.

tesfaldet commented 2 years ago

sbatch and salloc allocate resources to a job, while srun launches parallel tasks across those resources. When invoked within a job allocation, srun will launch parallel tasks across some or all of the allocated resources. In that case, srun inherits by default the pertinent options of the sbatch or salloc which it runs under. You can then (usually) provide srun different options which will override what it receives by default. Each invocation of srun within a job is known as a job step.

srun can also be invoked outside of a job allocation. In that case, srun requests resources, and when those resources are granted, launches tasks across those resources as a single job and job step.

^Taken from the slurm-users mailing list
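The inheritance and override behaviour described in the quoted text could be sketched as a small batch script. The file name `steps.sh` and the resource values are hypothetical, and the `DRY_RUN` guard is an added convenience: by default the srun lines are only printed.

```shell
#!/usr/bin/env bash
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
# Hypothetical batch script, submitted with `sbatch steps.sh`.

# Dry run by default: srun is replaced by a function that only prints the
# command line. Set DRY_RUN=0 to use the real Slurm command.
if [ "${DRY_RUN:-1}" = "1" ]; then
    srun() { echo "srun $*"; }
fi

# First job step: no options given, so srun inherits the pertinent settings
# of the allocation created by sbatch (here, 4 tasks).
srun hostname

# Second job step: --ntasks overrides the inherited value for this step only.
srun --ntasks=2 hostname
```

Assuming the default environment propagation of sbatch, submitting with `DRY_RUN=0 sbatch steps.sh` would run the real job steps; each srun line then shows up as a separate step of the same job.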