nv-legate / legate.core

The Foundation for All Legate Libraries
https://docs.nvidia.com/legate/24.06/
Apache License 2.0

SLURM_TASKS_PER_NODE is not always an integer #918

Closed (manopapad closed 10 months ago)

manopapad commented 10 months ago

We're seeing SLURM_TASKS_PER_NODE reported as a non-integer string, e.g. "1(x2)".

manopapad commented 10 months ago

Apparently this is expected; see https://slurm.schedmd.com/sbatch.html#OPT_SLURM_TASKS_PER_NODE:

SLURM_TASKS_PER_NODE
Number of tasks to be initiated on each node. Values are comma separated and in the same order as SLURM_JOB_NODELIST. If two or more consecutive nodes are to have the same task count, that count is followed by "(x#)" where "#" is the repetition count. For example, "SLURM_TASKS_PER_NODE=2(x3),1" indicates that the first three nodes will each execute two tasks and the fourth node will execute one task.
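For illustration, the format described above could be expanded into one task count per node roughly as follows. This is a minimal sketch based only on the Slurm documentation quoted here; `expand_tasks_per_node` is a hypothetical helper name, not an existing legate.core function.

```python
import re
from typing import List

# One comma-separated entry: a task count, optionally followed by "(x<repeat>)".
_ENTRY = re.compile(r"^(\d+)(?:\(x(\d+)\))?$")


def expand_tasks_per_node(value: str) -> List[int]:
    """Expand a SLURM_TASKS_PER_NODE string into a per-node list of task counts.

    Hypothetical helper for illustration; e.g. "2(x3),1" describes four nodes.
    """
    counts: List[int] = []
    for entry in value.split(","):
        match = _ENTRY.match(entry.strip())
        if match is None:
            raise ValueError(
                f"unrecognized SLURM_TASKS_PER_NODE entry: {entry!r}"
            )
        tasks = int(match.group(1))
        # A missing "(x#)" suffix means the count applies to a single node.
        repeat = int(match.group(2)) if match.group(2) else 1
        counts.extend([tasks] * repeat)
    return counts
```

With the example from the documentation, `expand_tasks_per_node("2(x3),1")` yields `[2, 2, 2, 1]`.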

So in our case, where the launch is symmetric (all nodes execute the same number of ranks/tasks), we should expect it to have the value "1" if using one node or "1(xN)" if using N>1 nodes.
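For the symmetric case specifically, something along these lines might be enough. It is only a sketch of one possible approach, not what the follow-up PR necessarily does; `symmetric_ranks_per_node` is a hypothetical name, and the function deliberately rejects asymmetric values such as "2(x3),1".

```python
import re


def symmetric_ranks_per_node(value: str) -> int:
    """Return the per-node rank count for a symmetric launch.

    Hypothetical helper: accepts "N" or "N(xM)" and rejects anything else,
    including comma-separated (asymmetric) values.
    """
    match = re.fullmatch(r"(\d+)(?:\(x\d+\))?", value.strip())
    if match is None:
        raise ValueError(
            f"expected a symmetric SLURM_TASKS_PER_NODE value, got {value!r}"
        )
    return int(match.group(1))
```

In practice the input would come from the environment, e.g. `symmetric_ranks_per_node(os.environ["SLURM_TASKS_PER_NODE"])`, so both "1" and "1(x2)" resolve to 1 rank per node.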

Merging for now; we can handle this properly in a follow-up PR.