radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number: 1639694

Small job: Sometimes one node, sometimes multiple nodes? #136

Closed lsawade closed 3 years ago

lsawade commented 3 years ago

Hi,

I just noticed some seemingly arbitrary behavior. I'm asking for 5 CPUs and 1 GPU in a test job, and sometimes the pilot requests 2 nodes in SLURM and sometimes a single node. Is that expected?

Cheers,

Lucas

lsawade commented 3 years ago

In my recent submissions, I have only ever encountered 1 node. I don't know whether anything changed on my end. If this continues to be just 1 node and I don't encounter the 2-node behavior again today, I will close the issue.

lee212 commented 3 years ago

FYI,

You can check the generated Slurm script by enabling DEBUG mode in the pilot; the script is the description of your job and defines the node count (e.g. 1 or 2 nodes). First, enable debug mode:

export RADICAL_LOG_LVL="DEBUG"
export RADICAL_PROFILE="TRUE"

Once your job is finished, grep for SBATCH in radical.log in the client sandbox directory (or open the file and search). You may find lines such as:

#SBATCH --nodes=1
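For illustration, a minimal stand-in for that grep step (the file name `sample.log` is hypothetical; on a real run you would grep the actual radical.log in your client sandbox):

```shell
# Create a tiny stand-in log, then grep it the same way you would grep
# the real radical.log in the client sandbox directory.
printf '%s\n' 'other line' '#SBATCH --nodes=1' > sample.log
grep -- '#SBATCH' sample.log   # prints: #SBATCH --nodes=1
```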

Also, the pilot has system profiles used to calculate node counts. For example, Traverse is defined with 32 CPU cores and 4 GPU devices per node here: https://github.com/radical-cybertools/radical.pilot/blob/devel/src/radical/pilot/configs/resource_princeton.json#L16-L17 Your job then requests N nodes based on the CPU/GPU counts in that profile.
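As a hedged sketch (not the pilot's actual code), the node count for this job could be derived from the profile with a ceiling-division-then-max rule, which also shows why 5 CPUs + 1 GPU fits on a single Traverse node:

```shell
# Sketch only: assumes a ceiling-of-max rule; the real logic lives in
# radical.pilot and its resource configuration files.
cpus=5                 # CPU cores requested in the test job
gpus=1                 # GPUs requested
cores_per_node=32      # from the Traverse profile
gpus_per_node=4        # from the Traverse profile

# Ceiling division per resource, then take the larger requirement.
nodes_cpu=$(( (cpus + cores_per_node - 1) / cores_per_node ))
nodes_gpu=$(( (gpus + gpus_per_node - 1) / gpus_per_node ))
nodes=$(( nodes_cpu > nodes_gpu ? nodes_cpu : nodes_gpu ))

echo "nodes=$nodes"    # prints: nodes=1
```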

lsawade commented 3 years ago

Thanks for the tip!

lsawade commented 3 years ago

Not encountered again, closing.