radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK hangs on Traverse when using multiple Nodes #138

Open lsawade opened 3 years ago

lsawade commented 3 years ago

Hi,

I don't know whether this is related to #135. It is weird because I got everything running on a single node, but as soon as I use more than one node, EnTK seems to hang. I checked the submission script and it looks fine to me; so did the node list.

The workflow already hangs in the submission of the first task, which is a single core, single thread task.

EnTK session: re.session.traverse.princeton.edu.lsawade.018666.0003
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018666.0003]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   princeton.traverse       90 cores      12 gpus           ok
All components created
create unit managerUpdate: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SCHEDULED
Update: pipeline.0000.WriteSourcesStage state: SCHEDULED
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
                                                           ok
submit: ########################################################################
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SUBMITTING

[Ctrl + C]

close unit manager                                                            ok
...

Stack

  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           : 
  version              : 3.8.2
  virtualenv           : ve-entk

  radical.entk         : 1.5.12-v1.5.12@HEAD-detached-at-v1.5.12
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.12
  radical.saga         : 1.5.9
  radical.utils        : 1.5.12

Client zip

client.session.zip

Session zip

sandbox.session.zip

andre-merzky commented 3 years ago

Hi @lsawade - this is a surprising one. The task stderr shows:

$ cat *err
srun: Job 126172 step creation temporarily disabled, retrying (Requested nodes are busy)

This one does look like a slurm problem. Is this reproducible?

lsawade commented 3 years ago

Reproduced! The step-creation message appears after a while: I kept checking the task's error file, and eventually the message showed up!

andre-merzky commented 3 years ago

@lsawade , would you please open a ticket with Traverse support? Maybe our srun command is not well-formed for Traverse's Slurm installation? Please include the srun command:

/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodelist=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018666.0003/pilot.0000/unit.000000//unit.000000.nodes --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params" 

and the nodelist file which just contains:

traverse-k04g10
lsawade commented 3 years ago

It throws the following error:

srun: error: Unable to create step for job 126202: Requested node configuration is not available

If I take out the nodelist argument, it runs

andre-merzky commented 3 years ago

Hmm, is that node name not valid somehow?

lsawade commented 3 years ago

I tried running it with the nodename as a string and that worked

/usr/bin/srun --nodelist=traverse-k05g10 --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"

Note that I'm using salloc and hence a different nodename

lsawade commented 3 years ago

I found the solution. When SLURM takes in a file for a nodelist, one has to use the node file option:

/usr/bin/srun --nodefile=nodelistfile --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"
andre-merzky commented 3 years ago

Oh! Thanks for tracking that down, we'll fix this!

lsawade commented 3 years ago

It is puzzling, though, that srun doesn't throw an error. When I do it by hand, srun throws an error when I feed a nodelist file to the --nodelist= option.

andre-merzky commented 3 years ago

@lsawade : the fix has been released, please let us know if that problem still happens!

lsawade commented 3 years ago

@andre-merzky, will test!

lsawade commented 3 years ago

Sorry for the extraordinarily late feedback, but the issue seems to persist. It already hangs on the Hello World task. Did I update correctly?


My stack:

```bash
  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           :
  version              : 3.8.2
  virtualenv           : ve-entk

  radical.entk         : 1.6.0
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.2
  radical.saga         : 1.6.1
  radical.utils        : 1.6.2
```

My script:

```python
from radical.entk import Pipeline, Stage, Task, AppManager
import traceback, sys, os

hostname = os.environ.get('RMQ_HOSTNAME', 'localhost')
port     = int(os.environ.get('RMQ_PORT', 5672))
password = os.environ.get('RMQ_PASSWORD', None)
username = os.environ.get('RMQ_USERNAME', None)

specfem = "/scratch/gpfs/lsawade/MagicScripts/specfem3d_globe"

if __name__ == '__main__':

    p = Pipeline()

    # Hello World ########################################################
    test_stage = Stage()
    test_stage.name = "HelloWorldStage"

    # Create 'Hello world' task
    t = Task()
    t.cpu_reqs = {'cpu_processes': 1, 'cpu_process_type': None,
                  'cpu_threads': 1, 'cpu_thread_type': None}
    t.pre_exec = ['module load openmpi/gcc']
    t.name = "HelloWorldTask"
    t.executable = '/bin/echo'
    t.arguments = ['Hello world!']
    t.download_output_data = ['STDOUT', 'STDERR']

    # Add task to stage and stage to pipeline
    test_stage.add_tasks(t)
    p.add_stages(test_stage)

    ######################################################################
    specfem_stage = Stage()
    specfem_stage.name = 'SimulationStage'

    for i in range(2):

        # Create Task
        t = Task()
        t.name = f"SIMULATION.{i}"
        tdir = f"/home/lsawade/simple_entk_specfem/specfem_run_{i}"
        t.pre_exec = [
            # Load necessary modules
            'module load openmpi/gcc',
            'module load cudatoolkit/11.0',
            # Change to your specfem run directory
            f'rm -rf {tdir}',
            f'mkdir {tdir}',
            f'cd {tdir}',
            # Create data structure in place
            f'ln -s {specfem}/bin .',
            f'ln -s {specfem}/DATABASES_MPI .',
            f'cp -r {specfem}/OUTPUT_FILES .',
            'mkdir DATA',
            f'cp {specfem}/DATA/CMTSOLUTION ./DATA/',
            f'cp {specfem}/DATA/STATIONS ./DATA/',
            f'cp {specfem}/DATA/Par_file ./DATA/'
        ]
        t.executable = './bin/xspecfem3D'
        t.cpu_reqs = {'cpu_processes': 4, 'cpu_process_type': 'MPI',
                      'cpu_threads': 1, 'cpu_thread_type': 'OpenMP'}
        t.gpu_reqs = {'gpu_processes': 4, 'gpu_process_type': 'MPI',
                      'gpu_threads': 1, 'gpu_thread_type': 'CUDA'}
        t.download_output_data = ['STDOUT', 'STDERR']

        # Add task to stage
        specfem_stage.add_tasks(t)

    p.add_stages(specfem_stage)

    res_dict = {
        'resource': 'princeton.traverse',  # 'local.localhost',
        'schema'  : 'local',
        'walltime': 20,   # 2 * 30,
        'cpus'    : 16,   # 2 * 10 * 1,
        'gpus'    : 8,    # 2 * 4 * 2,
    }

    appman = AppManager(hostname=hostname, port=port, username=username,
                        password=password, resubmit_failed=False)
    appman.resource_desc = res_dict
    appman.workflow = set([p])
    appman.run()
```

Tarball:

sandbox.tar.gz

andre-merzky commented 3 years ago

Bugger... - the code though is using --nodefile=:

$ grep srun task.0000.sh
task.0000.sh:/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodefile=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018719.0005/pilot.0000/task.0000//task.0000.nodes --export=ALL,NODE_LFS_PATH="/tmp" /bin/echo "Hello world!" 

but that task indeed never returns. Does that line work on an interactive node? FWIW, task.0000.nodes contains:

$ cat task.0000.nodes 
traverse-k02g1
lsawade commented 3 years ago

Yes, in interactive mode, after changing the nodefile to the node I land on, it works.

Edit: In my interactive job I'm using one node only, let me try with two...

Update

It also works when using the two nodes in the interactive job and editing task.0000.nodes to contain one of the accessible nodes. Either node works, so this does not seem to be the problem.

andre-merzky commented 3 years ago

Hmm, where does that leave us... - so it is not the srun command format which is at fault after all?

Can you switch your workload to, say, /bin/date to make sure we are not looking at the wrong place, and that the application code behaves as expected when we run under EnTK?

lsawade commented 3 years ago

> Would you mind running one more test: interactively get two nodes, and run the command targeting the node other than the one you land on.

See Update above

> You should see the allocated nodes via cat $SLURM_NODEFILE or something like that (env | grep SLURM will be helpful)

echo $SLURM_NODELIST works; I don't seem to have the nodefile environment variable.
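
For reference, a minimal way to list the allocated nodes from inside a Slurm allocation (SLURM_JOB_NODELIST is the standard variable; the scontrol call is the same one used in the scripts later in this thread):

```bash
# Compact node list as set by Slurm inside an allocation, e.g. traverse-k04g[7,9]
echo "$SLURM_JOB_NODELIST"

# Expanded form, one host name per line
scontrol show hostnames "$SLURM_JOB_NODELIST"
```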

lsawade commented 3 years ago

What do you mean by switching my workload to /bin/date?

lsawade commented 3 years ago

I also tested running the entire task.0000.sh in interactive mode, and it had no problem.

mturilli commented 3 years ago

Slurm on Traverse seems to be working in a strange way. Lucas is in contact with the research service at Princeton.

lsawade commented 3 years ago

Two things that have come up:

  1. The srun command needs a -G0 flag (no GPUs) if a non-GPU task is executed within a resource set that contains GPUs. The command only hangs if the resource set contains GPUs; otherwise it runs.
  2. Make sure your print statement does not contain an unescaped !: my hello-world task also ran into issues because I didn't properly escape the ! in "Hello, World!". Use "Hello, World\!" instead (see the small illustration after this list). facepalm
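
A minimal illustration of the ! pitfall, assuming an interactive bash shell with history expansion enabled (behavior differs in batch scripts):

```bash
echo "Hello, World!"     # may fail with "event not found" due to history expansion
echo "Hello, World\!"    # escaped form, as suggested above
echo 'Hello, World!'     # single quotes also prevent history expansion
```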
lsawade commented 3 years ago

Most of the quick debugging discussion was held on Slack, but here is a summary for posterity: @andre-merzky published a quick fix for the srun command on one of the RP branches (https://github.com/radical-cybertools/radical.pilot/commit/aee4fb8862fa4fbf55589a23a9cc0c66ee839d40), but an error is raised by EnTK when it calls into the pilot.


Error

```
EnTK session: re.session.traverse.princeton.edu.lsawade.018720.0001
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018720.0001]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
All components terminated
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 147, in _submit_resource_request
    self._pmgr = rp.PilotManager(session=self._session)
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/pilot/pilot_manager.py", line 93, in __init__
    self._pilots_lock = ru.RLock('%s.pilots_lock' % self._uid)
AttributeError: 'PilotManager' object has no attribute '_uid'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 428, in run
    self._rmgr._submit_resource_request()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 194, in _submit_resource_request
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: 'PilotManager' object has no attribute '_uid'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "solver.py", line 104, in <module>
    appman.run()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 459, in run
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: 'PilotManager' object has no attribute '_uid'
```

Stack

```
  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           :
  version              : 3.8.2
  virtualenv           : ve-entk

  radical.entk         : 1.6.0
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.2-v1.6.2-78-gaee4fb886@fix-hpc_wf_138
  radical.saga         : 1.6.1
  radical.utils        : 1.6.2
```
andre-merzky commented 3 years ago

My apologies, that error is now fixed in RP.

lsawade commented 3 years ago

Getting a new one again!

EnTK session: re.session.traverse.princeton.edu.lsawade.018720.0008
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018720.0008]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   princeton.traverse       16 cores       8 gpus           ok
closing session re.session.traverse.princeton.edu.lsawade.018720.0008          \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 16.1s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
All components terminated
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 428, in run
    self._rmgr._submit_resource_request()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 177, in _submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/home/lsawade/thirdparty/python/radical.pilot/src/radical/pilot/pilot.py", line 558, in wait
    time.sleep(0.1)
KeyboardInterrupt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "solver.py", line 104, in <module>
    appman.run()        
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 453, in run
    raise KeyboardInterrupt from ex
KeyboardInterrupt
lsawade commented 3 years ago

So the way I install entk and the pilot at the moment is as follows:

# Install EnTK
conda create -n conda-entk python=3.7 -c conda-forge -y
conda activate conda-entk
pip install radical.entk

(Note that I'm not changing the pilot here; I just keep the default one.)

Then, I get the radical.pilot repo to create the static ve.rp. Log out, log in,

# Create environment
module load anaconda3
conda create -n ve -y python=3.7
conda activate ve

# Install Pilot
git clone git@github.com:radical-cybertools/radical.pilot.git
cd radical.pilot
pip install .

# Create static environment
./bin/radical-pilot-create-static-ve -p /scratch/gpfs/$USER/ve.rp/

Log out, Log in:

conda activate conda-entk
python workflow.py
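
As a quick sanity check after these steps, the radical-stack helper (installed with radical.utils) prints the same version listing that appears in the "Stack" sections above:

```bash
conda activate conda-entk
radical-stack    # shows python, virtualenv, and radical.* package versions
```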
lsawade commented 3 years ago

Is there any news here?

lsawade commented 3 years ago

@andre-merzky ?

lsawade commented 3 years ago

Alright, I got the workflow manager to -- at least -- run. Not hanging, yay

One of the issues is that when I create the static environment using radical-pilot-create-static-ve, it does not install any dependencies, so I installed all requirements into the ve.rp.

However, I'm sort of back to square one. A serial task executes, task.0000.out has "Hello World" in it, and the log shows that task.0000 returns with a 0 exit code, but the workflow manager marks it as failed, and task.0000.err contains the following line:

cpu-bind=MASK - traverse-k01g10, task 0 0 [140895]: mask 0xf set

I'll attach the tarball.

It is also important to note that the manager seems to stop scheduling other jobs once the first task fails. I wasn't able to find anything about it in the log.


sandbox.tar.gz

lsawade commented 3 years ago

@andre-merzky ping

mtitov commented 3 years ago

Hi @lsawade, that message in the err file looks like a verbose message and doesn't indicate an error.

And some additional comments: (a) radical-pilot-create-static-ve: for dependencies there is an extra option -d (to set the default modules); (b) if you run your workflow on a shared FS (using the same virtual env for client and pilot), then you can set that in the resource config, e.g.:

        "python_dist"                 : "anaconda",  # there are two options: "default" or "anaconda" (for conda env)
        "virtenv_mode"                : "use",
        "virtenv"                     : <name/path>,  # better to use a full path
        "rp_version"                  : "installed",  # if RCT packages are pre-installed

@andre-merzky, just as a side comment: with pre-set resource configs, should we set the default value for python_dist to anaconda? (since there is module load anaconda3 in pre_bootstrap_0)

lsawade commented 3 years ago

Hi @mtitov, thanks for getting back to me. Aah, I missed that when installing the static ve.

> that message in the err file looks like a verbose message and doesn't indicate an error

That's what I thought, too. I mean, the task finishes successfully (STDOUT is fine); it just flags itself as failed when I run the appmanager. So I'm a bit unsure why the task fails.

mtitov commented 3 years ago

Yeah, that's what I missed: the task has the final state FAILED, reached after TMGR_STAGING_OUTPUT, so I assume something went wrong on the client side. @lsawade can you please attach the client sandbox as well?

lsawade commented 3 years ago

Sorry I only saw the notification now, attached the corresponding client sandbox.


client_sandbox.tar.gz

mtitov commented 3 years ago

Hi @lsawade, thank you for the sandbox. It looks like the issue is with the name of the output: by default, RP sets the name of the task output to <task_uid>.out (and similarly for the err-file); before, we had it as STDOUT for all tasks. For now, if you want to collect the corresponding outputs without using task ids, the output file name can be set explicitly:

t = Task()
t.stdout = 'STDOUT'
...
t.download_output_data = ['STDOUT']

(*) With your run everything went fine; just at the end the TaskManager couldn't collect STDOUT.

lsawade commented 3 years ago

Lord, if that ends up being the final issue, that would be wild... Let me test this later today, and I will get back to you!

lsawade commented 3 years ago

So, I tested things yesterday, and they seem to work out! There is one catch that is probably solvable. When I need GPUs from different nodes, I think the mpirun in the task.000x.sh has to fail because it does not know which GPUs to use. Meaning, I want to run 2 instances of specfem simultaneously, each needing 6 GPUs, but I only have 4 GPUs per node and am running on 3 nodes (12 GPUs total). That means there is an overlap in nodes, which I don't think mpirun can handle by itself.

Task 1:

mpirun  -np 6  -host traverse-k04g9,traverse-k04g9,traverse-k05g10,traverse-k05g10,traverse-k05g10,traverse-k05g10 -x ...

Task 2:

mpirun  -np 6  -host traverse-k04g7,traverse-k04g7,traverse-k04g7,traverse-k04g7,traverse-k04g9,traverse-k04g9  -x ... 

Note that both use traverse-k04g9, but nowhere in the rest of the command is it specified which GPU is supposed to be used, and neither task ever executes.

lsawade commented 3 years ago

Update:

I tried to run the mpirun line in interactive mode and it hangs. I do not know why. It even hangs when I do not specify the nodes. But(!), this one does not:

srun -n 6 --gpus-per-task=1 ./bin/xspecfem3D

lsawade commented 3 years ago

Just a quick update. I'm still looking for a workaround here and am in contact with the research computing people here.

srun -n 6 --gpus=6 <some test task>

works, but when I do

srun -n 6 --gpus=6 ./bin/xspecfem3D

it doesn't. Very curious, but I'm on it, and will put more info here eventually.

lsawade commented 3 years ago

Just a quick update: the commands described above are executed differently depending on the cluster at hand at Princeton, meaning it will be hard to generalize Slurm submission. I have been talking to people from PICSciE; there is no obvious solution right now. I will get back here again once I have more info.

mturilli commented 3 years ago

@lsawade to provide an example that we can test on other clusters with SLURM.

lsawade commented 3 years ago

I cannot test whether this would work, but below is an example that I expect to work if Slurm is configured correctly.

The jobs submit, just not in parallel. This submission setup is for 3 nodes, where each node has 4 GPUs, and two GPU-requiring sruns have to be executed, each with 6 tasks and 1 GPU per task. For this setup to run in parallel, the two sruns would have to share a node.

Let

#!/bin/bash
#SBATCH -t00:05:00
#SBATCH --gpus 12
#SBATCH -n 12
#SBATCH --output=mixed_gpu.txt

module load openmpi/gcc cudatoolkit

srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 0 &
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 1 &

wait

where show_devices.sh:

#!/bin/bash

echo Script $1
echo JOB $SLURM_JOB_ID STEP $SLURM_STEP_ID 
echo $CUDA_VISIBLE_DEVICES
sleep 60

Output of mixed_gpu should look somewhat like this:

```
Script 0
Script 0
JOB 203751 STEP 0
JOB 203751 STEP 0
0,1
0,1
Script 0
Script 0
Script 0
JOB 203751 STEP 0
0,1
JOB 203751 STEP 0
0,1
Script 0
JOB 203751 STEP 0
JOB 203751 STEP 0
0,1
0,1
srun: Step created for job 203751
Script 1
Script 1
JOB 203751 STEP 1
Script 1
JOB 203751 STEP 1
0,1
0,1
Script 1
JOB 203751 STEP 1
JOB 203751 STEP 1
0,1
0,1
Script 1
Script 1
JOB 203751 STEP 1
JOB 203751 STEP 1
0,1
0,1
```

and the job steps 203755.0 and 203755.1 should start at roughly the same time, unlike here:

sacct --format JobID%20,Start,End,Elapsed,ReqCPUS,JobName%20, -j 203755
```
               JobID               Start                 End    Elapsed  ReqCPUS              JobName
-------------------- ------------------- ------------------- ---------- -------- --------------------
              203755 2021-06-25T14:51:42 2021-06-25T14:53:45   00:02:03        4         testslurm.sh
        203755.batch 2021-06-25T14:51:42 2021-06-25T14:53:45   00:02:03        4                batch
       203755.extern 2021-06-25T14:51:42 2021-06-25T14:53:45   00:02:03        8               extern
            203755.0 2021-06-25T14:51:43 2021-06-25T14:52:44   00:01:01        8      show_devices.sh
            203755.1 2021-06-25T14:52:44 2021-06-25T14:53:45   00:01:01        8      show_devices.sh
```
lsawade commented 2 years ago

ping

andre-merzky commented 2 years ago

Hi @lsawade, I will have time on Friday to work on this and hope to have results back before our call.

andre-merzky commented 2 years ago

Hey @lsawade - the reason for the behavior eludes me completely. I can confirm that the same is observed on at least one other Slurm cluster (Expanse @ SDSC), and I opened a ticket there to hopefully get some useful feedback. At the moment I simply don't know how we can possibly resolve this. I am really sorry for that; I understand that this has been blocking progress for several months now :-/

lsawade commented 2 years ago

Yeah, I have had a really long thread with the people from the research computing group, and they did not understand why this is not working either. Maybe we should contact the Slurm people?

andre-merzky commented 2 years ago

Yes, I think we should resort to that. I'll open a ticket if the XSEDE support is not able to suggest a solution within a week.

andre-merzky commented 2 years ago

We got some useful feedback from XSEDE after all: Slurm indeed seems to be unable to do correct auto-placement for non-node-local tasks. I find this surprising, and it may still be worthwhile to open a Slurm ticket about this. Either way, a workaround is to start the job step with a specific node file. From your example above:

srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 0 &
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 1 &

should work as expected with

export SLURM_HOSTFILE=host1.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 0 &
export SLURM_HOSTFILE=host2.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 1 &

where the host file looks like, for example:

$ cat host2.list
exp-1-57
exp-1-57
exp-6-58
exp-6-58
exp-6-58
exp-6-58

Now, that brings us back to RP / EnTK: we actually do use a hostfile, we just miss the --distribution=arbitrary flag. Before we include that, could you please confirm that the above does in fact work on Traverse?

lsawade commented 2 years ago

Hi @andre-merzky,

I have been playing with this and I can't seem to get it to work. I explain what I do here: https://github.com/lsawade/slurm-job-step-shared-res

I'm not sure whether it's me or Traverse.

Can you adjust this mini example to see whether it runs on XSEDE? Things you would have to change are the automatic writing of the hostfile and how many tasks per job step. If you give me the hardware setup of XSEDE, I could also adjust the script and give you something that should run out of the box to check.

andre-merzky commented 2 years ago

The hardware setup on Expanse is really similar to Traverse: 4 GPUs/node.

I pasted something incorrect above, apologies! Too many scripts lying around :-/ The --gpus=6 flag was missing. Here should be the correct one, showing the same syntax working for both cyclic and block:

This is the original script:

```sh
$ cat test2.slurm
#!/bin/bash
#SBATCH -t00:10:00
#SBATCH --account UNC100
#SBATCH --nodes 3
#SBATCH --gpus 12
#SBATCH -n 12
#SBATCH --output=test2.out
#SBATCH --error=test2.out

my_srun() {
    export SLURM_HOSTFILE="$1"
    srun -n 6 --gpus=6 --cpus-per-task=1 --gpus-per-task=1 --distribution=arbitrary show_devices.sh
}

cyclic() {
    scontrol show hostnames "${SLURM_JOB_NODELIST}" >  host1.cyclic.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" >> host1.cyclic.list

    scontrol show hostnames "${SLURM_JOB_NODELIST}" >  host2.cyclic.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" >> host2.cyclic.list

    my_srun host1.cyclic.list > cyclic.1.out 2>&1 &
    my_srun host2.cyclic.list > cyclic.2.out 2>&1 &

    wait
}

block() {
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >  host1.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host1.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host1.block.list

    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >  host2.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host2.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
    scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list

    my_srun host1.block.list > block.1.out 2>&1 &
    my_srun host2.block.list > block.2.out 2>&1 &

    wait
}

block
cyclic
```

These are the resulting node files:

```sh
$ for f in *list; do echo $f; cat $f; echo; done
host1.block.list
exp-6-57
exp-6-57
exp-6-57
exp-6-57
exp-6-59
exp-6-59

host1.cyclic.list
exp-6-57
exp-6-59
exp-10-58
exp-6-57
exp-6-59
exp-10-58

host2.block.list
exp-6-59
exp-6-59
exp-10-58
exp-10-58
exp-10-58
exp-10-58

host2.cyclic.list
exp-6-57
exp-6-59
exp-10-58
exp-6-57
exp-6-59
exp-10-58
```

and these the resulting outputs:

```sh
$ for f in *out; do echo $f; cat $f; echo; done
block.1.out
6664389.1.2 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.1 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.0 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.3 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.5 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.1.4 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.1.2 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.1 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.0 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.3 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.4 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.5 STOP Mon Oct 25 02:54:52 PDT 2021

block.2.out
6664389.0.1 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.0.2 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.0 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.0.3 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.5 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.4 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.0 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.1 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.4 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.3 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.5 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.2 STOP Mon Oct 25 02:54:52 PDT 2021

cyclic.1.out
6664389.2.3 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.2.2 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.2.4 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.2.5 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.2.0 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.2.1 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.2.3 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.2 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.4 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.0 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.1 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.5 STOP Mon Oct 25 02:55:02 PDT 2021

cyclic.2.out
6664389.3.3 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.3.5 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.3.4 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.3.0 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.3.2 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.3.1 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.3.5 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.3 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.4 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.2 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.1 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.0 STOP Mon Oct 25 02:55:02 PDT 2021
```
lsawade commented 2 years ago

So, I have some good news: I have also tested this on Andes, and it definitely works there as well. I added a batch_andes.sh batch script to the repo to test the arbitrary distribution for cyclic and block, with nodes [1,2], [1,2] and [1], [1,2,2], respectively.

The annoying news is that it does not seem to work on Traverse. At least I was able to check that it's not a user error...

So, how do we proceed? I'm sure it's a setting in the Slurm setup. Do we open a ticket with the Andes/Expanse support? I'll for sure open a ticket with PICSciE and see whether they can find a solution.

UPDATE:

The unexpected/unwanted output on Traverse:

```bash
block.1.out
srun: Job 258710 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 258710
258710.3.3 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.2 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.0 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.1 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.5 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g3: 0
258710.3.4 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g3: 0
258710.3.0 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.1 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.2 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.3 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.4 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.5 STOP Mon Oct 25 19:41:25 EDT 2021

block.2.out
258710.2.0 START Mon Oct 25 19:39:24 EDT 2021 @ traverse-k05g3: 0
258710.2.1 START Mon Oct 25 19:39:24 EDT 2021 @ traverse-k05g3: 0
258710.2.0 STOP Mon Oct 25 19:40:24 EDT 2021
258710.2.1 STOP Mon Oct 25 19:40:24 EDT 2021

cyclic.1.out
258710.0.1 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g3: 0
258710.0.0 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g2: 0
258710.0.3 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g3: 0
258710.0.2 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g2: 0
258710.0.1 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.3 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.0 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.2 STOP Mon Oct 25 19:38:23 EDT 2021

cyclic.2.out
srun: Job 258710 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 258710
258710.1.0 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g2: 0
258710.1.1 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g3: 0
258710.1.2 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g2: 0
258710.1.3 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g3: 0
258710.1.0 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.2 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.1 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.3 STOP Mon Oct 25 19:39:24 EDT 2021
```

Does it almost look like there is a misunderstanding between Slurm and CUDA, i.e. the visible devices should not all end up in CUDA_VISIBLE_DEVICES?

PS: I totally stole the way you made the block and cyclic functions as well as the printing. Why did I not think of that...?

lsawade commented 2 years ago

Ok I can run things on Traverse using this setup. But there are some things I have learnt:

On Traverse, to not give a job step the entire CPU affinity of the involved nodes, I have to use the --exclusive flag in srun, which indicates that certain CPUs/cores are used exclusively by that job step and nothing else.

Furthermore, I cannot use --cpus-per-task=1. Which makes a lot of sense, and the CPU affinity prints should have rung a bell for me. I feel dense.

So, at request time, I ask SBATCH for resources like so:

#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1

and then

srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive script.sh

or even

srun --ntasks=4 --distribution=arbitrary --exclusive script.sh

would work.

What does not work is the following:

...
#SBATCH -n 8
#SBATCH -G 8
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive 

For some reason, I cannot request a pool of GPUs and take from it.
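
Putting the working pieces together with the hostfile workaround from earlier in the thread, a minimal sketch of two concurrent job steps might look as follows (host1.list and host2.list with one node name per task, and the show_devices.sh helper from above, are assumed to exist; this is an illustration under those assumptions, not a tested recipe):

```bash
# Each step gets its own hostfile; --distribution=arbitrary makes srun honor it,
# and --exclusive keeps the steps from claiming each other's cores.
export SLURM_HOSTFILE=host1.list
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive show_devices.sh 0 &

export SLURM_HOSTFILE=host2.list
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive show_devices.sh 1 &

wait
```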