radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK hangs on Traverse when using multiple Nodes #138

Open lsawade opened 3 years ago

lsawade commented 3 years ago

Hi,

I don't know whether this is related to #135. It is weird because I got everything running on a single node, but as soon as I use more than one node, EnTK seems to hang. I checked the submission script and it looks fine to me; so does the node list.

The workflow already hangs in the submission of the first task, which is a single core, single thread task.

EnTK session: re.session.traverse.princeton.edu.lsawade.018666.0003
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018666.0003]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   princeton.traverse       90 cores      12 gpus           ok
All components created
create unit managerUpdate: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SCHEDULED
Update: pipeline.0000.WriteSourcesStage state: SCHEDULED
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
                                                           ok
submit: ########################################################################
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SUBMITTING

[Ctrl + C]

close unit manager                                                            ok
...

Stack

  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           : 
  version              : 3.8.2
  virtualenv           : ve-entk

  radical.entk         : 1.5.12-v1.5.12@HEAD-detached-at-v1.5.12
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.12
  radical.saga         : 1.5.9
  radical.utils        : 1.5.12

Client zip

client.session.zip

Session zip

sandbox.session.zip

andre-merzky commented 2 years ago

Ok I can run things on Traverse using this setup. But there are some things I have learnt: ... For some reason, I cannot request a pool of GPUs and take from it.

I am not sure I appreciate the distinction - isn't 'this setup' also using GPUs from a pool of requested GPUs?

Given the first statement (I can run things on Traverse using this setup), it sounds like we should encode just this in RP to get you running on Traverse, correct?

lsawade commented 2 years ago

Well, I'm not quite sure. It seems to me that if I request #SBATCH --gpus-per-task=1, I already prescribe how many GPUs a task uses, which worries me. Maybe it's a misunderstanding on my end.

andre-merzky commented 2 years ago

This batch script here does not use that directive. The sbatch only needs to provision the right number of nodes - the per_task parameters should not matter (even if you need to specify them in your case for some reason), as we overwrite them in the srun directives anyway?

lsawade commented 2 years ago

Exactly! But this does not seem to work!

#SBATCH -n 4
#SBATCH --gpus-per-task=1

srun -n 4 --gpus-per-task=1 a.o

works;

#SBATCH -n 4
#SBATCH --gpus=4

srun -n 4 --gpus-per-task=1 a.o

does not work!


Unless I'm making a dumb mistake ...

andre-merzky commented 2 years ago

Sorry, I did not work on this further, yet.

andre-merzky commented 2 years ago

Hi @lsawade - I still can't make sense of it and wasn't able to reproduce it on other Slurm clusters :-( But either way, please do give the RS branch fix/traverse (https://github.com/radical-cybertools/radical.saga/pull/840) a try. It now hardcodes the #SBATCH --gpus-per-task=1 for Traverse.

lsawade commented 2 years ago

Hi @andre-merzky - So, I was getting errors in the submission, and I finally had a chance to go through the log. I found the error: the submitted SBATCH script can't work like this:

#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=32
#SBATCH --gpus-per-task=1
#SBATCH -J "pilot.0000"
#SBATCH -D "/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.019013.0000/pilot.0000/"
#SBATCH --output "bootstrap_0.out"
#SBATCH --error "bootstrap_0.err"
#SBATCH --partition "test"
#SBATCH --time 00:20:00

In this case, you are asking for 32 GPUs on a single node (32 tasks x 1 GPU per task, while a Traverse node only has 4 GPUs). I have no solution for this, because the alternative, requesting 4 tasks, seems stupid. And the research computing staff seem immovable when it comes to the Slurm settings on Traverse.
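For comparison, and anticipating the working configuration described further down in this thread, a header that requests the GPUs as a per-node pool rather than per task (assuming 4 GPUs and 128 hardware threads per node, as reported later) would look roughly like this:

#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1
#SBATCH --gres=gpu:4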

andre-merzky commented 2 years ago

We discussed this topic on this week's devel call. At this point we are inclined not to support Traverse: the Slurm configuration on Traverse contradicts the Slurm documentation, and also how other Slurm deployments work. To support Traverse we would basically have to break support on other Slurm resources. We can in principle create a separate slurm_traverse launch method and pilot launcher in RP to accommodate the machine. That, however, is a fair amount of effort - not insurmountable, but still quite some work. Let's discuss on the HPC-Workflows call how to handle this. Maybe there is also a chance to iterate with the admins (although we wanted to stay out of the business of dealing with system admins directly :-/ )

mturilli commented 2 years ago

We will have to write an executor specific to Traverse. This will require allocating specific resources, and we will report back once we have had some internal discussion. RADICAL remains available to discuss the configuration of new machines, in case that is useful/needed. Meanwhile, Lucas is using Summit while waiting for Traverse to become viable with EnTK.

lsawade commented 2 years ago

@andre-merzky

Today I was working on something completely separate, but -- again -- I had issues with Traverse even for an embarrassingly parallel submission. It turned out that there seems to be an issue with how hardware threads are assigned.

If I just ask for --ntasks=5, I do not get 5 physical cores of the Power9 CPU, but rather 4 hardware threads from one core and 1 hardware thread from another. So the CPU pool on Traverse has, by default, size 128. I have to use the following to truly get 5 physical cores:

#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

I will check whether this has an impact on how we are assigning the tasks during submission.
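As a quick way to verify how tasks are bound (a minimal sketch, not part of the original report; it assumes a standard Linux /proc layout and uses only standard Slurm environment variables), each task can report which hardware threads it is allowed to run on:

#!/bin/bash
#SBATCH --ntasks=5
#SBATCH --time=00:02:00

# Each task prints its rank, its node, and the hardware threads it may run on.
srun --ntasks=5 bash -c 'echo "task $SLURM_PROCID on $(hostname): $(grep Cpus_allowed_list /proc/self/status)"'

With a plain --ntasks=5 request, several tasks should report threads of the same physical core; with the three directives above, each task should get the hardware threads of its own core.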

Just an additional example to build understanding:

This

#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

is OK.

This

#SBATCH --nodes=1
#SBATCH --ntasks=33
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

is not OK: 33 tasks x 4 CPUs would need 132 hardware threads, more than the 128 a node provides.

lsawade commented 2 years ago

I have confirmed my suspicions. I have finally found a resource and task description that definitely works. Test scripts are located here: traverse-slurm-repo, but I will summarize below:

The sbatch header:

#!/bin/bash
#SBATCH -t00:05:00
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1
#SBATCH --output=mixed_gpu.txt
#SBATCH --reservation=test
#SBATCH --gres=gpu:4

So, in the sbatch header I'm explicitly asking for 64 tasks (32 per node), where each task has access to 4 cpus. In SLURM language, the Power9 hardware threads apparently count as cpus; hence, each physical core has to be assigned 4 cpus. Then I also specify that each core is only assigned a single task. Finally, instead of implicitly specifying some notion of GPU need, I simply tell Slurm that I want the 4 GPUs in each node with --gres=gpu:4.
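Spelled out, and assuming the 128-hardware-thread pool per node mentioned earlier (i.e. 32 physical cores with 4 hardware threads each), the request adds up as follows:

2 nodes  x 32 physical cores x 4 hardware threads = 256 cpus (in Slurm terms)
64 tasks x 4 cpus-per-task                        = 256 cpus, i.e. 32 tasks per node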

If you want to provide the hostfile you will have to decorate the srun command as follows:

# Define Hostfile
export SLURM_HOSTFILE=<some_hostfile with <N> entries>

# Run command
srun --ntasks=<N> --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 --distribution=arbitrary <my-executable>

dropping the --gpus-per-task if none are needed. Otherwise, if you want to let slurm handle the resource allocation, the following works as well:

srun --ntasks=$1 --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 <my-executable>

again, dropping the --gpus-per-task if none are needed.
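For illustration, a hostfile for --distribution=arbitrary lists one hostname per task, in the order the tasks should be placed; a hypothetical example for <N> = 8 tasks spread four per node over two nodes (node names made up) would be:

traverse-k01g1
traverse-k01g1
traverse-k01g1
traverse-k01g1
traverse-k01g2
traverse-k01g2
traverse-k01g2
traverse-k01g2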

From past experience, I think this should be relatively easy to put into EnTK?

andre-merzky commented 2 years ago

@lsawade - thanks for your patience! In radical-saga and radical-pilot you should now find two branches named fix/issue_138_hpcwf. They hopefully implement the right special cases for Traverse to work as expected. Would you please give them a try? Thank you!

lsawade commented 2 years ago

Will give it a whirl!

lsawade commented 2 years ago

@andre-merzky, I find the branch in pilot, but not in saga. Should I just use fix/traverse for saga?

andre-merzky commented 2 years ago

@lsawade : Apologies, I missed a push for the branch... It should be there now in RS also.

andre-merzky commented 2 years ago

Hey @lsawade - did you have the chance to look into this again?

lsawade commented 2 years ago

Sorry, @andre-merzky , I thought I had updated the issue before I started driving on Friday...

So, the issue persists. An error is still thrown when --cpus_per_task is used, due to the underscores (sbatch only accepts the hyphenated form, --cpus-per-task).


  python               : /home/lsawade/.conda/envs/conda-entk/bin/python3
  pythonpath           : 
  version              : 3.7.12
  virtualenv           : conda-entk

  radical.entk         : 1.14.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.13.0-v1.13.0-149-g211a82593@fix-issue_138_hpcwf
  radical.saga         : 1.13.0-v1.13.0-1-g7a950d53@fix-issue_138_hpcwf
  radical.utils        : 1.14.0
$ cat re.session.traverse.princeton.edu.lsawade.019111.0001/radical.log | grep -b10 ERROR | head -20
```
136162-1651239844.198 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : write: [ 84] [ 82] (cd ~ && "/usr/bin/cp" -v "/tmp/rs_pty_staging_f19k3a1g.tmp" "tmp_jp8rdthi.slurm"\n)
136348-1651239844.202 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : read : [ 84] [ 60] ('/tmp/rs_pty_staging_f19k3a1g.tmp' -> 'tmp_jp8rdthi.slurm'\n)
136511-1651239844.244 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : read : [ 84] [ 1] ($)
136615-1651239844.244 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : copy done: ['/tmp/rs_pty_staging_f19k3a1g.tmp', '$']
136745-1651239844.245 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : flush: [ 83] [ ] (flush pty read cache)
136868-1651239844.346 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : run_sync: sbatch 'tmp_jp8rdthi.slurm'; echo rm -f 'tmp_jp8rdthi.slurm'
137016-1651239844.347 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : write: [ 83] [ 61] (sbatch 'tmp_jp8rdthi.slurm'; echo rm -f 'tmp_jp8rdthi.slurm'\n)
137181-1651239844.352 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : read : [ 83] [ 91] (sbatch: unrecognized option '--cpus_per_task=4'\nTry "sbatch --help" for more information\n)
137375-1651239844.352 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : read : [ 83] [ 36] (rm -f tmp_jp8rdthi.slurm\nPROMPT-0->)
137514-1651239844.352 : radical.saga.cpi : 715866 : 35185202950512 : DEBUG : submit SLURM script (tmp_jp8rdthi.slurm) (0)
137636:1651239844.352 : radical.saga.cpi : 715866 : 35185202950512 : ERROR : NoSuccess: Couldn't get job id from submitted job! sbatch output:
137779-sbatch: unrecognized option '--cpus_per_task=4'
137827-Try "sbatch --help" for more information
137868-rm -f tmp_jp8rdthi.slurm
137893-
137894:1651239844.354 : pmgr_launching.0000 : 715866 : 35184434934128 : ERROR : bulk launch failed
137990-Traceback (most recent call last):
138025- File "/home/lsawade/.conda/envs/conda-entk/lib/python3.7/site-packages/radical/pilot/pmgr/launching/default.py", line 405, in work
138158- self._start_pilot_bulk(resource, schema, pilots)
138211- File "/home/lsawade/.conda/envs/conda-entk/lib/python3.7/site-packages/radical/pilot/pmgr/launching/default.py", line 609, in _start_pilot_bulk
```
mtitov commented 2 years ago

@lsawade hi Lucas, can you please give it another try? That was a typo in the option setup and it has been fixed in that branch, so the stack should look like this:

% radical-stack           

  python               : /Users/mtitov/.miniconda3/envs/test_rp/bin/python3
  pythonpath           : 
  version              : 3.7.12
  virtualenv           : test_rp

  radical.entk         : 1.14.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.14.0-v1.14.0-119-ga6886ca58@fix-issue_138_hpcwf
  radical.saga         : 1.13.0-v1.13.0-9-g1875aa88@fix-issue_138_hpcwf
  radical.utils        : 1.14.0
andre-merzky commented 2 years ago

@lsawade : ping :-)