Hi @lsawade - this is a surprising one. The task stderr shows:
$ cat *err
srun: Job 126172 step creation temporarily disabled, retrying (Requested nodes are busy)
This one does look like a slurm problem. Is this reproducible?
Reproduced! The step-creation message appears after a while - meaning I continuously checked the task's error file, and eventually the message showed up!
@lsawade, would you please open a ticket with Traverse support? Maybe our srun command is not well-formed for Traverse's Slurm installation? Please include the srun command:
/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodelist=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018666.0003/pilot.0000/unit.000000//unit.000000.nodes --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"
and the nodelist file which just contains:
traverse-k04g10
It throws the following error:
srun: error: Unable to create step for job 126202: Requested node configuration is not available
If I take out the nodelist argument, it runs.
Hmm, is that node name not valid somehow?
I tried running it with the nodename as a string, and that worked:
/usr/bin/srun --nodelist=traverse-k05g10 --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"
Note that I'm using salloc and hence a different nodename
I found the solution. When SLURM takes in a file for a nodelist, one has to use the node file option:
/usr/bin/srun --nodefile=nodelistfile --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"
Oh! Thanks for tracking that down, we'll fix this!
It is puzzling, though, that srun doesn't throw an error. When I do it by hand, srun throws an error when feeding a nodelist file to the --nodelist= option.
@lsawade : the fix has been released, please let us know if that problem still happens!
@andre-merzky, will test!
Sorry for the extraordinarily late feedback, but the issue seems to persist. It already hangs in the Hello, World task.
Did I update correctly?
My stack:
My script:
Tarball:
Bugger... - the code, though, is using --nodefile=:
$ grep srun task.0000.sh
task.0000.sh:/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodefile=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018719.0005/pilot.0000/task.0000//task.0000.nodes --export=ALL,NODE_LFS_PATH="/tmp" /bin/echo "Hello world!"
but that task indeed never returns. Does that line work on an interactive node? FWIW, task.0000.nodes contains:
$ cat task.0000.nodes
traverse-k02g1
Yes, in interactive mode, with the nodefile changed to the node I land on, it works.
Edit: In my interactive job I'm using one node only, let me try with two...
It also works when using the two nodes in the interactive job and editing task.0000.nodes to contain one of the accessible nodes. Either node works, so this does not seem to be the problem.
Hmm, where does that leave us... - so it is not the srun command format which is at fault after all?
Can you switch your workload to, say, /bin/date to make sure we are not looking at the wrong place, and that the application code behaves as expected when we run under EnTK?
Would you mind running one more test: interactively get two nodes, and run the command against the other node, not the one you land on.
See Update above
You should see the allocated nodes via cat $SLURM_NODEFILE or something like that (env | grep SLURM will be helpful).
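(As an aside: on most Slurm installations the compressed SLURM_NODELIST is set even when no node-file variable exists. A small sketch of expanding it into one hostname per line; the output file name here is only illustrative:)
# expand e.g. "traverse-k[04,05]g10" into one hostname per line
scontrol show hostnames "$SLURM_NODELIST" > my_nodes.txt
cat my_nodes.txt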
echo $SLURM_NODELIST works; I don't seem to have the nodefile environment variable.
What do you mean by switching my workload to /bin/date?
I also tested running the entire task.0000.sh in interactive mode, and it had no problem.
Slurm on Traverse seems to be working in a strange way. Lucas is in contact with the research service at Princeton.
Two things that have come up:
1. The srun command needs a -G0 flag (no GPUs) if a non-GPU task is executed with a resource set that contains GPUs. The command only hangs if the resource set contains GPUs, and runs otherwise! (See the sketch after this list.)
2. My hello world task also encountered issues because I didn't properly escape the ! in "Hello, World!". Use "Hello, World\!" instead (facepalm).
. Use "Hello, World\!" instead. facepalmMost quick debugging discussions were held on Slack but here a summary for posterity:
@andre-merzky published a quick fix for the srun command on one of the RP branches (https://github.com/radical-cybertools/radical.pilot/commit/aee4fb8862fa4fbf55589a23a9cc0c66ee839d40), but there is a KeyError that is issued by EnTK when calling something from the pilot.
My apologies, that error is now fixed in RP.
Getting a new one again!
EnTK session: re.session.traverse.princeton.edu.lsawade.018720.0008
Creating AppManager
Setting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
Setting up RabbitMQ system n/a
new session: [re.session.traverse.princeton.edu.lsawade.018720.0008] \
database : [mongodb://specfm:****@129.114.17.185/specfm] ok
create pilot manager ok
submit 1 pilot(s)
pilot.0000 princeton.traverse 16 cores 8 gpus ok
closing session re.session.traverse.princeton.edu.lsawade.018720.0008 \
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
session lifetime: 16.1s ok
wait for 1 pilot(s)
0 timeout
All components terminated
Traceback (most recent call last):
File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 428, in run
self._rmgr._submit_resource_request()
File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 177, in _submit_resource_request
self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
File "/home/lsawade/thirdparty/python/radical.pilot/src/radical/pilot/pilot.py", line 558, in wait
time.sleep(0.1)
KeyboardInterrupt
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "solver.py", line 104, in <module>
appman.run()
File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 453, in run
raise KeyboardInterrupt from ex
KeyboardInterrupt
So, the way I install EnTK and the pilot at the moment is as follows:
# Install EnTK
conda create -n conda-entk python=3.7 -c conda-forge -y
conda activate conda-entk
pip install radical.entk
(Note: I'm not changing the pilot here, just keeping the default one.)
Then, I get the radical.pilot repo to create the static ve.rp. Log out, log in:
# Create environment
module load anaconda3
conda create -n ve -y python=3.7
conda activate ve
# Install Pilot
git clone git@github.com:radical-cybertools/radical.pilot.git
cd radical.pilot
pip install .
# Create static environment
./bin/radical-pilot-create-static-ve -p /scratch/gpfs/$USER/ve.rp/
Log out, Log in:
conda activate conda-entk
python workflow.py
Is there any news here?
@andre-merzky ?
Alright, I got the workflow manager to -- at least -- run. Not hanging, yay
One of the issues is that when I create the static environment using radical-pilot-create-static-ve, it does not install any dependencies, so I installed all requirements into the ve.rp.
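For concreteness, a sketch of one way those requirements could be installed, assuming the static VE activates like a regular virtualenv and that the radical.pilot checkout's requirements.txt lists the missing packages (paths reuse the install steps above):
# activate the static VE created earlier and install RP's dependencies into it
source /scratch/gpfs/$USER/ve.rp/bin/activate
pip install -r radical.pilot/requirements.txt
deactivate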
However, I'm sort of back to square one. A serial task executes, task.0000.out has "Hello World" in it, and the log shows that task.0000 does return with a 0 exit code, but the workflow manager marks it as failed, and task.0000.err contains the following line:
cpu-bind=MASK - traverse-k01g10, task 0 0 [140895]: mask 0xf set
I'll attach the tarball.
It is also important to note that the manager seems to stop scheduling other tasks upon failure of the first task. I wasn't able to find anything about it in the log.
@andre-merzky ping
Hi @lsawade, that message in the err file looks like a verbose message and doesn't indicate an error. And some additional comments:
(a) radical-pilot-create-static-ve: for dependencies there is an extra option -d (== set default modules); see the sketch after the config snippet below.
(b) if you use a shared FS for running your workflow (using the same virtual env for client and pilot), then you can set that in the resource config, e.g.:
"python_dist" : "anaconda", # there are two options: "default" or "anaconda" (for conda env)
"virtenv_mode" : "use",
"virtenv" : <name/path>, # better to use a full path
"rp_version" : "installed", # if RCT packages are pre-installed
@andre-merzky, just as a side comment: with pre-set resource configs, should we set the default value of python_dist to anaconda? (since there is module load anaconda3 in pre_bootstrap_0)
Hi @mtitov, thanks for getting back to me. Aah, I missed that when installing the static-ve.
that message in err file looks like a verbose message and doesn't indicate an error
That's what I thought, too. I mean, the task finishes successfully (STDOUT is fine). It just flags itself as failed when I run the appmanager. So, I'm a bit unsure why the Task fails.
Yeah, that's what I missed: the task has the final state FAILED, and it gets it after TMGR_STAGING_OUTPUT, thus I assume something went wrong on the client side. @lsawade, can you please attach the client sandbox as well?
Sorry, I only saw the notification now; attached is the corresponding client sandbox.
Hi @lsawade, thank you for the sandbox. It looks like the issue is with the name of the output: by default, RP sets the name of the task output to <task_uid>.out (and similarly for the err file); before, we had it as STDOUT for all tasks. For now, if you want to collect the corresponding outputs without using task ids, the output file name can be set explicitly, thus:
from radical.entk import Task

t = Task()
t.stdout = 'STDOUT'                    # explicit output file name
...
t.download_output_data = ['STDOUT']
(*) With your run everything went fine; just at the end, the TaskManager couldn't collect STDOUT.
Lord, if that ends up being the final issue, that would be wild... Let me test this later today, and I will get back to you!
So, I tested stuff yesterday, and things seem to work out! There is one catch that is probably solvable. When I need GPUs from different nodes, I feel like the mpirun in the task.000x.sh has to fail because it does not know which GPUs to use. Meaning, I want to run 2 instances of specfem simultaneously, each needing 6 GPUs, but I only have 4 GPUs per node and am running on 3 nodes (12 GPUs total). That means there is an overlap in nodes, which I am not sure mpirun can handle by itself.
Task 1:
mpirun -np 6 -host traverse-k04g9,traverse-k04g9,traverse-k05g10,traverse-k05g10,traverse-k05g10,traverse-k05g10 -x ...
Task 2:
mpirun -np 6 -host traverse-k04g7,traverse-k04g7,traverse-k04g7,traverse-k04g7,traverse-k04g9,traverse-k04g9 -x ...
Note that both use traverse-k04g9, but in the rest of the executable there is no sign of which GPU is supposed to be used, and both tasks never execute.
Update:
I tried to run the mpirun line in interactive mode and it hangs. I do not know why. It even hangs when I do not specify the nodes. But(!), this one does not:
srun -n 6 --gpus-per-task=1 ./bin/xspecfem3D
Just a quick update: I'm still looking for a workaround and am in contact with the research computing people here.
srun -n 6 --gpus=6 <some test task>
works, but when I do
srun -n 6 --gpus=6 ./bin/xspecfem3D
it doesn't. Very curious, but I'm on it, and will put more info here eventually.
Just a quick update: the above-described commands are executed differently depending on the cluster at hand at Princeton, meaning that it will be hard to generalize Slurm submission. I have been talking to people from PICSciE; there is no obvious solution right now. I will get back here once I have more info.
@lsawade to provide an example that we can test on other clusters with SLURM.
I cannot test whether this would work, but below is an example that I expect to work if slurm is configured correctly.
The jobs submit, just not in parallel. This submission setup is for 3 nodes, where each node has 4 GPUs, and two GPU-requiring sruns have to be executed, each with 6 tasks and 1 GPU per task. For this setup to run in parallel, the two sruns would have to share a node.
Let the submission script be:
#!/bin/bash
#SBATCH -t00:05:00
#SBATCH --gpus 12
#SBATCH -n 12
#SBATCH --output=mixed_gpu.txt
module load openmpi/gcc cudatoolkit
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 0 &
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 1 &
wait
where show_devices.sh is:
#!/bin/bash
echo Script $1
echo JOB $SLURM_JOB_ID STEP $SLURM_STEP_ID
echo $CUDA_VISIBLE_DEVICES
sleep 60
Output of mixed_gpu should look somewhat like this:
and the job steps 203755.0 and 203755.1 should start at roughly the same time, unlike here:
sacct --format JobID%20,Start,End,Elapsed,ReqCPUS,JobName%20, -j 203755
ping
Hi @lsawade, I will have time on Friday to work on this and hope to have results back before our call.
Hey @lsawade - the reason for the behavior eludes me completely. I can confirm that the same is observed on at least one other Slurm cluster (Expanse @ SDSC), and I opened a ticket there to hopefully get some useful feedback. At the moment I simply don't know how we can possibly resolve this. I am really sorry for that; I understand that this has been blocking progress for several months now :-/
Yeah, I have had a really long thread with the people from the research computing group, and they did not understand why this is not working either. Maybe we should contact the slurm people?
Yes, I think we should resort to that. I'll open a ticket if the XSEDE support is not able to suggest a solution within a week.
We got some useful feedback from XSEDE after all: slurm seems indeed to be unable to do correct auto-placement for non-node-local tasks. I find this surprising, and it may still be worthwhile to open a slurm ticket about this. Either way though: a workaround is to start the job with a specific node file. From your example above:
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 0 &
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 1 &
should work as expected with
export SLURM_HOSTFILE=host1.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 0 &
export SLURM_HOSTFILE=host2.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 1 &
where the host files look like, for example:
$ cat host2.list
exp-1-57
exp-1-57
exp-6-58
exp-6-58
exp-6-58
exp-6-58
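For reference, a sketch of one way such host files could be generated inside the allocation; the file names and the 6-slot split mirror the example above, and the 4-GPUs-per-node assumption matches Expanse/Traverse (this helper is not part of RP/EnTK):
# one line per allocated node
scontrol show hostnames "$SLURM_NODELIST" > all_nodes.txt
# 4 GPUs per node -> repeat each hostname four times, one line per GPU slot
awk '{for (i = 0; i < 4; i++) print}' all_nodes.txt > slots.txt
# first 6 slots for step 0, next 6 slots for step 1
head -n 6 slots.txt              > host1.list
tail -n +7 slots.txt | head -n 6 > host2.list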
Now, that brings us back to RP / EnTK: we actually do use a hostfile, we just miss out on the --distribution=arbitrary flag. Before we include that, could you please confirm that the above does in fact also work on Traverse?
Hi @andre-merzky,
I have been playing with this and I can't seem to get it to work. I explain what I do here: https://github.com/lsawade/slurm-job-step-shared-res
I'm not sure whether it's me or Traverse.
Can you adjust this mini example to see whether it runs on XSEDE? Things you would have to change are the automatic writing of the hostfile and how many tasks per job step. If you give me the hardware setup of XSEDE, I could also adjust the script and give you something that should run out of the box to check.
The hardware setup on Expanse is really similar to Traverse: 4 GPUs/node.
I pasted something incorrect above, apologies! Too many scripts lying around :-/ The --gpus=6 flag was missing. Here is the correct one, showing the same syntax working for both cyclic and block:
This is the original script:
These are the resulting node files:
and these are the resulting outputs:
So, I have some good news: I have also tested this on Andes, and it definitely works there as well. I added a batch_andes.sh batch script to the repo to test the arbitrary distribution for cyclic and block with nodes [1,2], [1,2] and [1], [1,2,2], respectively.
The annoying news is that it does not seem to work on Traverse. At least I was able to test whether it's a user error...
So, how do we proceed? I'm sure it's a setting in the slurm setup. Do we open a ticket with the Andes/Expanse support? I'll for sure open a ticket with PICSciE and see whether they can find a solution.
UPDATE:
The unexpected/unwanted output on Traverse:
It almost looks like there is a misunderstanding between Slurm and CUDA: the visible devices should not be all of CUDA_VISIBLE_DEVICES, should they?
PS: I totally stole the way you made the block and cyclic functions as well as the printing. Why did I not think of that...?
Ok I can run things on Traverse using this setup. But there are some things I have learnt:
On Traverse, to not give a job step the entire CPU affinity of the involved nodes, I have to use the --exclusive flag in srun, which indicates that certain CPUs/cores are exclusively used by that job step and not anything else.
Furthermore, I cannot use --cpus-per-task=1, which makes a lot of sense; the CPU affinity prints should have rung a bell for me. I feel dense.
So, at request time, I ask SBATCH for resources like so:
#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1
and then
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive script.sh
or even
srun --ntasks=4 --distribution=arbitrary --exclusive script.sh
would work.
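Putting the working pieces together, a minimal end-to-end sketch (an assumption-laden reconstruction from the fragments above: host1.list/host2.list as in the earlier SLURM_HOSTFILE example, show_devices.sh as defined before):
#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1

# each backgrounded step picks up its own host list via SLURM_HOSTFILE
export SLURM_HOSTFILE=host1.list
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive show_devices.sh 0 &
export SLURM_HOSTFILE=host2.list
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive show_devices.sh 1 &
wait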
What does not work is the following:
...
#SBATCH -n 8
#SBATCH -G 8
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive
For some reason, I cannot request a pool of GPUs and take from it.
Hi,
I don't know whether this is related to #135. It is weird because I got everything running on a single node, but as soon as I use more than one, EnTK seems to hang. I checked the submission script and it looks fine to me; so did the node list.
The workflow already hangs at the submission of the first task, which is a single-core, single-thread task.
Stack:
Client zip: client.session.zip
Session zip: sandbox.session.zip