Closed mturilli closed 4 years ago
A few more comments and questions on `radical.saga`.
With Slurm `19.05.0`, `#SBATCH -N` is required in the header. Also, the `working_directory` is set with `#SBATCH --chdir` in Slurm `19.05.0`.
#!/bin/bash
#SBATCH -A GEO111
#SBATCH -J process_data
#SBATCH -N 4
#SBATCH -t 1:00:00
srun -n16 -N1-1 -r0 process.py file1
srun -n16 -N1-1 -r1 process.py file2
srun -n16 -N1-1 -r2 process.py file2
srun -n16 -N1-1 -r3 process.py file2
Let me explain more. In my usual case, I use one node (16 cores, 16 MPI tasks) to process one data file. So in my header I only care about the total number of nodes (which equals the number of files processed simultaneously). I don't put `--ntasks-per-node=%s` or `--cpus-per-task=%s` in the header. I think `radical.saga` sets one of the two values by default, and I don't know how to set those values correctly (or whether I need them at all).
The command `srun -n16 -N1-1 -r0 process.py file1` ensures that one srun uses 16 MPI tasks on exactly one node (`-N1-1`) and runs `process.py file1` on node 0 (zero-indexed) of the allocation.
Please help me to understand it so I can set the right values in the job description.
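To make the intended layout concrete, here is a minimal Python sketch that generates the `srun` lines shown above, one per input file, each pinned to its own node of the allocation. The function name `build_srun_commands` and the example file names are hypothetical, and the sketch only builds strings; it does not say how `radical.saga` itself should be configured.

```python
def build_srun_commands(files, tasks_per_node=16, program="process.py"):
    """Return one srun command line per input file, each pinned to its own node."""
    commands = []
    for node_index, filename in enumerate(files):
        # -n: number of MPI tasks for this job step
        # -N1-1: use exactly one node (min 1, max 1)
        # -r: zero-based offset of that node within the allocation
        commands.append(
            f"srun -n{tasks_per_node} -N1-1 -r{node_index} {program} {filename}"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical input files; the node count in the header (#SBATCH -N)
    # would need to match the number of files.
    for line in build_srun_commands(["file1", "file2", "file3", "file4"]):
        print(line)
```

With this layout the number of files equals the number of nodes requested with `#SBATCH -N`.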
I have another question about the logger in `radical.saga`.
Based on the tutorial, I can set the logger by:
SAGA_VERBOSE=DEBUG SAGA_LOG_TARGETS=STDOUT,/tmp/mysaga.log python mysagaprog.py
However, based on my test runs, the command is not effective at all.
When I run test scripts using `radical.saga`, some log files are generated in the running directory.
-rw-rw-r-- 1 lei lei 0 Nov 24 11:16 radical.saga.api.log
-rw-rw-r-- 1 lei lei 0 Nov 24 11:16 radical.saga.cpi.log
-rw-rw-r-- 1 lei lei 0 Nov 24 11:16 radical.saga.log
-rw-rw-r-- 1 lei lei 0 Nov 24 11:16 radical.saga.pty.log
-rw-rw-r-- 1 lei lei 0 Nov 24 11:16 radical.utils.log
However, those log files are empty.
I noticed that `radical.saga` now uses `radical.utils.logger` instead of its own logger. I'm not sure whether that change affects logging.
I want to know where the log files are and how to change the log level.
@wjlei1990 thank you for the feedback.
I added @andre-merzky to the ticket as there might be some issues with `radical.saga` that are not specific to rhea.
@lee212 please feel free to open a ticket for each issue specifically related to rhea in the `radical.saga` repository, referencing this ticket.
Hi @wjlei1990 ,
for the logging to work, setting `RADICAL_LOG_LVL=DEBUG` should be all you need. With the naming transition from `saga-python` to `radical.saga` a while back we were able to unify those settings across the stack, at last! The `SAGA_*` variables are not supported anymore.
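As a rough illustration of the switch from the `SAGA_*` variables to `RADICAL_LOG_LVL`, the sketch below prepares an environment for a child process. The helper names `radical_debug_env` and `run_with_debug_logging` are hypothetical; `mysagaprog.py` is the script name used earlier in this thread. Dropping the old variables is only tidying up, since they are reported above as no longer supported.

```python
import os
import subprocess
import sys


def radical_debug_env(base=None):
    """Copy an environment, drop legacy SAGA_* variables, enable debug logging."""
    env = dict(os.environ if base is None else base)
    for key in [k for k in env if k.startswith("SAGA_")]:
        del env[key]  # reported above as no longer supported
    env["RADICAL_LOG_LVL"] = "DEBUG"
    return env


def run_with_debug_logging(script):
    """Run a script (e.g. mysagaprog.py) with RADICAL_LOG_LVL=DEBUG set."""
    return subprocess.run([sys.executable, script], env=radical_debug_env())
```

On the command line the equivalent would simply be `RADICAL_LOG_LVL=DEBUG python mysagaprog.py`.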
Similarly, I see that you mention the `saga-python` module in your initial post - that still exists for some users who need the old version, but you should not need to install it. Instead, `radical.entk` and `radical.pilot` should automatically pull in `radical.saga` as a dependency. If you need to install manually, please use that module name.
Can you please send the output of `radical-stack` to confirm what your resulting installation is?
As for the actual problem you see (submission on Rhea): you are trying to run
js = saga.job.Service("fork://localhost")
but that won't work on rhea, as that will land the jobs on the head node. You probably want to use:
js = saga.job.Service("slurm://rhea.ccs.ornl.gov/")
Please do use the fully qualified hostname, as we have some rhea specific code path in radical.saga (but see below).
@lee212 : I seem to remember that you tested the SAGA slurm adaptor on Rhea, and that it worked - is that correct? But I don't see the respective checks in the slurm adaptor... Can you please confirm with what version you did those tests? Thanks!
Hi Andre,
Thanks for your feedback.
For the logging, it works!
I installed `radical.saga` from source code cloned from the GitHub repo (and then ran `pip install .`). I think I am using the most recent `master` branch. Below is the output from `radical-stack`.
(radical) lei@rhea-login1g ~/test/radical/slurm_work $
radical-stack
python : 3.7.3
pythonpath : /sw/rhea/xalt/1.1.3/site:/sw/rhea/xalt/1.1.3/libexec
virtualenv : radical
radical.saga : 0.90.0-bv0.72.0-46-g57bc8dd0@devel
radical.utils : 0.90.3
Also, in my script I am using:
import radical.saga as saga
Calling `saga.job.Service("slurm://rhea.ccs.ornl.gov/")` gives me errors.
*** Backtrace:
File "test_adaptor.py", line 22, in <module>
sys.exit(main())
File "test_adaptor.py", line 11, in main
js = saga.job.Service("slurm://rhea.ccs.ornl.gov/")
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/job/service.py", line 116, in __init__
url, session, ttype=_ttype)
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/base.py", line 113, in __init__
**kwargs)
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/adaptors/cpi/decorators.py", line 62, in wrap_function
return sync_function (self, *args, **kwargs)
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/adaptors/slurm/slurm_job.py", line 297, in init_instance
self._open()
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/adaptors/slurm/slurm_job.py", line 348, in _open
self.shell = rsups.PTYShell(shell_url, self.session, self._logger)
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 244, in __init__
interactive=self.interactive)
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 174, in initialize
posix, interactive)
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 584, in _create_master_entry
% (url.schema, url.host))
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/exceptions.py", line 204, in _log
return cls (msg, parent=parent, api_object=api_object, from_log=True)
File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/exceptions.py", line 356, in __init__
SagaException.__init__ (self, msg, parent, api_object, from_log)
Here is my routine. I log on to RHEA and run `radical.saga` jobs on the RHEA login node. I think `js = saga.job.Service("slurm://")` works for me, since I do see jobs submitted to the queue, and they start to run and exit successfully.
When you talk about using `saga.job.Service("slurm://rhea.ccs.ornl.gov/")`, do you mean I launch saga scripts remotely (like on my desktop) or on rhea login nodes?
Also, regarding my previous question, could you tell me how to set `--ntasks-per-node` or `--cpus-per-task` correctly for my scenario and use case? Or do I even need them?
Another question about the job description. On Rhea, do I need to put:
jd.spmd_variation = "mpi"
It seems I don't need it, since I am going to launch `srun` myself.
@andre-merzky, I can't get my RP stack working on rhea; my env fails with the error:
caught Exception: [Errno 28] No space left on device
This may be caused by the retired Lustre Atlas filesystem, but I need to find out how to resolve this because updating the JSON configuration file with Alpine GPFS didn't fix the problem. This is for py3, not py2, BTW.
@lee212 it might also be that your home (or wherever you run the script) is running out of space :)
!!! Urgent... the code is complaining on RHEA with `Could not detect shell prompt (timeout)`:
An exception occured: (NoSuccess) Could not detect shell prompt (timeout) (/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +293 (_initialize_pty) : raise rse.NoSuccess("Could not detect shell prompt (timeout)"))
*** Backtrace:
File "test_slurm.py", line 66, in <module>
sys.exit(main())
File "test_slurm.py", line 12, in main
js = saga.job.Service("slurm://")
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/job/service.py", line 116, in __init__
url, session, ttype=_ttype)
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/base.py", line 113, in __init__
**kwargs)
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/adaptors/cpi/decorators.py", line 62, in wrap_function
return sync_function (self, *args, **kwargs)
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/adaptors/slurm/slurm_job.py", line 297, in init_instance
self._open()
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/adaptors/slurm/slurm_job.py", line 348, in _open
self.shell = rsups.PTYShell(shell_url, self.session, self._logger)
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 244, in __init__
interactive=self.interactive)
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 199, in initialize
self._initialize_pty(info['pty'], info)
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 293, in _initialize_pty
raise rse.NoSuccess("Could not detect shell prompt (timeout)")
File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/exceptions.py", line 446, in __init__
SagaException.__init__ (self, msg, parent, api_object, from_log)
Any idea how to fix it?
Hi @wjlei1990 : I am not sure what's up with the shell prompt problem - is that repeatable? What exactly are you running?
The SAGA slurm adaptor is now fixed for Rhea (in the saga branch `fix/rhea`), but I need to iterate on RP and EnTK a bit more.
Hi @andre-merzky, it seems to always fail in one of my virtual environments (created by conda)... It just keeps trying and times out. However, it works in some other virtual environments. I guess it is an issue with my virtual environment.
However, even in the virtual envs where it works, it still tries multiple times to get the shell prompt. Do you think you could fix that anyway? I guess this is not that urgent since I can still use it...
Hi @wjlei1990 : This should not depend on the virtualenv, or at least I can't see how. Can you send me the command or code you are running please, and I'll try to reproduce it. You can also try to set
export RADICAL_SAGA_PTY_LOG_LVL=DEBUG
export RADICAL_SAGA_PTY_LOG_TGT=pty.log
and attach the resulting `pty.log` to this issue - this should trace the shell interactions. Last but not least, you can set
export RADICAL_SAGA_PTY_SSH_TIMEOUT=60
to see if this makes a difference (the default is 10 seconds).
PS: Thanks for the help. Since I can get it to work in some virtual envs, this is NOT an urgent issue.
This doesn't seem to work for me; at least it doesn't generate the `pty.log` file.
export RADICAL_SAGA_PTY_LOG_LVL=DEBUG
export RADICAL_SAGA_PTY_LOG_TGT=pty.log
So I just used `RADICAL_LOG_LVL=DEBUG`.
The successful case is here, in the file `radical.saga.cpi.log`. Even though it is successful, I think it tries multiple times (2 or 3 times) to find the shell prompt (between lines 10 and 15).
radical.saga.cpi.log
Here is a bad case (in another virtual env).
Starting from line 10, you can see it keeps trying until it reaches the maximum number of tries and fails. radical.saga.cpi.log
I also zipped all the logs in case that helps you locate the issue. slurm_work.bad.zip
Anyway, since this is working now (at least in some virtual envs), I don't think this is an urgent issue.
Also, all the tests I have made are based on the `master` branch, not `fix/rhea`.
Thanks for the logs, @wjlei1990 !
What happens if you run this on the command line:
/usr/bin/env TERM=vt100 "/bin/bash" -i
This should result in a new shell running rather quickly:
rivendell merzky ~ $ date +%s.%N; echo "date +%s.%N" | /usr/bin/env TERM=vt100 "/bin/bash" -i
1575365236.933453427
rivendell merzky ~ $ date +%s.%N
1575365237.456314561
Did you try
export RADICAL_SAGA_PTY_SSH_TIMEOUT=60
Can you show me the output of
echo $PS1
echo $PS2
echo $PS3
echo $PS4
echo $PROMPT_COMMAND
Thanks!
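The shell start-up check suggested above can also be timed from Python, which makes it easier to compare virtual environments. This is only a sketch: `time_command` is a hypothetical helper, and it measures how long the command takes to run to completion, not what `radical.saga` does internally when it looks for the prompt.

```python
import subprocess
import time


def time_command(cmd, stdin_text="", timeout=60.0):
    """Return (elapsed_seconds, return_code) for running cmd to completion."""
    start = time.monotonic()
    result = subprocess.run(
        cmd,
        input=stdin_text,        # fed to the command on stdin
        text=True,
        stdout=subprocess.PIPE,  # capture output so it does not clutter the terminal
        stderr=subprocess.PIPE,
        timeout=timeout,         # raises subprocess.TimeoutExpired if exceeded
    )
    return time.monotonic() - start, result.returncode


if __name__ == "__main__":
    # Same command as suggested above; "exit" makes the interactive shell quit.
    shell = ["/usr/bin/env", "TERM=vt100", "/bin/bash", "-i"]
    elapsed, code = time_command(shell, stdin_text="exit\n")
    print(f"shell started and exited in {elapsed:.2f}s (return code {code})")
```

If the measured time in the failing environment comes close to the 10 second default mentioned earlier, slow shell start-up (for example from shell init files) would be a plausible suspect.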
It seems the fix you put into the code here doesn't work for me.
if 'rhea' in self.rm.host.lower():
In my case, `self.rm.host.lower()` returns the value `localhost`. It does not match `*rhea*`...
Could you confirm it?
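The mismatch can be reproduced outside of `radical.saga` with a small sketch. It uses the standard library's `urllib.parse` rather than saga's own URL class, and the fallback to `localhost` for a URL without a host is an assumption modeled on the behaviour reported above; `resolve_host` and `looks_like_rhea` are hypothetical names.

```python
from urllib.parse import urlparse


def resolve_host(url, default="localhost"):
    """Return the host part of a job service URL, or `default` if it has none."""
    # Assumption: a URL without a host (e.g. "slurm://") is treated as local.
    return (urlparse(url).hostname or default).lower()


def looks_like_rhea(url):
    """Mimic the `'rhea' in host` check quoted above for a given URL."""
    return "rhea" in resolve_host(url)


if __name__ == "__main__":
    for url in ("slurm://", "slurm://rhea.ccs.ornl.gov/"):
        print(url, "->", resolve_host(url), looks_like_rhea(url))
```

Under these assumptions only the fully qualified URL passes the check, which is consistent with the earlier advice to use the fully qualified hostname.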
@andre-merzky @lee212 -- can you please look into @wjlei1990's query?
Thanks for the ping!
RP should by now have config files which use the FQHN in the job access URLs, even for local access. Can you please check if this is the case in the RP version you have installed? If not, you may want to either change the respective config entries (python2) or update RP (python3).
Got it. Thanks. Let me check it and I will post updates here.
This works now, closing
Hi Matteo,
I just tried `radical.saga` on RHEA and there are some small issues. The thing is, I reinstalled everything (new conda and a new virtual env with conda) and installed saga in the brand-new conda virtual env:
Then I wrote a simple script to test whether `saga` works on rhea. My script is below:
Then I ran the code and it gave me an error instantly:
I looked into the code and found that `timeout` is a string. OK, I searched for where `timeout` is assigned and found that it is assigned here:
Then I changed the above line to:
It seems this issue is gone. However, on RHEA, `radical.saga` still tries multiple times to run `pty_shell.find` here. However, I read the comment and it said:
Not sure if it is an issue or not.
If I make the change mentioned above, `js = saga.job.Service("fork://localhost")` works and I can run small test scripts for local jobs.
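The string-valued `timeout` described above is a common pitfall when a numeric setting is read from the environment or a config file. The sketch below shows one defensive way to read such a value. It is not the actual change made in `radical.saga`: the helper `read_timeout` is hypothetical, and the variable name and the 10 second default are taken from a later comment in this thread.

```python
import os


def read_timeout(env=None, name="RADICAL_SAGA_PTY_SSH_TIMEOUT", default=10.0):
    """Return the timeout as a float, whether or not the variable is set."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw == "":
        return default
    # Environment values are always strings, so convert before comparing
    # against numbers or passing the value to timing functions.
    return float(raw)
```

Converting once at the point where the value is read keeps the rest of the code free of string and number mix-ups.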