radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number: 1639694

Support Rhea with SLURM #99

Closed mturilli closed 4 years ago

wjlei1990 commented 4 years ago

Hi Matteo,

I just tried radical.saga on Rhea and there are some small issues.

The thing is, I reinstalled everything (a new conda and a new virtual env created with conda), and installed saga in the brand-new conda virtual env:

pip install saga-python

Then I wrote a simple script to test whether saga works on Rhea. My script is below:

import radical.saga as saga 
js = saga.job.Service("fork://localhost")

Running the code gives me an error immediately:

expand iterable ['radical.saga.adaptors.context.myproxy', 'radical.saga.adaptors.context.x509', 'radical.saga.adaptors.context.ssh', 'radical.saga.adaptors.context.userpass', 'radical.saga.adaptors.shell.shell_job', 'radical.saga.adaptors.shell.shell_file', 'radical.saga.adaptors.shell.shell_resource', 'radical.saga.adaptors.redis.redis_advert', 'radical.saga.adaptors.sge.sgejob', 'radical.saga.adaptors.lsf.lsfjob', 'radical.saga.adaptors.condor.condorjob', 'radical.saga.adaptors.slurm.slurm_job', 'radical.saga.adaptors.http.http_file', 'radical.saga.adaptors.aws.ec2_resource', 'radical.saga.adaptors.loadl.loadljob', 'radical.saga.adaptors.globus_online.go_file', 'radical.saga.adaptors.torque.torquejob', 'radical.saga.adaptors.pbspro.pbsprojob', 'radical.saga.adaptors.srm.srmfile', 'radical.saga.adaptors.cobalt.cobaltjob']
find... None
Traceback (most recent call last):
  File "test_radical.py", line 3, in <module>
    js = saga.job.Service("fork://localhost")
  File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/job/service.py", line 116, in __init__
    url, session, ttype=_ttype)
  File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/base.py", line 113, in __init__
    **kwargs)
  File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/adaptors/cpi/decorators.py", line 62, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/adaptors/shell/shell_job.py", line 472, in init_instance
    self._logger, opts=self.opts)
  File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 244, in __init__
    interactive=self.interactive)
  File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 199, in initialize
    self._initialize_pty(info['pty'], info)
  File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 421, in _initialize_pty
    raise ptye.translate_exception (e)
  File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 293, in _initialize_pty
    if time.time() - time_start > timeout:
TypeError: '>' not supported between instances of 'float' and 'str'

I looked into the code and found that timeout is a string. I searched for where timeout is assigned and found it here:

timeout    = info['ssh_timeout']

Then I changed the above line to:

timeout    = float(info['ssh_timeout'])
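
As a quick standalone illustration of the type mismatch (the timeout value here is made up, not the real config entry):

    import time

    time_start = time.time()
    timeout    = "10.0"            # hypothetical value: read from the config as a string
    # time.time() - time_start > timeout   # raises the TypeError shown above
    timeout    = float(timeout)    # the cast fixes the comparison
    print(time.time() - time_start > timeout)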

With that change, the issue seems to be gone. However, on Rhea, radical.saga still tries multiple times to run pty_shell.find here. I read the comment there, which says:

# we found none of the prompts, yet, and need to try
# again.  But to avoid hanging on invalid prompts, we
...

Not sure if this is an issue or not.

With the change mentioned above, js = saga.job.Service("fork://localhost") works and I can run small test scripts for local jobs.
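
For reference, the kind of small local test I mean is a minimal sketch like the following (the executable and arguments are just placeholders, not my actual script):

    import radical.saga as saga

    js = saga.job.Service("fork://localhost")

    jd = saga.job.Description()
    jd.executable = "/bin/echo"
    jd.arguments  = ["hello from radical.saga"]

    job = js.create_job(jd)
    job.run()
    job.wait()
    print(job.state, job.exit_code)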

wjlei1990 commented 4 years ago

A few more comments and questions on radical.saga:

  1. The Slurm version on Rhea is 19.05.0. #SBATCH -N is required in the header.

https://github.com/radical-cybertools/radical.saga/blob/devel/src/radical/saga/adaptors/slurm/slurm_job.py#L493

Also, the working_directory maps to #SBATCH --chdir in Slurm 19.05.0.

https://github.com/radical-cybertools/radical.saga/blob/devel/src/radical/saga/adaptors/slurm/slurm_job.py#L555

  2. I am not sure how to set the header information in my job script. Here is an example of an sbatch script I use for my own case:
    
    #!/bin/bash
    #SBATCH -A GEO111
    #SBATCH -J process_data
    #SBATCH -N 4
    #SBATCH -t 1:00:00

    srun -n16 -N1-1 -r0 process.py file1
    srun -n16 -N1-1 -r1 process.py file2
    srun -n16 -N1-1 -r2 process.py file2
    srun -n16 -N1-1 -r3 process.py file2



Let me explain more. In my usual case, I use one node (16 cores, 16 MPI tasks) to process one data file. So in my header I only care about the total number of nodes (which equals the number of files processed simultaneously). I don't put `--ntasks-per-node=%s` or `--cpus-per-task=%s` in the header. I think radical.saga specifies one of the two values by default, and I don't know how to set those values correctly (or whether I need them at all).

The srun command `srun -n16 -N1-1 -r0 process.py file1` ensures that this srun uses 16 MPI tasks on exactly one node (-N1-1), and runs `process.py file1` on node 0 (relative to the allocation).

Please help me understand this so I can set the right values in the job description.
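
For context, here is a rough sketch of how I imagine the header above might map onto a SAGA job description (attribute names from the SAGA job API; how the adaptor derives -N from these counts is exactly what I am unsure about, so the values are assumptions):

    import radical.saga as saga

    jd = saga.job.Description()
    jd.project            = "GEO111"         # SBATCH -A GEO111
    jd.name               = "process_data"   # SBATCH -J process_data
    jd.wall_time_limit    = 60               # SBATCH -t 1:00:00, in minutes
    jd.total_cpu_count    = 64               # 4 nodes x 16 cores
    jd.processes_per_host = 16               # hoping this combination yields -N 4
    jd.executable         = "/bin/bash"
    jd.arguments          = ["run_sruns.sh"] # hypothetical wrapper containing the srun lines
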
wjlei1990 commented 4 years ago

I have another question about the logger in radical.saga.

Based on the tutorial, I should be able to configure the logging with:

SAGA_VERBOSE=DEBUG SAGA_LOG_TARGETS=STDOUT,/tmp/mysaga.log python mysagaprog.py

However, based on my test runs, these settings have no effect at all.

When I run test scripts using radical.saga, there are some log files generated in the running directory.

-rw-rw-r-- 1 lei lei    0 Nov 24 11:16 radical.saga.api.log
-rw-rw-r-- 1 lei lei    0 Nov 24 11:16 radical.saga.cpi.log
-rw-rw-r-- 1 lei lei    0 Nov 24 11:16 radical.saga.log
-rw-rw-r-- 1 lei lei    0 Nov 24 11:16 radical.saga.pty.log
-rw-rw-r-- 1 lei lei    0 Nov 24 11:16 radical.utils.log

However, there is nothing shown in those log files.

I noticed that radical.saga now uses the radical.utils logger instead of its own logger. Not sure if this change affects the logging.

I would like to know where the log files are written and how to change the log level.

mturilli commented 4 years ago

@wjlei1990 thank you for the feedback.

I added @andre-merzky to the ticket as there might be some issues with radical.saga that are not specific to Rhea.

@lee212 please feel free to open a ticket in the radical.saga repository for each issue specifically related to Rhea, referencing this ticket.

andre-merzky commented 4 years ago

Hi @wjlei1990 ,

for the logging to work, setting RADICAL_LOG_LVL=DEBUG should be all you need. With the naming transition from saga-python to radical.saga a while back, we were finally able to unify those settings across the stack. The SAGA_* variables are not supported anymore.
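
For example, a minimal sketch (this assumes the variable is picked up when the loggers are created, i.e. it is set before radical.saga is imported):

    import os
    os.environ["RADICAL_LOG_LVL"] = "DEBUG"   # same effect as exporting it in the shell

    import radical.saga as saga
    js = saga.job.Service("fork://localhost")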

Similarly, I see that you mention the saga-python module in your initial post - that still exists for some users who need the old version, but you should not need to install it. Instead, radical.entk and radical.pilot should automatically pull in radical.saga as a dependency. If you need to install manually, please use that module name.

Can you send the output of radical-stack, please, to confirm what your resulting installation is?

As for the actual problem you see (submission on Rhea): you are trying to run

js = saga.job.Service("fork://localhost")

but that won't work on rhea, as that will land the jobs on the head node. You probably want to use:

js = saga.job.Service("slurm://rhea.ccs.ornl.gov/")

Please do use the fully qualified hostname, as we have some Rhea-specific code paths in radical.saga (but see below).
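
A minimal end-to-end sketch of that (the executable is a placeholder, and project/queue settings are omitted here):

    import radical.saga as saga

    js = saga.job.Service("slurm://rhea.ccs.ornl.gov/")

    jd = saga.job.Description()
    jd.executable = "/bin/date"   # placeholder

    job = js.create_job(jd)
    job.run()
    print(job.id, job.state)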

@lee212 : I seem to remember that you tested the SAGA Slurm adaptor on Rhea and that it worked - is that correct? But I don't see the respective checks in the Slurm adaptor... Can you please confirm which version you used for those tests? Thanks!

wjlei1990 commented 4 years ago

Hi Andre,

Thanks for your feedback.

  1. For the logging, it works!

  2. I installed radical.saga from source cloned from the GitHub repo (and then pip install .). I think I am using the most recent master branch. Below is the output from radical-stack:

    (radical) lei@rhea-login1g ~/test/radical/slurm_work $
    radical-stack
    
    python               : 3.7.3
    pythonpath           : /sw/rhea/xalt/1.1.3/site:/sw/rhea/xalt/1.1.3/libexec
    virtualenv           : radical
    
    radical.saga         : 0.90.0-bv0.72.0-46-g57bc8dd0@devel
    radical.utils        : 0.90.3

Also, in my script I am using:

import radical.saga as saga
  3. Using saga.job.Service("slurm://rhea.ccs.ornl.gov/") gives me errors:
    *** Backtrace:
    File "test_adaptor.py", line 22, in <module>
    sys.exit(main())
    File "test_adaptor.py", line 11, in main
    js = saga.job.Service("slurm://rhea.ccs.ornl.gov/")
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/job/service.py", line 116, in __init__
    url, session, ttype=_ttype)
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/base.py", line 113, in __init__
    **kwargs)
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/adaptors/cpi/decorators.py", line 62, in wrap_function
    return sync_function (self, *args, **kwargs)
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/adaptors/slurm/slurm_job.py", line 297, in init_instance
    self._open()
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/adaptors/slurm/slurm_job.py", line 348, in _open
    self.shell = rsups.PTYShell(shell_url, self.session, self._logger)
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 244, in __init__
    interactive=self.interactive)
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 174, in initialize
    posix, interactive)
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 584, in _create_master_entry
    % (url.schema, url.host))
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/exceptions.py", line 204, in _log
    return cls (msg, parent=parent, api_object=api_object, from_log=True)
    File "/ccs/home/lei/anaconda3/envs/radical/lib/python3.7/site-packages/radical/saga/exceptions.py", line 356, in __init__
    SagaException.__init__ (self, msg, parent, api_object, from_log)

    Here is my routine: I log onto Rhea and run the radical.saga script on a Rhea login node. I think js = saga.job.Service("slurm://") works for me, since I do see jobs submitted to the queue, and they start to run and exit successfully.

When you talk about using saga.job.Service("slurm://rhea.ccs.ornl.gov/"), do you mean I should launch the saga scripts remotely (e.g. from my desktop), or on the Rhea login nodes?

wjlei1990 commented 4 years ago

Also, regarding my previous question: could you tell me how to set --ntasks-per-node or --cpus-per-task correctly for my scenario and use case? Or do I even need them at all?

Another question about the job description. On Rhea, do I need to set:

jd.spmd_variation = "mpi"

It seems I don't need it, since I am going to launch srun myself.
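
In other words, what I have in mind is something like the following sketch (whether this is the right way to express it is exactly my question; the wrapper script name is hypothetical):

    import radical.saga as saga

    jd = saga.job.Description()
    jd.executable      = "/bin/bash"
    jd.arguments       = ["run_all_sruns.sh"]  # hypothetical wrapper issuing the srun calls itself
    jd.total_cpu_count = 64                    # 4 nodes x 16 cores, as in the header above
    # jd.spmd_variation = "mpi"                # left unset, since srun handles the MPI launch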

lee212 commented 4 years ago

@andre-merzky , I can't get my RP stack working on Rhea; my env fails with the error caught Exception: [Errno 28] No space left on device. This may be caused by the retired Lustre Atlas filesystem, but I need to find out how to resolve it, because updating the JSON configuration file to use Alpine GPFS didn't fix the problem. This is for py3, not py2, BTW.

wjlei1990 commented 4 years ago

@lee212 it might also be that your home directory (or wherever you run the script) is running out of space :)

wjlei1990 commented 4 years ago

!!! Urgent... on Rhea the code is complaining Could not detect shell prompt (timeout):

An exception occured: (NoSuccess) Could not detect shell prompt (timeout) (/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +293 (_initialize_pty)  :  raise rse.NoSuccess("Could not detect shell prompt (timeout)")) 

*** Backtrace:
   File "test_slurm.py", line 66, in <module>
    sys.exit(main())
  File "test_slurm.py", line 12, in main
    js = saga.job.Service("slurm://")
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/job/service.py", line 116, in __init__
    url, session, ttype=_ttype)
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/base.py", line 113, in __init__
    **kwargs)
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/adaptors/cpi/decorators.py", line 62, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/adaptors/slurm/slurm_job.py", line 297, in init_instance
    self._open()
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/adaptors/slurm/slurm_job.py", line 348, in _open
    self.shell = rsups.PTYShell(shell_url, self.session, self._logger)
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 244, in __init__
    interactive=self.interactive)
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 199, in initialize
    self._initialize_pty(info['pty'], info)
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 293, in _initialize_pty
    raise rse.NoSuccess("Could not detect shell prompt (timeout)")
  File "/ccs/home/lei/anaconda3/envs/test2/lib/python3.7/site-packages/radical/saga/exceptions.py", line 446, in __init__
    SagaException.__init__ (self, msg, parent, api_object, from_log)

Any idea how to fix it?

andre-merzky commented 4 years ago

Hi @wjlei1990 : I am not sure what's up with the shell prompt problem - is that repeatable? What exactly are you running?

The SAGA slurm adaptor is now fixed for Rhea (in the saga branch fix/rhea), but I need to iterate on RP and EnTK a bit more.

wjlei1990 commented 4 years ago

Hi @andre-merzky , it seems to always fail in one of my virtual environments (created by conda)... It just keeps trying and times out. However, it works in some other virtual environments. I guess it may be a virtual environment issue.

However, even in the virtual envs where it works, it still tries multiple times to get the shell prompt. Do you think you could fix that anyway? I guess this is not that urgent, since I can still use it...

andre-merzky commented 4 years ago

Hi @wjlei1990 : this should not depend on the virtualenv, or at least I can't see how. Can you please send me the command or code you are running, and I'll try to reproduce it. You can also try to set

export RADICAL_SAGA_PTY_LOG_LVL=DEBUG
export RADICAL_SAGA_PTY_LOG_TGT=pty.log

and attach the resulting pty.log to this issue - this should trace the shell interactions. Last but not least, you can set

export RADICAL_SAGA_PTY_SSH_TIMEOUT=60

to see if this makes a difference (the default is 10 seconds).

wjlei1990 commented 4 years ago

PS: Thanks for the help. Since I can get it to work in some virtual envs, this is NOT an urgent issue.


This doesn't seem to work for me; at least it doesn't generate the pty.log file:

export RADICAL_SAGA_PTY_LOG_LVL=DEBUG
export RADICAL_SAGA_PTY_LOG_TGT=pty.log

So I just used RADICAL_LOG_LVL=DEBUG.

The successful case is here, in the file radical.saga.cpi.log. Even though it is successful, I think it tries multiple times (2 or 3 times) to find the shell prompt (between lines 10 and 15): radical.saga.cpi.log

wjlei1990 commented 4 years ago

Here is a bad case (in another virtual env).

Starting from line 10, you can see it keeps trying... until it reaches the maximum number of tries and fails: radical.saga.cpi.log

I also zipped all the logs, in case it helps you locate the issue: slurm_work.bad.zip


Anyway, since this is working now (at least in some virtual envs), I don't think this is an urgent issue.

Also, all the tests I have made are based on the master branch, not fix/rhea.

andre-merzky commented 4 years ago

Thanks for the logs, @wjlei1990 !

What happens if you run this on the command line:

/usr/bin/env TERM=vt100  "/bin/bash"  -i

This should result in a new shell running rather quickly:

rivendell  merzky  ~ $ date +%s.%N; echo "date +%s.%N" | /usr/bin/env TERM=vt100  "/bin/bash"  -i
1575365236.933453427
rivendell  merzky  ~  $ date +%s.%N
1575365237.456314561

Did you try

export RADICAL_SAGA_PTY_SSH_TIMEOUT=60

Can you show me the output of

echo $PS1
echo $PS2
echo $PS3
echo $PS4
echo $PROMPT_COMMAND

Thanks!

wjlei1990 commented 4 years ago

It seems the fix you put into the code here doesn't work for me.

if  'rhea' in self.rm.host.lower():

In my case, self.rm.host.lower() gets the value localhost; it does not contain rhea...

Could you confirm this?
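
For what it's worth, a tiny check with the radical.saga URL class (example URLs only) shows which service URLs would actually contain rhea in the host:

    import radical.saga as saga

    for u in ["fork://localhost", "slurm://", "slurm://rhea.ccs.ornl.gov/"]:
        url = saga.Url(u)
        print(u, "->", url.host, "->", "rhea" in str(url.host).lower())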

shantenujha commented 4 years ago

@andre-merzky @lee212 -- can you look into @wjlei1990's query, please?

andre-merzky commented 4 years ago

Thanks for the ping!

RP should by now have config files which use the fully qualified hostname in the job access URLs, even for local access. Can you please check whether this is the case in the RP version you have installed? If not, you may want to either change the respective config entries (python2) or update RP (python3).

wjlei1990 commented 4 years ago

Got it, thanks. Let me check it and I will post updates here.

mturilli commented 4 years ago

This works now, closing