radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Execution is stalled for long time after using radical pilot to run a program on Rivanna cluster #2894

Closed arupcsedu closed 9 months ago

arupcsedu commented 1 year ago

Hey guys, I have been having this issue for quite some time.

Execution is stalled for more than 10 minutes. Please check the system information:

(rc_arup) -bash-4.2$ radical-stack

  python               : /home/djy8hg/.conda/envs/rc_arup/bin/python3
  pythonpath           :
  version              : 3.9.16
  virtualenv           : rc_arup

  radical.gtod         : 1.20.0
  radical.pilot        : 1.22.0-v1.4.0-4456-gff2e45f@2835-rivanna
  radical.saga         : 1.23.0-v1.22.0-1-g1e21463@devel
  radical.utils        : 1.21.0

I have checked the status of the PR https://github.com/radical-cybertools/radical.pilot/pull/2855

It seems the interactive and ssh testing are still pending. Have a look at the screenshot below. [image]

I have attached the sandbox logs as well.

rp.session.udc-ba35-36.djy8hg.019454.0002.zip rp.session.udc-ba35-36.djy8hg.019454.0003.zip

mtitov commented 1 year ago

Hi @arupcsedu, the issue was in the radical.utils (RU) package, but thankfully it was fixed in the latest release:

  File "/scratch/djy8hg/radical.pilot.sandbox/rp.session.udc-ba35-36.djy8hg.019454.0003/pilot.0000/rp_install/lib/python3.11/site-packages/radical/utils/signatures.py", line 175, in <module>
    from inspect   import getargspec, isclass
ImportError: cannot import name 'getargspec' from 'inspect' (/apps/software/standard/mpi/gcc/11.2.0/openmpi/4.1.4/python/3.11.1/lib/python3.11/inspect.py)
 failed

Please update the RU package to version 1.22.
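For context on the traceback above: `inspect.getargspec` was removed in Python 3.11, and the path in the traceback shows the pilot sandbox running under python3.11. A minimal sketch of the portable alternative, `inspect.getfullargspec`, follows. The helper name `positional_arg_names` and the `example` function are made up for illustration; this is not the actual radical.utils fix.

```python
import inspect


def positional_arg_names(func):
    """Return the positional parameter names of `func`.

    Uses inspect.getfullargspec, which exists on every Python 3 release,
    instead of inspect.getargspec, which is gone as of Python 3.11.
    """
    return inspect.getfullargspec(func).args


def example(a, b, c=1):
    # Hypothetical function, only here to have something to inspect.
    return a + b + c


print(positional_arg_names(example))  # prints ['a', 'b', 'c']
```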

eirrgang commented 1 year ago

Note that my tests of #2855 involved a more recent version of radical.utils than your radical-stack is reporting (but by now I can't remember why...). Have you tried the devel branch?

eirrgang commented 1 year ago

Oops. I was too slow. :-)

arupcsedu commented 1 year ago

With the devel branch I am getting the error below, even though I added the following entry to example/config.json:

"uva.rivanna" : {
    "project" : null,
    "queue"   : "standard",
    "schema"  : "local",
    "cores"   : 8
}

I also added resource_uva.json to the config folder.

This is the error:

================================================================================
Getting Started (RP version 1.22.0)

new session: [rp.session.udc-ba35-36.djy8hg.019454.0009] \
database : [mongodb://rct-tutorial:****@95.217.193.116:27017/rct-tutorial] ok
read config ok

submit pilots
create pilot manager ok
submit 1 pilot(s)
caught Exception: Resource domain 'uva' is unknown.

finalize
closing session rp.session.udc-ba35-36.djy8hg.019454.0009 \
close pilot manager \
wait for 0 pilot(s) 0 ok
ok
session lifetime: 13.8s ok

Traceback (most recent call last):
  File "./09_mpi_tasks.py", line 75, in <module>
    pilot = pmgr.submit_pilots(pdesc)
  File "/home/djy8hg/.conda/envs/rp/lib/python3.8/site-packages/radical/pilot/pilot_manager.py", line 602, in submit_pilots
    pilot = Pilot(pmgr=self, descr=pd)
  File "/home/djy8hg/.conda/envs/rp/lib/python3.8/site-packages/radical/pilot/pilot.py", line 111, in __init__
    = self._session._get_jsurl (pilot)
  File "/home/djy8hg/.conda/envs/rp/lib/python3.8/site-packages/radical/pilot/session.py", line 959, in _get_jsurl
    rcfg = self.get_resource_config(resrc, schema)
  File "/home/djy8hg/.conda/envs/rp/lib/python3.8/site-packages/radical/pilot/session.py", line 652, in get_resource_config
    raise RuntimeError("Resource domain '%s' is unknown." % domain)
RuntimeError: Resource domain 'uva' is unknown.
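The traceback shows the error being raised in `get_resource_config` when the domain part of the label `uva.rivanna` is not known to the session. As a rough illustration of that kind of lookup, here is a sketch that assumes one `resource_<domain>.json` file per domain in some config directory, with host names as top-level keys. The function `find_resource`, the directory layout, and the sample file content are assumptions for illustration, not RP's actual implementation. Note also that the traceback comes from the `rp` conda environment, so it may be worth checking that resource_uva.json was added to a location that this environment reads.

```python
import json
import os
import tempfile


def find_resource(label, config_dir):
    """Resolve a 'domain.host' label against <config_dir>/resource_<domain>.json.

    Assumed layout (hypothetical): one JSON file per domain, host names as keys.
    """
    domain, _, host = label.partition('.')
    path = os.path.join(config_dir, 'resource_%s.json' % domain)
    if not os.path.isfile(path):
        # Same message as in the traceback above.
        raise RuntimeError("Resource domain '%s' is unknown." % domain)
    with open(path) as fin:
        entries = json.load(fin)
    if host not in entries:
        # Hypothetical message, not taken from RP.
        raise RuntimeError("Resource host '%s' is unknown." % host)
    return entries[host]


# Demo with a throwaway directory and made-up file content.
with tempfile.TemporaryDirectory() as tmp:
    try:
        find_resource('uva.rivanna', tmp)
    except RuntimeError as exc:
        print('before adding the file:', exc)

    with open(os.path.join(tmp, 'resource_uva.json'), 'w') as fout:
        json.dump({'rivanna': {'queue': 'standard'}}, fout)

    print('after adding the file:', find_resource('uva.rivanna', tmp))
```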

mtitov commented 1 year ago

@arupcsedu sorry for the confusion. Eric meant the development branch of radical.utils only (and those devel updates were actually released in 1.22), so you can reinstall your stack:

pip uninstall radical.pilot radical.saga radical.utils -y
pip install git+https://github.com/eirrgang/radical.pilot.git@2835-rivanna

eirrgang commented 1 year ago

Yes, I'm sorry for contributing confusion. I agree completely with @mtitov
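After reinstalling, it can help to confirm which versions actually ended up in the active environment, since the radical-stack output at the top of this thread reported radical.utils 1.21.0 while the fix is in 1.22. A small sketch using the standard library's `importlib.metadata` (Python 3.8+) follows. The helper names `installed_versions` and `release_tuple` are made up here, and the version string reported by pip metadata may differ in form from what radical-stack prints.

```python
import re
from importlib import metadata

PACKAGES = ['radical.utils', 'radical.saga', 'radical.pilot']


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def release_tuple(version):
    """Extract (major, minor) from strings such as '1.22.0-v1.4.0-...'."""
    match = re.match(r'(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError('unrecognised version string: %r' % version)
    return int(match.group(1)), int(match.group(2))


for name, version in installed_versions(PACKAGES).items():
    if version is None:
        print('%-15s not installed' % name)
    else:
        print('%-15s %s' % (name, version))
```

For example, `release_tuple('1.21.0') < (1, 22)` is true, which matches the situation reported in the first comment.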

arupcsedu commented 1 year ago

@mtitov and @eirrgang, Thank you, guys.

My first update was in fact done by running the command you shared: pip install git+https://github.com/eirrgang/radical.pilot.git@2835-rivanna But I will cross-check and get back to you.

arupcsedu commented 1 year ago

Hey guys, after updating from devel I still see the same problem. Have a look at the logs.

(rp) -bash-4.2$ ./09_mpi_tasks.py uva.rivanna

Getting Started (RP version 1.22.0)

new session: [rp.session.udc-ba34-36.djy8hg.019455.0000] \
database : [mongodb://rct-tutorial:****@95.217.193.116:27017/rct-tutorial] ok
read config ok

submit pilots
create pilot manager ok
submit 1 pilot(s)
pilot.0000 uva.rivanna 8 cores 0 gpus ok

submit tasks
create task manager ok
create 2 task description(s) .. ok
submit: ########################################################################

gather results
wait :

rp.session.udc-ba34-36.djy8hg.019455.0000_example.zip rp.session.udc-ba34-36.djy8hg.019455.0000_sandbox.zip

mtitov commented 1 year ago

Hi @arupcsedu, thank you for the logs. It seems this issue is related to a race condition in the SAGA component. Can you please try the SAGA branch hotfix/slurm_js_jobs?

pip uninstall radical.saga -y
pip install git+https://github.com/radical-cybertools/radical.saga.git@hotfix/slurm_js_jobs

arupcsedu commented 1 year ago

I am still getting the same behavior.

Let me share a bit more information. This operation creates a job in Slurm on Rivanna, and the job is still in the queued state. Have a look at the screenshot and logs. [image]

================================================================================
Getting Started (RP version 1.22.0)

new session: [rp.session.udc-ba34-36.djy8hg.019457.0001] \
database : [mongodb://rct-tutorial:****@95.217.193.116:27017/rct-tutorial] ok
read config ok

submit pilots
create pilot manager ok
submit 1 pilot(s)
pilot.0000 uva.rivanna 8 cores 0 gpus ok

submit tasks
create task manager ok
create 2 task description(s) .. ok
submit: ########################################################################

gather results
wait :

rp.session.udc-ba34-36.djy8hg.019457.0001.zip

I think it is better to wait until you guys get access to Rivanna. Do you guys have access, already?

mtitov commented 1 year ago

@arupcsedu It seems that your batch job was submitted successfully and is waiting in the queue (that's what your screenshot shows). The RP application will proceed once the batch job starts executing (i.e., once its state changes to RUNNING).

I can look into your client sandbox as well (directory with the session ID name in your current working directory).

Do you guys have access, already?

Not yet, but some of us will get access soon (@AymenFJA).
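To tell a genuinely stalled run apart from a pilot job that is simply waiting in the batch queue, one can ask Slurm for the job state directly. The sketch below wraps `squeue -h -j <jobid> -o %T`, which prints only the state of the given job. The helper names `job_state` and `pilot_can_start` and the job id `12345` are hypothetical, and the demo uses a fake runner so it does not need a Slurm installation.

```python
import subprocess
import types


def job_state(job_id, run=subprocess.run):
    """Return the Slurm state of `job_id` (e.g. 'PENDING', 'RUNNING').

    Returns None when squeue prints nothing, e.g. the job is no longer listed.
    `run` is injectable so the function can be exercised without Slurm.
    """
    proc = run(['squeue', '-h', '-j', str(job_id), '-o', '%T'],
               capture_output=True, text=True)
    return proc.stdout.strip() or None


def pilot_can_start(state):
    """The pilot agent only comes up once the batch job is RUNNING."""
    return state == 'RUNNING'


# Demo with a fake runner that imitates a queued job (no Slurm required).
def fake_run(cmd, **kwargs):
    return types.SimpleNamespace(stdout='PENDING\n')


state = job_state(12345, run=fake_run)   # 12345 is a made-up job id
print(state, pilot_can_start(state))     # prints: PENDING False
```

On the cluster itself, calling `job_state(<your job id>)` without the `run` argument would invoke the real `squeue`.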

andre-merzky commented 9 months ago

closing as outdated