Closed: darkwhite29 closed this issue 3 years ago.
Hey @darkwhite29. Looking at bootstrap_0.out, it seems your agent-side Python environment was not created properly, since one of the serialization packages, dill, is missing. From bootstrap_0.out:
purge install source at radical.pilot-1.6.6/
1624331006.0000,rp_install_stop,bootstrap_0,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
1624331006.0000,ve_setup_stop,bootstrap_0,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
verify python viability: /gpfs/alpine/csc299/scratch/litan/radical.pilot.sandbox/ve.ornl.summit.1.6.6/bin/python ... ok
verify module viability: radical.pilot ...Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/gpfs/alpine/scratch/litan/csc299/radical.pilot.sandbox/rp.session.login4.litan.018799.0027/pilot.0000/rp_install/lib/python3.7/site-packages/radical/pilot/__init__.py", line 43, in <module>
from . import agent
File "/gpfs/alpine/scratch/litan/csc299/radical.pilot.sandbox/rp.session.login4.litan.018799.0027/pilot.0000/rp_install/lib/python3.7/site-packages/radical/pilot/agent/__init__.py", line 7, in <module>
from .mpi_worker import MPI_Func_Worker
File "/gpfs/alpine/scratch/litan/csc299/radical.pilot.sandbox/rp.session.login4.litan.018799.0027/pilot.0000/rp_install/lib/python3.7/site-packages/radical/pilot/agent/mpi_worker.py", line 7, in <module>
import dill
ModuleNotFoundError: No module named 'dill'
failed
python installation cannot load module radical.pilot - abort
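As a quick sanity check, the same import failure can be reproduced outside the bootstrapper. This is a minimal standard-library sketch (the module names come from the traceback above; the script path in the comment is illustrative), to be run with the same Python the pilot uses:

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` is importable by this interpreter."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:   # raised when a parent package is absent
        return False

# run this with the pilot's Python, e.g.:
#   /path/to/ve.ornl.summit.1.6.6/bin/python check_modules.py
for mod in ("dill", "radical.pilot"):
    print("%-15s %s" % (mod, "ok" if module_available(mod) else "MISSING"))
```

If dill shows up as MISSING for the agent-side interpreter but not for the client-side one, the two environments have diverged.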
What you can do is delete the existing RP environment on the agent side and let RP do a fresh install, then see if that works. If that does not help, you can create your own venv, install RADICAL-Pilot into it, and let RP use that specific env by specifying its path in the resource_ornl.json file and setting "virtenv_mode": "use".
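A sketch of those two options as shell steps (all paths here are illustrative, not taken from your setup; adjust to your site before running):

```shell
# Option 1: remove the stale agent-side env so the bootstrapper rebuilds it
rm -rf /path/to/radical.pilot.sandbox/ve.ornl.summit.1.6.6

# Option 2: build a dedicated venv and install RP into it yourself
python3 -m venv /path/to/ve.rp
. /path/to/ve.rp/bin/activate
pip install radical.pilot    # pulls in dill and the other dependencies
```

The venv path from option 2 is what then goes into the "virtenv" entry of resource_ornl.json.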
Thank you for digging this out. I created a new Python environment and did a fresh install of RP. However, the same error occurred. I checked whether the dill package is installed:
(mocu2) [litan@login5.summit N7]$ pip install dill
Requirement already satisfied: dill in /autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu2/lib/python3.7/site-packages (0.3.4)
I am not sure the error is really related to an incorrectly installed RCT stack, but it is possible that the installed dill package is somehow not picked up by RCT. Below is my stack:
(mocu2) [litan@login5.summit N7]$ radical-stack
python : /ccs/home/litan/miniconda3/envs/mocu2/bin/python3
pythonpath : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
version : 3.7.9
virtualenv : mocu2
radical.entk : 1.6.5
radical.gtod : 1.5.0
radical.pilot : 1.6.6-v1.6.6-91-g3de34a4@feature-funcs_v2
radical.saga : 1.6.6
radical.utils : 1.6.6
Any thoughts at this point?
Did you recreate the venv used by the script you are running, or the one used by the pilot in radical.pilot.sandbox? mocu2 sounds like the former, but @AymenFJA referred to the latter, which is the one that triggers the error in the executor.
@andre-merzky Yes indeed, I recreated a new conda virtual environment and reinstalled RP there, but that did not work. Now I am trying the second approach @AymenFJA suggested.
Just to confirm I am doing it correctly: I have an old mocu environment and a new mocu-pilot environment (a fresh installation of RP was done in mocu-pilot, and I saw that dill was installed successfully):
...
Successfully built radical.pilot
Installing collected packages: urllib3, idna, chardet, setproctitle, requests, regex, radical.gtod, pyzmq, pymongo, netifaces, msgpack, future, colorama, radical.utils, parse, apache-libcloud, whichcraft, radical.saga, python-hostlist, ntplib, mpi4py, dill, radical.pilot
Successfully installed apache-libcloud-3.3.1 chardet-4.0.0 colorama-0.4.4 dill-0.3.4 future-0.18.2 idna-2.10 mpi4py-3.0.3 msgpack-1.0.2 netifaces-0.11.0 ntplib-0.4.0 parse-1.19.0 pymongo-3.11.4 python-hostlist-1.21 pyzmq-22.1.0 radical.gtod-1.5.0 radical.pilot-1.6.6 radical.saga-1.6.6 radical.utils-1.6.6 regex-2021.4.4 requests-2.25.1 setproctitle-1.2.2 urllib3-1.26.5 whichcraft-0.6.1
Now, in the resource_ornl.json used by mocu, I need to point to the RP installed in mocu-pilot. Below is the content of that resource_ornl.json:
"summit": {
"description" : "ORNL's summit, a Cray XK7",
"notes" : null,
"schemas" : ["local"],
"local" : {
"job_manager_hop" : "fork://localhost/",
"job_manager_endpoint" : "lsf://localhost/",
"filesystem_endpoint" : "file://localhost/"
},
"default_queue" : "batch",
"resource_manager" : "LSF_SUMMIT",
"agent_config" : "default",
"agent_scheduler" : "CONTINUOUS",
"agent_spawner" : "POPEN",
"agent_launch_method" : "JSRUN",
"task_launch_method" : "JSRUN",
"mpi_launch_method" : "JSRUN",
"pre_bootstrap_0" : ["module unload xl",
"module unload xalt",
"module unload spectrum-mpi",
"module unload py-pip",
"module unload py-virtualenv",
"module unload py-setuptools",
"module load gcc/8.1.1",
"module load zeromq/4.2.5",
"module load python/3.7.0",
"module list"],
"pre_bootstrap_1" : ["module unload xl",
"module unload xalt",
"module unload spectrum-mpi",
"module unload py-pip",
"module unload py-virtualenv",
"module unload py-setuptools",
"module load gcc/8.1.1",
"module load zeromq/4.2.5",
"module load python/3.7.0",
# increase process limit on node
"ulimit -u 65536"],
"valid_roots" : ["$MEMBERWORK/"],
"default_remote_workdir" : "$MEMBERWORK/%(pd.project)s",
"rp_version" : "local",
"virtenv_mode" : "use",
"stage_cacerts" : true,
"python_dist" : "default",
"virtenv_dist" : "default",
"gpus_per_node" : 6,
"sockets_per_node" : 2,
"lfs_per_node" : "/tmp",
"system_architecture" : {"smt": 4,
"options": ["gpumps", "nvme"]}
},
@AymenFJA says "specifying the path to that env in the resource_ornl.json file". Where in this file do I specify the path to mocu-pilot? Is it rp_version? Thanks!
Hey @darkwhite29, to make RP use a preinstalled env, all you need is the following:
"rp_version" : "installed",
"virtenv" : "/path/to/your/venv/mocu-pilot/",
"virtenv_mode" : "use",
FYI: if you are using an Anaconda-based Python distribution, please also change python_dist to anaconda.
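Putting both comments together, the relevant fragment of resource_ornl.json would look roughly like this (the virtenv path is illustrative and must point at your own mocu-pilot env):

```json
{
    "rp_version"   : "installed",
    "virtenv"      : "/ccs/home/litan/miniconda3/envs/mocu-pilot",
    "virtenv_mode" : "use",
    "python_dist"  : "anaconda"
}
```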
Also, I just noticed that because I did not test these executors on Summit, I never created the appropriate config file for that machine. As a temporary solution, please change the following in your resource_ornl file, and let us know if that works.
Thanks @AymenFJA for the solution. I use miniconda instead of anaconda, but I still set python_dist to anaconda per your advice.
It is getting better now. The dill package error is gone at least, but I still get the same error on the command line (all the task.0000xx folders at the client are empty, as before):
(mocu2) [litan@login4.summit N7]$ python runMainForPerformanceMeasure.py
Unstable system has been found
Round: 0 / 1 - iODE Iteration: 0 Initial MOCU: 1.520765083606797 Computation time: 2.1073079109191895
iterative: True
================================================================================
An HPC Workflow for MOCU on GPU
================================================================================
new session: [rp.session.login4.litan.018801.0004] \
database : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test] ok
--------------------------------------------------------------------------------
submit pilots
create pilot manager ok
submit 1 pilot(s)
pilot.0000 ornl.summit 672 cores 24 gpus ok
--------------------------------------------------------------------------------
submit tasks
create task manager ok
create: ########################################################################
submit: ########################################################################
--------------------------------------------------------------------------------
gather results
wait : Traceback (most recent call last):
File "runMainForPerformanceMeasure.py", line 97, in <module>
MOCUCurve, experimentSequence, timeComplexity = findMOCUSequence(criticalK, isSynchronized, MOCUInitial, K_max, w, N, deltaT, MVirtual, MReal, TVirtual, TReal, aLowerUpdated, aUpperUpdated, it_idx, update_cnt, iterative = iterative)
File "/gpfs/alpine/csc299/scratch/litan/MOCU/Byung-Jun/new/ExaLearn-ODED-Kuramoto-main/N7/findMOCUSequence.py", line 145, in findMOCUSequence
tmgr.wait_tasks()
File "/ccs/home/litan/miniconda3/envs/mocu2/lib/python3.7/site-packages/radical/pilot/task_manager.py", line 1006, in wait_tasks
time.sleep (0.1)
KeyboardInterrupt
More has been generated at the client, though. The entire session sandbox is attached: rp.session.login4.litan.018801.0004.zip. Checking bootstrap_0.out, I notice some errors I am not familiar with. Could you please take a look? Thanks!
I confirm that running the example code here produces the same error as my workflow:
(mocu2) [litan@login3.summit test]$ python 12_task_function_1.py ornl.summit
================================================================================
Getting Started (RP version 1.6.6)
================================================================================
new session: [rp.session.login3.litan.018801.0007] \
database : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test] ok
read config ok
--------------------------------------------------------------------------------
submit pilots
create pilot manager ok
submit 1 pilot(s)
pilot.0000 ornl.summit 168 cores 6 gpus ok
--------------------------------------------------------------------------------
submit tasks
create task manager ok
create: ########################################################################
submit: ########################################################################
--------------------------------------------------------------------------------
gather results
wait : --------------
RADICAL Utils -- Stacktrace [57244] [MainThread]
litan 57244 53655 7 14:29 pts/9 00:00:48 | \_ python 12_task_function_1.py ornl.summit
litan 57590 57244 0 14:29 pts/66 00:00:00 | \_ /bin/bash -i
litan 57600 57244 0 14:29 pts/67 00:00:00 | \_ /bin/sh -i
Traceback (most recent call last):
File "12_task_function_1.py", line 116, in <module>
tmgr.wait_tasks()
File "/ccs/home/litan/miniconda3/envs/mocu2/lib/python3.7/site-packages/radical/pilot/task_manager.py", line 1006, in wait_tasks
time.sleep (0.1)
KeyboardInterrupt
--------------
exit requested
--------------------------------------------------------------------------------
finalize
closing session rp.session.login3.litan.018801.0007 \
close task manager ok
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
+ rp.session.login3.litan.018801.0007 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 673.3s ok
--------------------------------------------------------------------------------
Hey @darkwhite29, checking the radical.log, I see:
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/utils/component.py", line 480, in _worker_thread
self._initialize()
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/utils/component.py", line 623, in _initialize
self.initialize()
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/agent/executing/funcs.py", line 110, in initialize
self._spawn(self._launcher, funcs)
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/agent/executing/funcs.py", line 145, in _spawn
launch_cmd, hop_cmd = launcher.construct_command(funcs, fname)
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/pilot/agent/launch_method/jsrun.py", line 138, in construct_command
task_sandbox = t['task_sandbox_path']
KeyError: 'task_sandbox_path'
It shows that the agent_launch_method is JSRUN, which is not compatible with the current funcs agent. Can you change it to "agent_launch_method" : "SSH" and give it a try, please?
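As an aside on the traceback above: the KeyError comes from an unguarded dictionary lookup in the launcher. A small illustration of the pattern (the task contents here are hypothetical, not RP's actual task schema):

```python
# hypothetical task description that lacks the expected key:
task = {"uid": "task.000000"}

# an unguarded lookup -- task['task_sandbox_path'] -- raises KeyError;
# a defensive lookup supplies a fallback instead:
sandbox = task.get("task_sandbox_path", "/tmp")
print(sandbox)
```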
Unfortunately still the same error:
(mocu2) [litan@login5.summit test]$ python 12_task_function_1.py ornl.summit
================================================================================
Getting Started (RP version 1.6.6)
================================================================================
new session: [rp.session.login5.litan.018802.0001] \
database : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test] ok
read config ok
--------------------------------------------------------------------------------
submit pilots
create pilot manager ok
submit 1 pilot(s)
pilot.0000 ornl.summit 168 cores 6 gpus ok
--------------------------------------------------------------------------------
submit tasks
create task manager ok
create: ########################################################################
submit: ########################################################################
--------------------------------------------------------------------------------
gather results
wait : --------------
RADICAL Utils -- Stacktrace [872] [MainThread]
litan 872 60887 6 11:12 pts/38 00:04:17 | \_ python 12_task_function_1.py ornl.summit
litan 1361 872 0 11:12 pts/48 00:00:00 | \_ /bin/bash -i
litan 1378 872 0 11:12 pts/49 00:00:00 | \_ /bin/sh -i
Traceback (most recent call last):
File "12_task_function_1.py", line 116, in <module>
tmgr.wait_tasks()
File "/ccs/home/litan/miniconda3/envs/mocu2/lib/python3.7/site-packages/radical/pilot/task_manager.py", line 1006, in wait_tasks
time.sleep (0.1)
KeyboardInterrupt
--------------
exit requested
--------------------------------------------------------------------------------
finalize
closing session rp.session.login5.litan.018802.0001 \
close task manager ok
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
+ rp.session.login5.litan.018802.0001 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 3692.6s ok
--------------------------------------------------------------------------------
The entire session sandbox is attached below:
rp.session.login5.litan.018802.0001.zip
I noticed that in our submitted papers RAPTOR has been tested on Summit before. @andre-merzky, do you have a tested resource_ornl.json for Summit? Thank you so much.
Hi @darkwhite29, from the log files I can see that the executor is up and running (you can see that under the func_exec.0000 folder), but func_exec.0000.err shows that it is a file system permission issue:
Traceback (most recent call last):
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/bin/radical-pilot-agent-funcs", line 316, in <module>
executor = Executor()
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/bin/radical-pilot-agent-funcs", line 66, in __init__
self._initialize()
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/bin/radical-pilot-agent-funcs", line 96, in _initialize
self._zmq_req = ru.zmq.Getter(channel='funcs_req_queue', url=addr_req)
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/utils/zmq/queue.py", line 536, in __init__
self._log = Logger(name=self._uid, ns='radical.utils')
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/utils/logger.py", line 286, in __init__
elif t in ['.'] : h = FSHandler("%s/%s.log" % (p, n))
File "/autofs/nccs-svm1_home1/litan/miniconda3/envs/mocu-pilot/lib/python3.7/site-packages/radical/utils/logger.py", line 174, in __init__
logging.FileHandler.__init__(self, target)
File "/ccs/home/litan/miniconda3/envs/mocu-pilot/lib/python3.7/logging/__init__.py", line 1087, in __init__
StreamHandler.__init__(self, self._open())
File "/ccs/home/litan/miniconda3/envs/mocu-pilot/lib/python3.7/logging/__init__.py", line 1116, in _open
return open(self.baseFilename, self.mode, encoding=self.encoding)
OSError: [Errno 30] Read-only file system: '/autofs/nccs-svm1_home1/litan/funcs_req_queue.get.0000.log'
I do not know how to proceed from here, @andre-merzky any suggestions, please?
Thanks @AymenFJA for the diagnosis. I confirm I was running from the GPFS filesystem on Summit, which is readable and writable during runs:
[litan@login5.summit N7]$ pwd
/gpfs/alpine/csc299/scratch/litan/MOCU/Byung-Jun/new/ExaLearn-ODED-Kuramoto-main/N7
However, my virtual environment is in the /home directory, which is read-only on the compute nodes. All my previously used virtual environments are in /home, and all previous RCT runs were free of this issue. Maybe RAPTOR somehow writes into the virtual environment's location at runtime?
I reinstalled conda in GPFS and reran the workflow, and I still get the same error in func_exec.0000.err:
Traceback (most recent call last):
File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/bin/radical-pilot-agent-funcs", line 316, in <module>
executor = Executor()
File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/bin/radical-pilot-agent-funcs", line 66, in __init__
self._initialize()
File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/bin/radical-pilot-agent-funcs", line 96, in _initialize
self._zmq_req = ru.zmq.Getter(channel='funcs_req_queue', url=addr_req)
File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/site-packages/radical/utils/zmq/queue.py", line 536, in __init__
self._log = Logger(name=self._uid, ns='radical.utils')
File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/site-packages/radical/utils/logger.py", line 286, in __init__
elif t in ['.'] : h = FSHandler("%s/%s.log" % (p, n))
File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/site-packages/radical/utils/logger.py", line 174, in __init__
logging.FileHandler.__init__(self, target)
File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/logging/__init__.py", line 1087, in __init__
StreamHandler.__init__(self, self._open())
File "/gpfs/alpine/csc299/scratch/litan/miniconda3/envs/mocu/lib/python3.7/logging/__init__.py", line 1116, in _open
return open(self.baseFilename, self.mode, encoding=self.encoding)
OSError: [Errno 30] Read-only file system: '/autofs/nccs-svm1_home1/litan/funcs_req_queue.get.0000.log'
The function log is somehow still written to the /home directory. Any ideas at this point?
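One way to see where the executor would actually be able to write is to probe directory writability directly. This is a generic standard-library sketch, not part of the RCT API:

```python
import os
import tempfile

def is_writable(path: str) -> bool:
    """Probe writability by actually creating (and removing) a temp file."""
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:        # e.g. Errno 30: read-only file system
        return False

# on a Summit compute node, $HOME would report 'not writable' here
for d in (os.path.expanduser("~"), os.getcwd()):
    print(d, "->", "writable" if is_writable(d) else "not writable")
```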
@AymenFJA: it looks like the func executor gets $HOME as its sandbox. I would suggest adding this to the executor startup script in executing/funcs.py, around line 168:
fout.write('cd %s\n' % sandbox)
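To illustrate the effect of that one-line fix: the executor is launched through a generated startup script, and prepending a cd pins its working directory (and hence any relative log paths) to the sandbox. A self-contained sketch of the pattern, with a placeholder command standing in for the real executor:

```python
import os
import subprocess
import tempfile

sandbox = tempfile.mkdtemp()                  # stands in for the pilot sandbox

# generate a startup script the way the executor does, with the added 'cd':
script = os.path.join(sandbox, "start_funcs.sh")
with open(script, "w") as fout:
    fout.write("#!/bin/sh\n")
    fout.write("cd %s\n" % sandbox)           # the suggested one-line fix
    fout.write("pwd > where_logs_go.txt\n")   # stands in for the executor

subprocess.run(["sh", script], check=True)
with open(os.path.join(sandbox, "where_logs_go.txt")) as f:
    print(f.read().strip())                   # inside the sandbox, not $HOME
```

Without the cd line, the relative path where_logs_go.txt would resolve against whatever directory the script was launched from, which is the $HOME behavior seen above.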
I confirm the fix @andre-merzky provided works! Thanks @andre-merzky ! Have a nice weekend.
A quick question, @andre-merzky: to use RAPTOR, do I need to make any changes to resource_ornl.json as I did here, or can I just use the default one? Thanks!
Hey @darkwhite29 - you should be able to use the released version of RP, and you don't need any changes to the configs.
Thanks a lot @andre-merzky for the follow-up, and thanks a lot @AymenFJA for the persistent help. I think this ticket can be closed now.
Brief background (from the original issue description): a function named computeExpectedRemainingMOCU(), which contains mostly CUDA code, needs to run on a GPU. I now get an error when running the workflow. Code and log files for client and agent are attached in a zip file for your information: function_call_issue.zip