Closed Weiming-Hu closed 5 years ago
Hey Weiming, can you run radical-pilot-fetch-logfiles after sourcing the virtualenv from the same directory as where you have the EnTK script please? You will have a folder of the format re.session.* which you can zip and upload here.
Here you are. re.session.cheyenne1.wuh20.017933.0000.tar.gz
Thank you
I know this one! :-)
IOError: [Errno 2] No such file or directory: '/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/pilot/configs/umgr_default.json'
This seems like an incomplete RP installation. Some versions of pip / setuptools don't install our json config files. How was RP installed in this virtualenv? Can you please check if there are any json files in venv/lib/python2.7/site-packages/radical/pilot/configs?
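For anyone following along, the check asked for above can be scripted. This is only a minimal sketch: the helper names list_json_configs and missing_configs are made up for illustration and are not part of RP, and the directory is the one quoted in the question.

```python
import glob
import os


def list_json_configs(configs_dir):
    """Return the sorted base names of all *.json files in configs_dir."""
    pattern = os.path.join(configs_dir, "*.json")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


def missing_configs(configs_dir, expected):
    """Return the expected file names that are absent from configs_dir."""
    present = set(list_json_configs(configs_dir))
    return [name for name in expected if name not in present]
```

Calling missing_configs("venv/lib/python2.7/site-packages/radical/pilot/configs", ["umgr_default.json"]) from the script directory returns an empty list when the file named in the error is present.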
This is some information that might be helpful:
> pip --version
pip 19.0.1 from /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/pip (python 2.7)
> python --version
Python 2.7.15
I have a message during installation of EnTK:
radical-pilot 0.50.21 has requirement netifaces==0.10.4, but you'll have netifaces 0.10.9 which is incompatible.
And every time after I log onto Cheyenne, I need to reinstall radical.entk otherwise I'm not able to run entk.
Here are the files:
> ls venv/lib/python2.7/site-packages/radical/pilot/configs/
agent_cray_aprun.json agent_osg.json __init__.pyc resource_chameleon.json resource_epsrc.json resource_iu.json resource_lumc.json resource_nersc.json resource_radical.json resource_vtarc_dt.json session_default.json
agent_cray.json agent_rhea.json pmgr_default.json resource_das4.json resource_fub.json resource_local.json resource_ncar.json resource_ornl.json resource_rice.json resource_xsede.json umgr_default.json
agent_default.json __init__.py resource_aliases.json resource_das5.json resource_futuregrid.json resource_lrz.json resource_ncsa.json resource_osg.json resource_stfc.json resource_yale.json
Wait, now I am confused: your ls lists umgr_default.json, but the error explicitly said "No such file or directory". So, which one is it? The installation is a red herring then, as the file seems to be there. Is that a permission issue? Was that a file system issue? The latter would likely mean that it's not reproducible - does the problem persist?
The netifaces warning can be ignored for now - but thanks for mentioning it, I'll look into that...
What could trigger the need for re-installing EnTK?
Hi Andre. I'm afraid the issue is strange. I ran the script again and I get the same error for missing file even when the file actually exists.
OK. This is the weird part. After I got the error message and after I terminated the EnTK, radical.pilot package is gone, literally. It looks like it gets deleted somehow during the process because I can no longer even run the simple command:
> entk-version
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/__init__.py", line 4, in <module>
from radical.entk.pipeline.pipeline import Pipeline
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/pipeline/pipeline.py", line 1, in <module>
import radical.utils as ru
ImportError: No module named utils
And this is what I mentioned that I need to reinstall radical.entk. If I reinstall radical.entk, the missing files will be reinstalled. And I guess if I try to run the script, I'm back to the beginning of this loop.
Please let me know if we need a live help session here because I feel the problem is getting complicated. Thank you very much.
This indeed is strange, I just made some fixes in RP (https://github.com/radical-cybertools/radical.pilot/pull/1808). We have a working RP again on Cheyenne. I'll give Weiming's script a try now.
Thanks Vivek - I don't have a useful suggestion myself, really.
Hey Weiming, I tested your script on Cheyenne as well. It's executing all 3 stages. The tasks in the third stage fail with: the option '--mapping-txt' is required but missing. I'll leave that to you to debug. You can install the fix/cheyenne branch of radical pilot in your virtualenv on Cheyenne and it should be back to normal.
This is awesome. I will give this a try ASAP.
Actually the script has run only once out of about 5 runs. I am not sure where the error is but the pilot seems to fail while the first stage is executing. Same script, no change from the successful run. I can't locate the error and get the impression that the job got killed by the system(?). @andre-merzky could you take a look at the log files please.
Please see the debug logs. log_files.zip
@Weiming-Hu feel free to give it a try, let's see how it behaves for you.
@andre-merzky From the debug.log file (with saga verbose msgs):
2019-02-07 18:00:11,557: radical.saga.pty : pmgr.0000.launching.0 : Thread-2 : DEBUG : read : [ 128] [ 226] ( Job_Name = pilot.0000\n job_state = F\n ctime = Thu Feb 7 17:58:12 2019\n exec_host = r3i0n23/0*36+r3i1n11/0*36\n mtime = Thu Feb 7 17:59:31 2019\n stime = Thu Feb 7 17:58:19 2019\n Exit_status = 265\n)
2019-02-07 18:00:11,557: radical.saga.pty : pmgr.0000.launching.0 : Thread-2 : DEBUG : read : [ 128] [ 10] (PROMPT-0->)
2019-02-07 18:00:11,557: radical.saga.cpi : pmgr.0000.launching.0 : Thread-2 : DEBUG : check state: F
2019-02-07 18:00:11,558: radical.saga.cpi : pmgr.0000.launching.0 : Thread-2 : DEBUG : use state: Done
2019-02-07 18:00:11,558: radical.saga.cpi : pmgr.0000.launching.0 : Thread-2 : INFO : Job monitoring thread updating Job [pbspro://localhost/]-[4275618] (old state: Running, new state: Failed)
Based on this, exit code 265 means signal 9. So, the job is being killed (note: not the rp client -- probably not any process limits).
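The arithmetic behind that reading can be sketched as follows. This assumes the usual PBS Pro convention that an Exit_status of 256 or more encodes 256 plus the signal number that killed the job; the function name decode_pbs_exit_status is hypothetical, and the exact ranges should be checked against the PBS documentation for the installed version.

```python
def decode_pbs_exit_status(exit_status):
    """Classify a PBS Exit_status as (kind, value).

    Assumption: values >= 256 mean "killed by signal (exit_status - 256)",
    negative values are PBS-internal errors, and anything else is the
    exit code of the job's top-level process.
    """
    if exit_status >= 256:
        return ("signal", exit_status - 256)
    if exit_status < 0:
        return ("pbs_error", exit_status)
    return ("exit_code", exit_status)


# The value from the log above: 265 - 256 = 9, which is SIGKILL on Linux.
kind, value = decode_pbs_exit_status(265)
print(kind, value)
```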
I'm afraid the problem persists and I still get the error that umgr_default.json is missing. Here is a detailed breakdown of what I did to help you reproduce what I have.
First, I load the python module after logging onto Cheyenne:
> module purge
> module load python/2.7.15 gnu/7.2.0 hdf5/1.10.1
> module ls
Currently Loaded Modules:
1) python/2.7.15 2) gnu/7.2.0 3) hdf5/1.10.1
Then, I build a virtual env and install radical.entk, pyyaml, and netcdf.
> virtualenv venv
> source venv/bin/activate
> HDF5_INCDIR=/glade/u/apps/ch/opt/hdf5/1.10.1/gnu/7.1.0/include/ HDF5_LIBDIR=/glade/u/apps/ch/opt/hdf5/1.10.1/gnu/7.1.0/lib pip install pyyaml netCDF4 radical.entk
... [Ignore log file]
And I checked that the umgr_default.json file exists.
> file venv/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py
venv/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py: Python script, ASCII text executable
And entk works:
> entk-version
0.7.11
It should be properly set up by now. Now run the script.
> python runme.py --wcfg workflow_cfg_cheyenne.yml --rcfg resource_cfg_cheyenne.yml
... [Ignore some logs]
2019-02-08 15:38:10,254: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has failed
... [Ignore some more logs]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 263, in _process_tasks
umgr = rp.UnitManager(session=rmgr._session)
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 102, in __init__
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/read_json.py", line 26, in read_json
IOError: [Errno 2] No such file or directory: '/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/pilot/configs/umgr_default.json'
Weird! It does not find the file. If I look into the folder, the file is no longer there.
> ls venv/lib/python2.7/site-packages/radical/entk/execman/rp
__init__.py __init__.pyc resource_manager.py resource_manager.pyc task_manager.py task_manager.pyc task_processor.py task_processor.pyc
And entk is no longer working.
> entk-version
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/__init__.py", line 4, in <module>
from radical.entk.pipeline.pipeline import Pipeline
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/pipeline/pipeline.py", line 1, in <module>
import radical.utils as ru
ImportError: No module named utils
I'm sorry for being verbose. But this is indeed so strange for me.
Just an update that I have dropped an email with Cheyenne support to see why the jobs are being killed.
@Weiming-Hu The set of commands I use to install:
module load python/2.7.15
virtualenv $HOME/ve
source $HOME/ve/bin/activate
pip install radical.entk
pip install netcdf4
Could you give that a try? Also please check that there are no conflicting settings in your bashrc file.
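One way to look for conflicting shell settings is to check where the tools resolve on $PATH once the virtualenv is active. The sketch below is an illustration only: find_on_path and outside_venv are hypothetical helpers, not part of EnTK or RP, and the tool names are the ones used in this thread.

```python
import os


def find_on_path(name, path=None):
    """Return the first executable file called name on path, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def outside_venv(names, venv_dir, path=None):
    """Map each tool that does not resolve to venv_dir/bin to what was found."""
    bin_dir = os.path.join(os.path.abspath(venv_dir), "bin")
    result = {}
    for name in names:
        resolved = find_on_path(name, path)
        if resolved is None or os.path.dirname(os.path.abspath(resolved)) != bin_dir:
            result[name] = resolved
    return result
```

With the virtualenv activated, outside_venv(["python", "pip", "entk-version"], os.environ["VIRTUAL_ENV"]) should return an empty dict; any entry points at a directory that shadows the virtualenv on $PATH.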
I don't have a .bashrc file, but a .bash_profile. I set the RADICAL_PILOT_DBURL and RADICAL_ENTK_VERBOSE variables.
I have to install pyyaml before I can run the script.
Unfortunately, the same problem persists. Same error message.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/glade/u/home/wuh20/ve/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 263, in _process_tasks
umgr = rp.UnitManager(session=rmgr._session)
File "/glade/u/home/wuh20/ve/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 102, in __init__
File "/glade/u/home/wuh20/ve/lib/python2.7/site-packages/radical/utils/read_json.py", line 26, in read_json
IOError: [Errno 2] No such file or directory: '/glade/u/home/wuh20/ve/lib/python2.7/site-packages/radical/pilot/configs/umgr_default.json'
I believe Weiming can also reproduce the error that I can now.
I confirmed that Cheyenne is killing our jobs. It seems to be because of memory exhaustion by the tasks on the compute nodes. See response from Cheyenne admin:
Apparently your jobs are killed due to memory exhaustion in the nodes in which they were running.
Out of the 23 jobs that you ran over last 5 days, 4 were successful and 19 failed. The ones that
are failed are due to memory exhaustion according to our logs. But I am little suspicious about our
data as all the pieces do not seem consistent. Is there any way you can share your job suite to me
for myself to submit your jobs and dig into more details ? If yes, please point me to your
directory location where your jobs, scripts etc are there.
I am trying to prepare a single job that can reproduce the error. @Weiming-Hu do you have any profiles or jobs that you ran on Cheyenne? If you already have them, I can use the same to reproduce the error.
@Weiming-Hu , can you please try the following:
chmod -R a-w ve/lib/python2.7/site-packages/radical
I hope this triggers a write error for whatever is purging your installation...
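A complementary way to see what disappears, and when, is to snapshot the installation before the run and compare afterwards. This is a rough sketch with hypothetical helper names (snapshot, removed_files); it only reports which files vanished, not which process removed them.

```python
import os


def snapshot(root):
    """Return the set of file paths under root, relative to root."""
    files = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            files.add(os.path.relpath(full, root))
    return files


def removed_files(before, after):
    """Return the sorted list of paths present in before but not in after."""
    return sorted(before - after)
```

Taking snapshot("ve/lib/python2.7/site-packages/radical") before and after running the EnTK script and passing both sets to removed_files lists everything that was purged during the run.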
@andre-merzky The error with the missing JSON file is now resolved. I had to comment out a PATH setting in his bash_profile. Also, there might have been an error with the netifaces installation, which resolved when upgrading.
@vivek-bala @Weiming-Hu : if memory usage is a problem, you can try adding ulimit -s 1024 to your bashrc. Python by default creates threads with 8MB stack size - the above command would limit that to 1MB. I am not sure if stacksize causes memory issues, but I have seen systems where that's the case.
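For threads created from within a Python script, a similar limit can be requested with the standard library instead of ulimit. The sketch below is an assumption-laden illustration: run_with_small_stack is a made-up helper, it only affects threads started by the calling interpreter (not RP's own processes), and whether it helps with the memory problem here is untested.

```python
import threading

ONE_MIB = 1024 * 1024  # same size as "ulimit -s 1024", which is given in KiB


def run_with_small_stack(target, stack_bytes=ONE_MIB):
    """Run target() in a new thread created with a stack of stack_bytes."""
    # threading.stack_size(n) sets the size for threads created afterwards
    # and returns the previous setting (0 means "platform default").
    previous = threading.stack_size(stack_bytes)
    try:
        worker = threading.Thread(target=target)
        worker.start()
    finally:
        # Restore the earlier setting so other threads are unaffected.
        threading.stack_size(previous)
    worker.join()
```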
How did the $PATH setting affect file persistence / deletion? This is funny :-)
Cool, will try that as well. I think we are also reading ~GB of data in our tasks, so I will test if that is the reason as well. (the tasks start but don't complete)
How did the $PATH setting affect file persistence / deletion? This is funny :-)
I don't know really. I also used the fix/cheyenne branch instead of master. Didn't dig into what the actual source is, I simply repeated the steps that I did to create my VE. You want me to find out? (no promises :-P )
Yes, that would be great. Don't get lost in that part of course - but I do remember reports in the past about disappearing installations, so this is not a totally isolated incident. If you happen to be able to reproduce this outside of Cheyenne, I'd be happy to pick this up myself! Thanks :-)
Update: The original issue has been fixed for Weiming. We have been able to run tasks from the three stages on Cheyenne. The tasks themselves initially failed since multiple tasks (reading ~20-30GB of data) were running on a single compute node, thus barfing the memory. This has been fixed by keeping 1-2 tasks per compute node.
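The tasks-per-node limit follows from simple arithmetic. In the sketch below the node memory and the headroom are hypothetical numbers chosen for illustration (they are not taken from this thread or from the Cheyenne documentation); only the 20-30GB per-task figure comes from the update above.

```python
def max_tasks_per_node(node_mem_gb, task_mem_gb, headroom_gb=0.0):
    """Return how many tasks of task_mem_gb fit on one node.

    headroom_gb is memory reserved for the OS and the pilot agent.
    """
    if task_mem_gb <= 0:
        raise ValueError("task_mem_gb must be positive")
    usable = node_mem_gb - headroom_gb
    return max(0, int(usable // task_mem_gb))


# Hypothetical node with 64GB, keeping 8GB free for everything else.
print(max_tasks_per_node(64, 20, headroom_gb=8))  # tasks reading ~20GB each
print(max_tasks_per_node(64, 30, headroom_gb=8))  # tasks reading ~30GB each
```

Under these assumed numbers, the result drops from two tasks per node to one as the per-task data grows from 20GB to 30GB.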
Status: The tasks themselves do not complete due to an error from the MPI layer on Cheyenne.
MPT: shepherd terminated: r11i3n25 - job aborting
We have a ticket with the help desk to resolve this.
Thank you, Vivek. Let me know when I should give it a try.
Hey Weiming, you can give the scripts a try now. There were some typos in the script that caused the MPI shepherd termination. It should be resolved now.
Now it seems that we have a slightly different situation. The pilot has completed, but the final message is in red:
2019-02-17 19:58:30,732: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0027 with state EXECUTED
2019-02-17 19:58:30,955: radical.entk.task_manager.0000: task-manager : umgr.0000.idler._state_pull_cb: INFO : Transition of task.0027 to new state EXECUTED successful
2019-02-17 19:58:30,956: radical.entk.task_manager.0000: task-manager : umgr.0000.idler._state_pull_cb: INFO : Pushed task task.0027 with state EXECUTED to completed queue re.session.cheyenne6.wuh20.017945.0000-completedq-1
2019-02-17 19:58:31,046: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0024 with state FAILED
2019-02-17 19:58:31,252: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Transition of task.0024 to new state FAILED successful
2019-02-17 19:58:31,412: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Got finished task task.0027 from queue
2019-02-17 19:58:31,671: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0027 with state DEQUEUEING
2019-02-17 19:58:31,889: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Transition of task.0027 to new state DEQUEUEING successful
2019-02-17 19:58:32,296: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0027 with state DEQUEUED
2019-02-17 19:58:32,525: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Transition of task.0027 to new state DEQUEUED successful
...
2019-02-17 20:00:21,183: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: DONE
2019-02-17 20:00:21,183: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has completed
And there is no output data, and no unit*** folders.
Mostly addressed; the remaining issues are in the last stage and concern the thread number configuration.
Issue resolved in commit f73b197.
Hi. I have again got some errors from this thread.
2019-03-19 11:21:58,557: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: ERROR : Unknown error in synchronizer: Expected (base) type(s) <type 'str'>, but got <type 'NoneType'>..
Terminating thread
Traceback (most recent call last):
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 888, in _synchronizer
task_update(msg, '%s-sync-to-cb' % self._sid, props.correlation_id, mq_channel)
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 641, in task_update
completed_task.from_dict(msg['object'])
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/task/task.py", line 810, in from_dict
actual_type=type(d['executable']))
TypeError: Expected (base) type(s) <type 'str'>, but got <type 'NoneType'>.
This looks like a problem that we used to have. Does anybody have an idea how to solve this?
Here is my environment.
(venv) wuh20@cheyenne2:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> entk-version
0.7.15
(venv) wuh20@cheyenne2:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> radical-stack
python : 2.7.15
pythonpath :
virtualenv : /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv
radical.entk : 0.7.15
radical.pilot : 0.50.21-v0.50.21-2-gbfedd8f@fix-cheyenne
radical.utils : 0.50.3
saga : 0.50.5
Thank you
Looks like even if I revert to the old version and old code, I still get errors.
2019-03-20 14:00:25,335: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: ERROR : Unknown error in synchronizer: Expected (base) type(s) <type 'list'>, but got <type 'unicode'>..
Terminating thread
Traceback (most recent call last):
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv-old/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 907, in _synchronizer
task_update(msg, '%s-sync-to-enq' % self._sid, props.correlation_id, mq_channel)
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv-old/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 641, in task_update
completed_task.from_dict(msg['object'])
File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv-old/lib/python2.7/site-packages/radical/entk/task/task.py", line 778, in from_dict
actual_type=type(d['executable']))
TypeError: Expected (base) type(s) <type 'list'>, but got <type 'unicode'>.
This has been resolved. @vivek-bala is going to make a hot release for this issue. Waiting on Vivek to close this issue.
Resolved.
Hi. I have had some problems running EnTK on Cheyenne. Here is the error message:
The radical pilot sandbox folder should be accessible. It is at /glade/u/home/wuh20/scratch/radical.pilot.sandbox. Please let me know how I can assist the debugging process. Thank you very much!