EnTK failed on Cheyenne

Weiming-Hu commented 5 years ago

Hi. I have had some problems running EnTK on Cheyenne. Here is the error message:

> python runme.py --rcfg resource_cfg_cheyenne.yml --wcfg workflow_cfg_cheyenne.yml
2019-02-05 16:12:20,091: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15 (default, Jan 11 2019, 15:22:07) [GCC 7.3.0]
2019-02-05 16:12:20,092: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 42256/MainThread
2019-02-05 16:12:20,092: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Application Manager initialized
2019-02-05 16:12:21,026: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15 (default, Jan 11 2019, 15:22:07) [GCC 7.3.0]
2019-02-05 16:12:21,026: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    :                      pid/tid: 42256/MainThread
2019-02-05 16:12:21,029: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15 (default, Jan 11 2019, 15:22:07) [GCC 7.3.0]
2019-02-05 16:12:21,029: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 42256/MainThread
2019-02-05 16:12:21,029: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Created resource manager object: resource_manager.0000
2019-02-05 16:12:21,030: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Resource description validated
Processing /glade/u/home/wuh20/scratch/data/AnEn/forecasts/201801.nc Done!                                                               
Processing /glade/u/home/wuh20/scratch/data/AnEn/forecasts/201802.nc Done!                                                               
Processing /glade/u/home/wuh20/scratch/data/AnEn/analysis/201801.nc Done!                                                                
Processing /glade/u/home/wuh20/scratch/data/AnEn/analysis/201802.nc Done!                                                                
Creating standard deviation task task-sd-calc-00000                                                                                      
Creating standard deviation task task-sd-calc-00001                                                                                      
Creating standard deviation task task-sd-calc-00002                                                                                      
Creating standard deviation task task-sd-calc-00003                                                                                      
Creating standard deviation task task-sd-calc-00004                                                                                      
Creating standard deviation task task-sd-calc-00005                                                                                      
Creating standard deviation task task-sd-calc-00006                                                                                      
Creating standard deviation task task-sd-calc-00007                                                                                      
Creating standard deviation task task-sd-calc-00008                                                                                      
Creating standard deviation task task-sd-calc-00009                                                                                      
Creating similarity task task-sims-calc-00000                                                                                            
Creating similarity task task-sims-calc-00001                                                                                            
Creating similarity task task-sims-calc-00002                                                                                            
Creating similarity task task-sims-calc-00003                                                                                            
Creating similarity task task-sims-calc-00004                                                                                            
Creating similarity task task-sims-calc-00005                                                                                            
Creating similarity task task-sims-calc-00006                                                                                            
Creating similarity task task-sims-calc-00007                                                                                            
Creating similarity task task-sims-calc-00008                                                                                            
Creating similarity task task-sims-calc-00009                                                                                            
Creating analog selection task task-analog-select-00000                                                                                  
Creating analog selection task task-analog-select-00001                                                                                  
Creating analog selection task task-analog-select-00002                                                                                  
Creating analog selection task task-analog-select-00003                                                                                  
Creating analog selection task task-analog-select-00004                                                                                  
Creating analog selection task task-analog-select-00005                                                                                  
Creating analog selection task task-analog-select-00006                                                                                  
Creating analog selection task task-analog-select-00007                                                                                  
Creating analog selection task task-analog-select-00008                                                                                  
Creating analog selection task task-analog-select-00009                                                                                  
2019-02-05 16:12:21,363: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Workflow assigned to Application Manager
Stages have multiple tasks. Run tasks in parallel.
2019-02-05 16:12:23,944: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15 (default, Jan 11 2019, 15:22:07) [GCC 7.3.0]
2019-02-05 16:12:23,945: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 42256/MainThread
2019-02-05 16:12:23,945: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : Created WFProcessor object: wfprocessor.0000
2019-02-05 16:12:23,951: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting resource request submission
2019-02-05 16:12:27,231: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
2019-02-05 16:12:27,233: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Resource request submission successful.. waiting for pilot to go Active
2019-02-05 16:12:27,233: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING
2019-02-05 16:12:34,531: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_ACTIVE_PENDING
2019-02-05 16:14:36,630: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: FAILED
2019-02-05 16:14:36,630: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed
2019-02-05 16:14:36,687: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Pilot is now active
2019-02-05 16:14:36,687: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting synchronizer thread
2019-02-05 16:14:36,688: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : synchronizer thread started
2019-02-05 16:14:36,688: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting WFProcessor process from AppManager
2019-02-05 16:14:36,690: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : Starting WFprocessor process
2019-02-05 16:14:36,698: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15 (default, Jan 11 2019, 15:22:07) [GCC 7.3.0]
2019-02-05 16:14:36,700: radical.entk.task_manager.0000: MainProcess
  pid/tid: 42256/MainThread                                                                                                              
2019-02-05 16:14:36,700: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : WFprocessor started
2019-02-05 16:14:36,701: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Starting dequeue-thread
2019-02-05 16:14:36,701: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Dequeue thread started
2019-02-05 16:14:36,702: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Starting enqueue-thread
2019-02-05 16:14:36,703: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : enqueue-thread started
2019-02-05 16:14:38,013: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received pipeline.0000 with state SCHEDULING
2019-02-05 16:14:38,014: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found pipeline pipeline.0000, state SCHEDULING, completed False
2019-02-05 16:14:38,297: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of pipeline.0000 to new state SCHEDULING successful
2019-02-05 16:14:38,462: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Created task manager object: task_manager.0000
2019-02-05 16:14:38,463: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process from AppManager
2019-02-05 16:14:38,464: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process
2019-02-05 16:14:38,469: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting heartbeat thread
2019-02-05 16:14:38,472: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating WFprocessor
2019-02-05 16:14:38,473: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating enqueue-thread
2019-02-05 16:14:38,474: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Task Manager process started
2019-02-05 16:14:38,674: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0000 with state SCHEDULING
2019-02-05 16:14:38,675: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-02-05 16:14:38,930: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0000 to new state SCHEDULING successful
2019-02-05 16:14:39,301: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-02-05 16:14:39,334: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SCHEDULING
2019-02-05 16:14:39,563: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0000 to new state SCHEDULING successful
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 263, in _process_tasks
    umgr = rp.UnitManager(session=rmgr._session)
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 102, in __init__                                                                                                 
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/read_json.py", line 26, in read_json                                                                                                    
IOError: [Errno 2] No such file or directory: '/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/pilot/configs/umgr_default.json'

The radical pilot sandbox folder should be accessible. It is at /glade/u/home/wuh20/scratch/radical.pilot.sandbox. Please let me know how I can assist the debugging process.

Thank you very much!

vivek-bala commented 5 years ago

Hey Weiming, can you run radical-pilot-fetch-logfiles after sourcing the virtualenv from the same directory as where you have the EnTK script please? You will have a folder of the format re.session.* which you can zip and upload here.

Weiming-Hu commented 5 years ago

Here you are. re.session.cheyenne1.wuh20.017933.0000.tar.gz

Thank you

andre-merzky commented 5 years ago

I know this one! :-)

IOError: [Errno 2] No such file or directory: '/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/pilot/configs/umgr_default.json'

This looks like an incomplete RP installation. Some versions of pip / setuptools don't install our json config files. How was RP installed in this virtualenv? Can you please check if there are any json files in venv/lib/python2.7/site-packages/radical/pilot/configs ?
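
That check can also be scripted. The following is a minimal sketch, assuming the virtualenv layout shown in the traceback (venv/lib/python2.7/site-packages). The helper name list_json_configs is made up for illustration and is not part of RP.

```python
import glob
import os


def list_json_configs(site_packages):
    """Return the sorted names of *.json files in radical/pilot/configs.

    Returns None if the configs directory itself does not exist, so a
    missing directory can be told apart from an empty one.
    """
    cfg_dir = os.path.join(site_packages, "radical", "pilot", "configs")
    if not os.path.isdir(cfg_dir):
        return None
    pattern = os.path.join(cfg_dir, "*.json")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


if __name__ == "__main__":
    # Assumed location, relative to the directory holding the virtualenv.
    found = list_json_configs("venv/lib/python2.7/site-packages")
    if found is None:
        print("configs directory is missing")
    else:
        print("%d json files, umgr_default.json present: %s"
              % (len(found), "umgr_default.json" in found))
```

Running it once right after installation and once right after a failed run would show whether the file was there at the moment of the error.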

Weiming-Hu commented 5 years ago

This is some information that might be helpful:

> pip --version
pip 19.0.1 from /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/pip (python 2.7)
> python --version
Python 2.7.15

I have a message during installation of EnTK:

radical-pilot 0.50.21 has requirement netifaces==0.10.4, but you'll have netifaces 0.10.9 which is incompatible.

And every time after I log onto Cheyenne, I need to reinstall radical.entk otherwise I'm not able to run entk.

Here are the files:

> ls venv/lib/python2.7/site-packages/radical/pilot/configs/
agent_cray_aprun.json  agent_osg.json   __init__.pyc           resource_chameleon.json  resource_epsrc.json   resource_iu.json     resource_lumc.json  resource_nersc.json  resource_radical.json  resource_vtarc_dt.json  session_default.json
agent_cray.json        agent_rhea.json  pmgr_default.json      resource_das4.json   resource_fub.json     resource_local.json  resource_ncar.json  resource_ornl.json   resource_rice.json     resource_xsede.json     umgr_default.json
agent_default.json     __init__.py  resource_aliases.json  resource_das5.json   resource_futuregrid.json  resource_lrz.json    resource_ncsa.json  resource_osg.json    resource_stfc.json     resource_yale.json

andre-merzky commented 5 years ago

Wait, now I am confused: your ls lists umgr_default.json, but the error explicitly said "No such file or directory". So, which one is it? The installation is a red herring then, as the file seems to be there. Is that a permission issue? Was that a file system issue? The latter would likely mean that it's not reproducible - does the problem persist?

The netifaces warning can be ignored for now - but thanks for mentioning it, I'll look into that...

What could trigger the need for re-installing EnTK?

Weiming-Hu commented 5 years ago

Hi Andre. I'm afraid the issue is strange. I ran the script again and got the same missing-file error, even though the file actually exists.

OK. This is the weird part. After I got the error message and terminated EnTK, the radical.pilot package was gone, literally. It looks like it gets deleted somehow during the run, because I can no longer run even this simple command:

> entk-version 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/__init__.py", line 4, in <module>
    from radical.entk.pipeline.pipeline import Pipeline
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/pipeline/pipeline.py", line 1, in <module>
    import radical.utils as ru
ImportError: No module named utils

This is what I meant when I said that I need to reinstall radical.entk. If I reinstall radical.entk, the missing files are installed again. And I guess if I then run the script, I'm back at the beginning of this loop.
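
One way to pin down exactly which files disappear during a run is to record the file tree before and after. This is only a sketch: snapshot and diff_snapshots are hypothetical helper names, and the directory to watch (for example venv/lib/python2.7/site-packages/radical) is an assumption based on the paths in the tracebacks.

```python
import os


def snapshot(root):
    """Map each file under root (as a path relative to root) to its size."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                files[os.path.relpath(full, root)] = os.path.getsize(full)
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return files


def diff_snapshots(before, after):
    """Return (removed, added, changed) lists of relative paths."""
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(path for path in set(before) & set(after)
                     if before[path] != after[path])
    return removed, added, changed
```

Calling snapshot() on the package directory before starting runme.py and again after the failure, then passing both results to diff_snapshots(), lists what was removed without relying on memory of earlier ls output.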

Please let me know if we need a live help session here because I feel the problem is getting complicated. Thank you very much.

vivek-bala commented 5 years ago

This is indeed strange. I just made some fixes in RP (https://github.com/radical-cybertools/radical.pilot/pull/1808), and we have a working RP again on Cheyenne. I'll give Weiming's script a try now.

andre-merzky commented 5 years ago

Thanks Vivek - I don't have a useful suggestion myself, really.

vivek-bala commented 5 years ago

Hey Weiming, I tested your script on Cheyenne as well. It's executing all 3 stages. The tasks in the third stage fail with "the option '--mapping-txt' is required but missing". I'll leave that to you to debug. You can install the fix/cheyenne branch of radical pilot in your virtualenv on Cheyenne and it should be back to normal.

Weiming-Hu commented 5 years ago

This is awesome. I will give this a try ASAP.

vivek-bala commented 5 years ago

Actually, the script has completed only once in about 5 runs. I am not sure where the error is, but the pilot seems to fail while the first stage is executing. Same script, no change from the successful run. I can't locate the error and get the impression that the job got killed by the system(?). @andre-merzky could you take a look at the log files please?

Please see the debug logs. log_files.zip

vivek-bala commented 5 years ago

@Weiming-Hu feel free to give it a try, let's see how it behaves for you.

vivek-bala commented 5 years ago

@andre-merzky From the debug.log file (with saga verbose msgs):

2019-02-07 18:00:11,557: radical.saga.pty    : pmgr.0000.launching.0           : Thread-2       : DEBUG   : read : [  128] [  226] (    Job_Name = pilot.0000\n    job_state = F\n    ctime = Thu Feb  7 17:58:12 2019\n    exec_host = r3i0n23/0*36+r3i1n11/0*36\n    mtime = Thu Feb  7 17:59:31 2019\n    stime = Thu Feb  7 17:58:19 2019\n    Exit_status = 265\n)
2019-02-07 18:00:11,557: radical.saga.pty    : pmgr.0000.launching.0           : Thread-2       : DEBUG   : read : [  128] [   10] (PROMPT-0->)
2019-02-07 18:00:11,557: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : DEBUG   : check state: F
2019-02-07 18:00:11,558: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : DEBUG   : use   state: Done
2019-02-07 18:00:11,558: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-2       : INFO    : Job monitoring thread updating Job [pbspro://localhost/]-[4275618] (old state: Running, new state: Failed)

vivek-bala commented 5 years ago

Based on this, exit code 265 means signal 9. So, the job is being killed (note: not the rp client -- probably not any process limits).
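
The arithmetic behind that reading, as a small sketch. It assumes the PBS Pro convention that an Exit_status of 256 or more encodes 256 plus the number of the signal that killed the job. decode_pbs_exit_status is a hypothetical helper, not a PBS or RADICAL API.

```python
def decode_pbs_exit_status(status):
    """Split a PBS Pro Exit_status into a (kind, value) pair.

    Assumption: values of 256 or more mean the job was killed by
    signal number (status - 256). Anything else is returned unchanged.
    """
    if status >= 256:
        return ("signal", status - 256)
    return ("exit_code", status)


if __name__ == "__main__":
    # Exit_status from the qstat output above. Signal 9 is SIGKILL on Linux.
    print(decode_pbs_exit_status(265))
```

A SIGKILL cannot be caught by the job, which is consistent with the pilot leaving no error message of its own.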

Weiming-Hu commented 5 years ago

I'm afraid the problem persists and I still get the error that umgr_default.json is missing. Here is a detailed breakdown of what I did to help you reproduce what I have.

First, I load the python module after logging onto Cheyenne:

> module purge
> module load python/2.7.15 gnu/7.2.0 hdf5/1.10.1
> module ls
Currently Loaded Modules:
  1) python/2.7.15   2) gnu/7.2.0   3) hdf5/1.10.1

Then, I build a virtual env and install radical.entk, pyyaml, and netcdf.

> virtualenv venv
> source venv/bin/activate
> HDF5_INCDIR=/glade/u/apps/ch/opt/hdf5/1.10.1/gnu/7.1.0/include/ HDF5_LIBDIR=/glade/u/apps/ch/opt/hdf5/1.10.1/gnu/7.1.0/lib pip install pyyaml netCDF4 radical.entk
... [Ignore log file]

And I checked that umgr_default.json file exists.

> file venv/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py
venv/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py: Python script, ASCII text executable

And entk works:

> entk-version 
0.7.11

It should be properly set up by now. Now run the script.

> python runme.py --wcfg workflow_cfg_cheyenne.yml --rcfg resource_cfg_cheyenne.yml
... [Ignore some logs]
2019-02-08 15:38:10,254: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed
... [Ignore some more logs]
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 263, in _process_tasks
    umgr = rp.UnitManager(session=rmgr._session)
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 102, in __init__
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/read_json.py", line 26, in read_json
IOError: [Errno 2] No such file or directory: '/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/pilot/configs/umgr_default.json'

Weird! It does not find the file. If I look into the folder, the file is no longer there.

> ls venv/lib/python2.7/site-packages/radical/entk/execman/rp                
__init__.py  __init__.pyc  resource_manager.py  resource_manager.pyc  task_manager.py  task_manager.pyc  task_processor.py  task_processor.pyc

And entk is no longer working.

> entk-version 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/__init__.py", line 4, in <module>
    from radical.entk.pipeline.pipeline import Pipeline
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/pipeline/pipeline.py", line 1, in <module>
    import radical.utils as ru
ImportError: No module named utils

I'm sorry for being verbose. But this is indeed so strange for me.

vivek-bala commented 5 years ago

Just an update: I have sent an email to Cheyenne support to find out why the jobs are being killed.

vivek-bala commented 5 years ago

@Weiming-Hu The set of commands I use to install:

module load python/2.7.15
virtualenv $HOME/ve
source $HOME/ve/bin/activate
pip install radical.entk
pip install netcdf4

Could you give that a try? Also please check that there are no conflicting settings in your bashrc file.

Weiming-Hu commented 5 years ago

I don't have a .bashrc file, only a .bash_profile. In it I set the RADICAL_PILOT_DBURL and RADICAL_ENTK_VERBOSE variables.

Weiming-Hu commented 5 years ago

I have to install pyyaml before I can run the script.

Unfortunately, the same problem persists. Same error message.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/glade/u/home/wuh20/ve/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 263, in _process_tasks
    umgr = rp.UnitManager(session=rmgr._session)
  File "/glade/u/home/wuh20/ve/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 102, in __init__
  File "/glade/u/home/wuh20/ve/lib/python2.7/site-packages/radical/utils/read_json.py", line 26, in read_json
IOError: [Errno 2] No such file or directory: '/glade/u/home/wuh20/ve/lib/python2.7/site-packages/radical/pilot/configs/umgr_default.json'

vivek-bala commented 5 years ago

I believe Weiming can now also reproduce the error that I am seeing.

I confirmed that Cheyenne is killing our jobs. It seems to be because of memory exhaustion by the tasks on the compute nodes. See response from Cheyenne admin:

Apparently your jobs are killed due to memory exhaustion in the nodes in which they were running.
Out of the 23 jobs that you ran over last 5 days, 4 were successful and 19 failed. The ones that
are failed are due to memory exhaustion according to our logs. But I am little suspicious about our
data as all the pieces do not seem consistent. Is there any way you can share your job suite to me
for myself to submit your jobs and dig into more details ? If yes, please point me to your
directory location where your jobs, scripts etc are there.

I am trying to prepare a single job that can reproduce the error. @Weiming-Hu do you have any profiles or jobs that you ran on Cheyenne? If you already have them, I can use the same to reproduce the error.

andre-merzky commented 5 years ago

@Weiming-Hu , can you please try the following:

1. install the RCT stack as usual in your VE
2. run chmod -R a-w ve/lib/python2.7/site-packages/radical
3. run your application

I hope this triggers a write error for whatever is purging your installation...
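
For anyone who prefers to do the write-protection step from Python, here is a rough equivalent of the chmod -R a-w command above. It is only a sketch: the function name `remove_write_bits` is made up for illustration, and symbolic links are skipped on the assumption that their targets should be left alone.

```python
import os
import stat

# Write permission for user, group and others (the "a-w" in chmod terms).
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def remove_write_bits(root):
    """Clear all write permission bits on root and everything below it."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # os.chmod would follow the link, so skip it
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            os.chmod(path, mode & ~WRITE_BITS)
    # Finally protect the top-level directory itself.
    os.chmod(root, stat.S_IMODE(os.lstat(root).st_mode) & ~WRITE_BITS)
```

Usage would be `remove_write_bits("ve/lib/python2.7/site-packages/radical")`, run from the directory that contains the VE.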

vivek-bala commented 5 years ago

@andre-merzky The error with the missing JSON file is now resolved. I had to comment out a PATH setting in his bash_profile. Also, there might have been an error with the netifaces installation, which resolved when upgrading.

andre-merzky commented 5 years ago

@vivek-bala @Weiming-Hu : if memory usage is a problem, you can try adding ulimit -s 1024 to your bashrc. Python by default creates threads with 8MB stack size - the above command would limit that to 1MB (ulimit -s takes its value in kilobytes). I am not sure if the stack size causes memory issues, but I have seen systems where that's the case.
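
As an in-process alternative to the ulimit setting, the standard library's threading.stack_size() can request a smaller stack for threads created afterwards. The sketch below is an illustration only: the wrapper `run_with_stack_size` is a hypothetical helper, and whether a 1 MB stack is sufficient for the RCT threads is an assumption that would need testing.

```python
import threading


def run_with_stack_size(target, stack_bytes=1024 * 1024):
    """Run target() in a new thread created with the requested stack size."""
    # stack_size() affects threads created after the call and returns the
    # previous setting (0 means "platform default").
    previous = threading.stack_size(stack_bytes)
    try:
        result = []
        worker = threading.Thread(target=lambda: result.append(target()))
        worker.start()
        worker.join()
        return result[0]
    finally:
        threading.stack_size(previous)  # restore the earlier setting
```

Unlike ulimit -s, this only affects threads started by the calling Python process, not child processes or the main thread.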

andre-merzky commented 5 years ago

How did the $PATH setting affect file persistence / deletion? This is funny :-)

vivek-bala commented 5 years ago

Cool, will try that as well. I think we are also reading ~GB of data in our tasks, so I will test if that is the reason as well. (the tasks start but don't complete)

vivek-bala commented 5 years ago

How did the $PATH setting affect file persistence / deletion? This is funny :-)

I don't know really. I also used the fix/cheyenne branch instead of master. Didn't dig into what the actual source is, I simply repeated the steps that I did to create my VE. You want me to find out? (no promises :-P )

andre-merzky commented 5 years ago

Yes, that would be great. Don't get lost in that part of course - but I do remember reports in the past about disappearing installations, so this is not a totally isolated incident. If you happen to be able to reproduce this outside of Cheyenne, I'd be happy to pick this up myself! Thanks :-)

vivek-bala commented 5 years ago

Update: The original issue has been fixed for Weiming. We have been able to run tasks from the three stages on Cheyenne. The tasks themselves initially failed because multiple tasks (each reading ~20-30GB of data) were running on a single compute node and exhausted its memory. This has been fixed by keeping 1-2 tasks per compute node.
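
The limit of 1-2 tasks per node follows from simple arithmetic, sketched below. The function `max_tasks_per_node` is hypothetical, and the node memory and reserve used in the example are assumptions for illustration, not values reported in this thread.

```python
def max_tasks_per_node(node_mem_gb, task_mem_gb, reserve_gb=0.0):
    """Largest number of tasks whose combined memory fits on one node."""
    if task_mem_gb <= 0:
        raise ValueError("task_mem_gb must be positive")
    usable = node_mem_gb - reserve_gb  # leave room for the OS and the agent
    return max(int(usable // task_mem_gb), 0)


if __name__ == "__main__":
    # Assumed: 64 GB per node, 4 GB held back, tasks needing 30 GB each.
    print(max_tasks_per_node(64, 30, reserve_gb=4))
```

The per-task figure should be the peak resident memory of a task, which can be larger than the amount of data it reads.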

Status: The tasks themselves do not complete due to an error from the MPI layer on Cheyenne.

MPT: shepherd terminated: r11i3n25 - job aborting

We have a ticket with the help desk to resolve this.

Weiming-Hu commented 5 years ago

Thank you, Vivek. Let me know when I should give it a try.

vivek-bala commented 5 years ago

Hey Weiming, you can give the scripts a try now. There were some typos in the script that caused the MPI shepherd termination. It should be resolved now.

Weiming-Hu commented 5 years ago

Now it seems that we have a slightly different situation. The pilot has completed, but the final message is in red:

2019-02-17 19:58:30,732: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0027 with state EXECUTED
2019-02-17 19:58:30,955: radical.entk.task_manager.0000: task-manager                    : umgr.0000.idler._state_pull_cb: INFO    : Transition of task.0027 to new state EXECUTED successful
2019-02-17 19:58:30,956: radical.entk.task_manager.0000: task-manager                    : umgr.0000.idler._state_pull_cb: INFO    : Pushed task task.0027 with state EXECUTED to completed queue re.session.cheyenne6.wuh20.017945.0000-completedq-1
2019-02-17 19:58:31,046: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0024 with state FAILED
2019-02-17 19:58:31,252: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Transition of task.0024 to new state FAILED successful
2019-02-17 19:58:31,412: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Got finished task task.0027 from queue
2019-02-17 19:58:31,671: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0027 with state DEQUEUEING
2019-02-17 19:58:31,889: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Transition of task.0027 to new state DEQUEUEING successful
2019-02-17 19:58:32,296: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0027 with state DEQUEUED
2019-02-17 19:58:32,525: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Transition of task.0027 to new state DEQUEUED successful

...

2019-02-17 20:00:21,183: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: DONE
2019-02-17 20:00:21,183: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has completed

There is also no output data, and there are no unit*** folders.
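
The excerpt above already shows task.0024 arriving with state FAILED, so it may help to list every task whose last reported state is FAILED. The following is a minimal sketch that relies only on the "Received task.NNNN with state X" wording visible in these log lines; the function names are hypothetical and this is not an EnTK utility.

```python
import re

# Matches the appmanager messages shown above, e.g.
# "... INFO    : Received task.0024 with state FAILED"
RECEIVED = re.compile(r"Received (task\.\d+) with state (\w+)")


def last_states(lines):
    """Map each task uid to the last state reported for it in the log."""
    states = {}
    for line in lines:
        match = RECEIVED.search(line)
        if match:
            states[match.group(1)] = match.group(2)
    return states


def failed_tasks(lines):
    """Return the sorted uids of tasks whose last reported state is FAILED."""
    return sorted(uid for uid, state in last_states(lines).items()
                  if state == "FAILED")
```

With the EnTK output saved to a file, `failed_tasks(open(logfile))` would give the uids to look up in the pilot sandbox.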

mturilli commented 5 years ago

Mostly addressed; issues remain in the last stage with the thread number configuration.

Weiming-Hu commented 5 years ago

Issue resolved in commit f73b197.

Weiming-Hu commented 5 years ago

Hi. I am again getting errors related to this thread.

2019-03-19 11:21:58,557: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: ERROR   : Unknown error in synchronizer: Expected (base) type(s) <type 'str'>, but got <type 'NoneType'>.. 
 Terminating thread
Traceback (most recent call last):
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 888, in _synchronizer
    task_update(msg, '%s-sync-to-cb' % self._sid, props.correlation_id, mq_channel)
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 641, in task_update
    completed_task.from_dict(msg['object'])
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/task/task.py", line 810, in from_dict
    actual_type=type(d['executable']))
TypeError: Expected (base) type(s) <type 'str'>, but got <type 'NoneType'>.

This looks like a problem that we used to have. Does anybody have an idea how to solve this?

Here is my environment.

(venv) wuh20@cheyenne2:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> entk-version 
0.7.15
(venv) wuh20@cheyenne2:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> radical-stack 

  python               : 2.7.15
  pythonpath           : 
  virtualenv           : /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv

  radical.entk         : 0.7.15
  radical.pilot        : 0.50.21-v0.50.21-2-gbfedd8f@fix-cheyenne
  radical.utils        : 0.50.3
  saga                 : 0.50.5

Thank you

Weiming-Hu commented 5 years ago

Looks like even if I revert to the old version and old code, I still get errors.

2019-03-20 14:00:25,335: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: ERROR   : Unknown error in synchronizer: Expected (base) type(s) <type 'list'>, but got <type 'unicode'>.. 
 Terminating thread
Traceback (most recent call last):
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv-old/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 907, in _synchronizer
    task_update(msg, '%s-sync-to-enq' % self._sid, props.correlation_id, mq_channel)
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv-old/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 641, in task_update
    completed_task.from_dict(msg['object'])
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv-old/lib/python2.7/site-packages/radical/entk/task/task.py", line 778, in from_dict
    actual_type=type(d['executable']))
TypeError: Expected (base) type(s) <type 'list'>, but got <type 'unicode'>.
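
Going by the two tracebacks, the newer install expects the task's executable to be a string while venv-old expects a list, so a value written for one version fails the type check in the other. A script-side guard could catch this before the task is submitted. The sketch below is an assumption-laden illustration: `normalize_executable` is a hypothetical helper, not part of the EnTK API, and it does not explain why the first error saw None.

```python
try:
    string_types = (str, unicode)  # Python 2, as used in this thread
except NameError:
    string_types = (str,)          # Python 3


def normalize_executable(value, as_list):
    """Return value as a string or a list, rejecting unset values early."""
    if value is None:
        raise ValueError("task executable is not set")
    if isinstance(value, string_types):
        return [value] if as_list else value
    if isinstance(value, (list, tuple)):
        if as_list:
            return list(value)
        if len(value) == 1:
            return value[0]
        raise ValueError("expected exactly one executable, got %d" % len(value))
    raise TypeError("unsupported executable type: %r" % type(value))
```

Calling it with as_list=True or as_list=False, depending on the installed version, makes the mismatch fail in the user script with a readable message instead of inside the synchronizer thread.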

Weiming-Hu commented 5 years ago

This has been resolved. @vivek-bala is going to make a hot release for this issue. Waiting on Vivek to close this issue.

mturilli commented 5 years ago

Resolved.

radical-collaboration / hpc-workflows

EnTK failed on Cheyenne #83