radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694
5 stars 0 forks source link

EnTK failed on local machine during transition of stages #80

Closed Weiming-Hu closed 5 years ago

Weiming-Hu commented 5 years ago

I've observed consistent failures of EnTK on local machine. Here is the error message.

2019-01-23 21:47:53,111: radical.entk.task_manager.0000: task-manager                    : Thread-1       : ERROR   : Transition of task.0042 to SUBMITTING state failed, error: (-1, "error(104, 'Connecti$
n reset by peer')")                                                                                                                                                                                         
Traceback (most recent call last):                                                                                                                                                                          
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/init_transition.py", line 27, in transition          
    local_prof=profiler)                                                                                                                                                                                    
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/sync_initiator.py", line 24, in sync_with_master     
    properties=pika.BasicProperties(correlation_id=corr_id)                                                                                                                                                 
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2120, in basic_publish      
    mandatory, immediate)                                                                                                                                                                                   
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2207, in publish            
    self._flush_output()                                                                                                                                                                                    
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output      
    *waiters)                                                                                                                                                                                               
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 477, in _flush_output       
    result.reason_text)                                                                                                                                                                                     
ConnectionClosed: (-1, "error(104, 'Connection reset by peer')")                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                          
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 303, in _process_tasks   
    logger=logger)                                                                                                                                                                                          
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/init_transition.py", line 34, in transition          
    obj.state = old_state                                                                                                                                                                                   
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/task/task.py", line 450, in state                          
    raise TypeError(expected_type=str, actual_type=type(val))                                                                                                                                               
NameError: global name 'val' is not defined                                                                                                                                                                 

Exception in thread Thread-1:                                                                                                                                                                               
Traceback (most recent call last):                                                                                                                                                                          
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner                                                                                                                                    
    self.run()                                                                                                                                                                                              
  File "/usr/lib/python2.7/threading.py", line 754, in run                                                                                                                                                  
    self.__target(*self.__args, **self.__kwargs)                                                                                                                                                            
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 323, in _process_tasks   
    raise EnTKError(ex)                                                                                                                                                                                     
EnTKError: global name 'val' is not defined

Since the sandbox files are large than 10 MB and I can't upload them here, I have sent the files to @vivek-bala's email.

Thank you very much!

Weiming

Weiming-Hu commented 5 years ago

I have observed that when a task takes a relatively long time to finish (10 ~ 15 minutes), the transition will fail. That's is why I'm creating increased the number of tasks trying to make each task smaller and faster, but it still fails during the second stage, which is similarity computation.

vivek-bala commented 5 years ago

Hey Weiming, can you point me to the latest EnTK script that you are using?

Weiming-Hu commented 5 years ago

This is the script I used.

I added a new argument called run-serially in to configuration file. If this is used, only one task is added to a stage to assure that tasks are run serially. I didn't limit the threads because each task is a multi-threaded program. I though if I limit threads, programs are forced to use only one thread. Please let me know if this is a proper practice.

Thank you.

vivek-bala commented 5 years ago

Thanks! I'm checking it out now. Can you add the entire verbose message log please? The source of the error might be before the messages that you have pasted above.

Weiming-Hu commented 5 years ago

I don't have the error log anymore since I closed the terminal. Do you want me to run it again?

vivek-bala commented 5 years ago

Yes, that would be helpful.

Weiming-Hu commented 5 years ago

Can I direct the log into a file?

Weiming-Hu commented 5 years ago

Here is the output.

(venv) wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node$ python runme.py                                                                                                 
2019-01-25 10:17:54,917: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:17:54,917: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 5518/MainThread
2019-01-25 10:17:54,918: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Application Manager initialized         
2019-01-25 10:17:55,221: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:17:55,221: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    :                      pid/tid: 5518/MainThread              
2019-01-25 10:17:55,225: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:17:55,225: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 5518/MainThread
2019-01-25 10:17:55,226: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Created resource manager object: resource_manager.0000
2019-01-25 10:17:55,226: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Resource description validated
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201801.nc Done!                                             
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201802.nc Done!                                             
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201801.nc Done!                                          
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201802.nc Done!                                          
Creating standard deviation task task-sd-calc-00000                                                                                                                                                                               
Creating standard deviation task task-sd-calc-00001                                                                                                                                                                        
Creating standard deviation task task-sd-calc-00002                                                                                              
Creating standard deviation task task-sd-calc-00003                                                                                                                                                                       
Creating similarity task task-sims-calc-00000                                                                                                   
Creating similarity task task-sims-calc-00001                                                                                                    
Creating similarity task task-sims-calc-00002                                                                                                                 
Creating similarity task task-sims-calc-00003                                                                                                                                                    
Creating analog selection task task-analog-select-00000                                                                                                                                                                                                              
Creating analog selection task task-analog-select-00001                                                                                                    
Creating analog selection task task-analog-select-00002                                                                                                         
Creating analog selection task task-analog-select-00003                                                                                                                        
2019-01-25 10:17:55,645: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Workflow assigned to Application Manager  
Each stage only has one task. Run tasks in serial.                                                                                                                           
2019-01-25 10:17:57,652: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:17:57,652: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 5518/MainThread                        
2019-01-25 10:17:57,653: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : Created WFProcessor object: wfprocessor.0000
2019-01-25 10:17:57,769: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting resource request submission    
2019-01-25 10:18:00,795: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING  
2019-01-25 10:18:00,796: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Resource request submission successful.. waiting for pilot to go Active
2019-01-25 10:18:00,798: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING              
2019-01-25 10:18:07,444: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_ACTIVE_PENDING                                                                             
2019-01-25 10:18:30,919: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.idler._state_pull_cb: INFO    : Pilot pilot.0000 state: PMGR_ACTIVE
2019-01-25 10:18:30,948: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Pilot is now active                                                            
2019-01-25 10:18:30,949: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting synchronizer thread                                                                               
2019-01-25 10:18:30,949: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : synchronizer thread started           
2019-01-25 10:18:30,950: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting WFProcessor process from AppManager                                  
2019-01-25 10:18:30,952: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : Starting WFprocessor process         
2019-01-25 10:18:30,962: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : WFprocessor started                                                               
2019-01-25 10:18:30,963: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Starting dequeue-thread               
2019-01-25 10:18:30,964: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Starting enqueue-thread                                                          
2019-01-25 10:18:30,964: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:18:30,964: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Dequeue thread started
2019-01-25 10:18:30,964: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 5518/MainThread
2019-01-25 10:18:30,964: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : enqueue-thread started
2019-01-25 10:18:32,013: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received pipeline.0000 with state SCHEDULING
2019-01-25 10:18:32,014: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found pipeline pipeline.0000, state SCHEDULING, completed False
2019-01-25 10:18:32,259: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of pipeline.0000 to new state SCHEDULING successful
2019-01-25 10:18:32,539: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0000 with state SCHEDULING
2019-01-25 10:18:32,540: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-01-25 10:18:32,547: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Created task manager object: task_manager.0000
2019-01-25 10:18:32,547: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process from AppManager
2019-01-25 10:18:32,548: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process
2019-01-25 10:18:32,552: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting heartbeat thread
2019-01-25 10:18:32,557: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Task Manager process started
2019-01-25 10:18:32,689: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0000 to new state SCHEDULING successful
2019-01-25 10:18:33,068: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SCHEDULING
2019-01-25 10:18:33,220: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:18:33,261: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0000 to new state SCHEDULING successful
2019-01-25 10:18:33,595: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SCHEDULED
2019-01-25 10:18:33,834: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0000 to new state SCHEDULED successful
2019-01-25 10:18:33,865: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:18:33,865: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: h
ttp://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
2019-01-25 10:18:34,121: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0000 with state SCHEDULED
2019-01-25 10:18:34,122: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-01-25 10:18:34,406: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0000 to new state SCHEDULED successful
2019-01-25 10:18:35,436: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SUBMITTING
2019-01-25 10:18:35,582: radical.entk.task_manager.0000: task-manager                    : Thread-1       : INFO    : Transition of task.0000 to new state SUBMITTING successful
2019-01-25 10:18:35,963: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SUBMITTED
2019-01-25 10:18:36,226: radical.entk.task_manager.0000: task-manager                    : Thread-1       : INFO    : Transition of task.0000 to new state SUBMITTED successful
2019-01-25 10:19:03,352: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:19:03,353: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:19:03,585: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:19:03,586: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:19:33,486: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:19:33,487: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:19:33,880: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:19:33,881: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:20:03,620: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:20:03,621: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:20:03,903: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:20:03,904: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:20:33,753: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:20:33,755: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:20:33,903: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:20:33,904: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:21:03,888: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:21:03,890: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:21:04,198: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:21:04,199: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:21:34,023: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:21:34,023: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:21:34,212: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:21:34,213: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:21:46,317: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state EXECUTED
2019-01-25 10:21:46,543: radical.entk.task_manager.0000: task-manager                    : umgr.0000.subscriber._state_sub_cb: INFO    : Transition of task.0000 to new state EXECUTED successful
2019-01-25 10:21:46,544: radical.entk.task_manager.0000: task-manager                    : umgr.0000.subscriber._state_sub_cb: INFO    : Pushed task task.0000 with state EXECUTED to completed queue re.session.sapphire.geog.psu.edu.wuh20.017921.0003-completedq-1
2019-01-25 10:21:46,763: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Got finished task task.0000 from queue
2019-01-25 10:21:47,106: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state DEQUEUEING
2019-01-25 10:21:47,335: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Transition of task.0000 to new state DEQUEUEING successful
2019-01-25 10:21:47,632: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state DEQUEUED
2019-01-25 10:21:47,907: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Transition of task.0000 to new state DEQUEUED successful
2019-01-25 10:21:48,160: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state DONE
2019-01-25 10:21:48,337: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Transition of task.0000 to new state DONE successful
2019-01-25 10:21:48,687: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0000 with state DONE
2019-01-25 10:21:48,687: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-01-25 10:21:48,908: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Transition of stage.0000 to new state DONE successful
2019-01-25 10:21:49,082: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0001 with state SCHEDULING
2019-01-25 10:21:49,082: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-01-25 10:21:49,339: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0001 to new state SCHEDULING successful
2019-01-25 10:21:49,609: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0001 with state SCHEDULING
2019-01-25 10:21:49,768: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0001 to new state SCHEDULING successful
2019-01-25 10:21:50,109: radical.entk.task_manager.0000: task-manager                    : Thread-1       : ERROR   : Transition of task.0001 to SUBMITTING state failed, error: (-1, "error(104, 'Connection reset by peer')")
Traceback (most recent call last):
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/init_transition.py", line 27, in transition
    local_prof=profiler)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/sync_initiator.py", line 24, in sync_with_master
    properties=pika.BasicProperties(correlation_id=corr_id)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2120, in basic_publish
    mandatory, immediate)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2207, in publish
self._flush_output()
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
    *waiters)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 477, in _flush_output
    result.reason_text)
ConnectionClosed: (-1, "error(104, 'Connection reset by peer')")
Traceback (most recent call last):
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 303, in _process_tasks
    logger=logger)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/init_transition.py", line 34, in transition
    obj.state = old_state
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/task/task.py", line 450, in state
    raise TypeError(expected_type=str, actual_type=type(val))
NameError: global name 'val' is not defined

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 323, in _process_tasks
    raise EnTKError(ex)
EnTKError: global name 'val' is not defined

2019-01-25 10:21:50,135: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0001 with state SCHEDULED
2019-01-25 10:21:50,341: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0001 to new state SCHEDULED successful
2019-01-25 10:21:50,662: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0001 with state SCHEDULED
2019-01-25 10:21:50,663: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-01-25 10:21:50,914: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0001 to new state SCHEDULED successful
2019-01-25 10:22:04,156: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:22:04,157: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:22:04,483: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:22:04,483: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:22:34,291: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:22:34,292: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:22:34,473: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:22:34,474: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:23:04,425: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:23:04,426: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:23:04,753: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:23:04,753: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:23:34,560: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:23:34,560: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:23:34,724: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:23:34,724: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:24:04,694: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:24:04,695: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:24:04,980: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:24:04,980: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:24:34,828: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:24:34,829: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:24:35,250: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:24:35,251: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:25:04,962: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:25:04,964: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:25:05,209: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:25:05,210: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:25:35,099: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:25:35,101: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:25:35,471: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:25:35,471: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:26:05,234: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:26:05,235: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:26:05,460: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:26:05,460: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:26:35,367: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:26:35,368: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:26:35,738: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:26:35,739: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:27:05,501: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:27:05,502: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:27:05,708: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:27:05,708: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:27:35,637: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:27:35,638: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:27:35,980: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-25 10:27:35,981: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-25 10:28:05,772: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-25 10:28:05,774: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-25 10:28:05,965: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
^C2019-01-25 10:52:50,250: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel wfprocessor process gracefully...
2019-01-25 10:52:50,250: radical.entk.appmanager.0000: MainProcess                     : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
2019-01-25 10:52:50,250: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating WFprocessor
2019-01-25 10:52:50,250: radical.entk.task_manager.0000: task-manager                    : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel tmgr process gracefully...
2019-01-25 10:52:50,250: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating enqueue-thread
2019-01-25 10:52:50,251: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Enqueue thread terminated
2019-01-25 10:52:50,538: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating dequeue-thread
2019-01-25 10:52:50,593: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Terminated dequeue thread
2019-01-25 10:52:50,880: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : WFprocessor process terminated
Process wfprocessor:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 555, in _wfp
    raise KeyboardInterrupt
KeyboardInterrupt
^CTraceback (most recent call last):
  File "runme.py", line 533, in <module>
    amgr.run()
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 456, in run
    self._wfp.terminate_processor()
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 631, in terminate_processor
    'Attempting to end WFprocessor... event: %s' % self._wfp_terminate.is_set())
  File "/usr/lib/python2.7/multiprocessing/synchronize.py", line 306, in is_set
    self._cond.acquire()
KeyboardInterrupt
vivek-bala commented 5 years ago

Thanks Weiming! I'll take a look at all of this and get back to you.

vivek-bala commented 5 years ago

New error at the saga level:

19-01-31 14:26:28,536: pmgr.0000.launching.0.child: pmgr.0000.launching.0           : MainThread     : ERROR   : bulk launc│···········································································································
h failed                                                                                                                     │···········································································································
Traceback (most recent call last):                                                                                           │···········································································································
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/radical/pilot/pmgr/launching/default.py", line 488, in work                                                           │···········································································································
    self._start_pilot_bulk(resource, schema, pilots)                                                                         │···········································································································
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/radical/pilot/pmgr/launching/default.py", line 697, in _start_pilot_bulk                                              │···········································································································
    if j.state == rs.FAILED:                                                                                                 │···········································································································
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/saga/job/job.py", line 623, in get_state                                                                              │···········································································································
    return self._adaptor.get_state (ttype=ttype)                                                                             │···········································································································
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function                                                           │···········································································································
    return sync_function (self, *args, **kwargs)                                                                             │···········································································································
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/saga/adaptors/shell/shell_job.py", line 1411, in get_state                                                            │···········································································································
    stats = self.js._job_get_stats (self._id)                                                                                │···········································································································
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/saga/adaptors/shell/shell_job.py", line 760, in _job_get_stats                                                        │···········································································································
    % (id, ret, out))                                                                                                        │···········································································································
NoSuccess: failed to get job stats for '[fork://localhost/]-[30438.0]': (127)(STATS 30438.0                                  │···········································································································
/bin/sh: 5: STATS: not found                                                     

Hey @andre-merzky , do you have any idea about this error? We encountered this error when executing locally.

Full logs: re.session.sapphire.geog.psu.edu.wuh20.017927.0002.zip

vivek-bala commented 5 years ago

The error was persistent after removing $HOME/.radical and $HOME/.saga as well.

andre-merzky commented 5 years ago

Hey @vivek-bala : can you please send me the stack info? Is this reproducible on another machine? Please rerun it with RADICAL_SAGA_PTY_VERBOSE=DEBUG and attach the resulting log file. Thanks!

vivek-bala commented 5 years ago

Hey @Weiming-Hu , can you give this a try please. You can get the stack info by executing radical-stack.

Weiming-Hu commented 5 years ago

How would I get the stack info?

vivek-bala commented 5 years ago

By running radical-stack in the command line after sourcing your virtual environment.

Weiming-Hu commented 5 years ago

I got the following error messages.

$ export RADICAL_SAGA_PTY_VERBOSE=DEBUG
$ python runme.py 
2019-01-31 15:17:17,454: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:17,454: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 6271/MainThread
2019-01-31 15:17:17,454: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Application Manager initialized
2019-01-31 15:17:17,646: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:17,646: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    :                      pid/tid: 6271/MainThread
2019-01-31 15:17:17,650: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:17,650: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 6271/MainThread
2019-01-31 15:17:17,650: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Created resource manager object: resource_manager.0000
2019-01-31 15:17:17,650: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Resource description validated
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201801.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201802.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201801.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201802.nc Done!
Creating standard deviation task task-sd-calc-00000
Creating standard deviation task task-sd-calc-00001
Creating standard deviation task task-sd-calc-00002
Creating standard deviation task task-sd-calc-00003
Creating similarity task task-sims-calc-00000
Creating similarity task task-sims-calc-00001
Creating similarity task task-sims-calc-00002
Creating similarity task task-sims-calc-00003
Creating analog selection task task-analog-select-00000
Creating analog selection task task-analog-select-00001
Creating analog selection task task-analog-select-00002
Creating analog selection task task-analog-select-00003
2019-01-31 15:17:17,659: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Workflow assigned to Application Manager
Each stage only has one task. Run tasks in serial.
2019-01-31 15:17:19,661: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:19,662: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 6271/MainThread
2019-01-31 15:17:19,662: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : Created WFProcessor object: wfprocessor.0000
2019-01-31 15:17:19,774: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting resource request submission
2019-01-31 15:17:22,206: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
2019-01-31 15:17:22,207: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Resource request submission successful.. waiting for pilot to go Active
2019-01-31 15:17:22,209: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING
2019-01-31 15:17:26,792: radical.saga.cpi    : pmgr.0000.launching.0           : MainThread     : ERROR   : flushing failed
Traceback (most recent call last):
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 174, in flush
    tmp = self.read (timeout=0.1)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 679, in read
    % (e, self.tail))
NoSuccess: read from process failed '[Errno 5] Input/output error' : (wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node$ $$PROMPT-0->PROMPT-0->PROMPT-0->PID: 6587
PROMPT-0->

RUNNING 
OK
6696.0
PROMPT-0->
OK
BULK COMPLETED
PROMPT-0->

trapped EXIT
cmd_quit called (1)PROMPT-1->
) (/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py +679 (read)  :  % (e, self.tail)))
2019-01-31 15:17:26,795: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: FAILED
2019-01-31 15:17:26,795: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed
2019-01-31 15:17:26,816: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Pilot is now active
2019-01-31 15:17:26,816: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting synchronizer thread
2019-01-31 15:17:26,817: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : synchronizer thread started
2019-01-31 15:17:26,817: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting WFProcessor process from AppManager
2019-01-31 15:17:26,819: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : Starting WFprocessor process
2019-01-31 15:17:26,831: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : WFprocessor started
2019-01-31 15:17:26,832: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Starting dequeue-thread
2019-01-31 15:17:26,832: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Dequeue thread started
2019-01-31 15:17:26,833: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:26,833: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Starting enqueue-thread
2019-01-31 15:17:26,833: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 6271/MainThread
2019-01-31 15:17:26,835: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : enqueue-thread started
2019-01-31 15:17:27,884: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received pipeline.0000 with state SCHEDULING
2019-01-31 15:17:27,885: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found pipeline pipeline.0000, state SCHEDULING, completed False
2019-01-31 15:17:28,025: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of pipeline.0000 to new state SCHEDULING successful
2019-01-31 15:17:28,412: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0000 with state SCHEDULING
2019-01-31 15:17:28,412: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-01-31 15:17:28,419: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Created task manager object: task_manager.0000
2019-01-31 15:17:28,419: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process from AppManager
2019-01-31 15:17:28,420: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process
2019-01-31 15:17:28,424: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting heartbeat thread
2019-01-31 15:17:28,427: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating WFprocessor
2019-01-31 15:17:28,428: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating enqueue-thread
2019-01-31 15:17:28,430: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Task Manager process started
2019-01-31 15:17:28,550: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0000 to new state SCHEDULING successful
2019-01-31 15:17:28,940: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SCHEDULING
2019-01-31 15:17:29,075: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0000 to new state SCHEDULING successful
2019-01-31 15:17:29,089: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-31 15:17:29,468: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SCHEDULED
2019-01-31 15:17:29,629: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-31 15:17:29,638: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
2019-01-31 15:17:29,730: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0000 to new state SCHEDULED successful
2019-01-31 15:17:29,995: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0000 with state SCHEDULED
2019-01-31 15:17:29,995: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-01-31 15:17:30,254: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0000 to new state SCHEDULED successful
2019-01-31 15:17:30,255: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Enqueue thread terminated
2019-01-31 15:17:30,518: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating dequeue-thread
2019-01-31 15:17:30,559: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Terminated dequeue thread
2019-01-31 15:17:30,850: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating synchronizer thread
2019-01-31 15:17:31,178: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Synchronizer thread terminated
2019-01-31 15:17:31,178: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating task manager process
2019-01-31 15:17:59,221: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-31 15:17:59,222: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
^C2019-01-31 15:18:15,825: radical.entk.appmanager.0000: MainProcess                     : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
2019-01-31 15:18:15,825: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating WFprocessor
2019-01-31 15:18:15,825: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating task manager process
2019-01-31 15:18:16,144: radical.saga.cpi    : pmgr.0000.launching.0           : MainThread     : ERROR   : flushing failed
Traceback (most recent call last):
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 174, in flush
    tmp = self.read (timeout=0.1)
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 679, in read
    % (e, self.tail))
NoSuccess: read from process failed '[Errno 5] Input/output error' : (wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node$ $$PROMPT-0->PROMPT-0->PROMPT-0->PID: 6587
PROMPT-0->

RUNNING 
OK
6696.0
PROMPT-0->
OK
BULK COMPLETED
PROMPT-0->

trapped EXIT
cmd_quit called (1)PROMPT-1->
) (/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py +679 (read)  :  % (e, self.tail)))
^CTraceback (most recent call last):
  File "runme.py", line 533, in <module>
    amgr.run()
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 460, in run
    self._task_manager.terminate_manager()
  File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 300, in terminate_manager
    self._tmgr_process.join()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 148, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
    return self.poll(0)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Weiming-Hu commented 5 years ago

Thank you.

$ radical-stack

  python               : 2.7.15rc1
  pythonpath           : 
  virtualenv           : /home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv

  radical.entk         : 0.7.11
  radical.pilot        : 0.50.21
  radical.utils        : 0.50.3
  saga                 : 0.50.5
andre-merzky commented 5 years ago

Thanks for the stack. I'm afraid I still don't see the log messages I am looking for. Could you please also set RADICAL_LOG_TGT=radical.log to make sure those are actually recorded, and attach that file then, too? Thanks!

Weiming-Hu commented 5 years ago

Hi. Sorry for my late update.

Below is the error message from console std:

$ python runme.py                                         
2019-02-05 11:56:43,555: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-02-05 11:56:43,555: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 31691/MainThread                                                           
2019-02-05 11:56:43,555: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Application Manager initialized
2019-02-05 11:56:43,862: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-02-05 11:56:43,863: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 31691/MainThread
2019-02-05 11:56:43,863: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Created resource manager object: resource_manager.0000
2019-02-05 11:56:43,863: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Resource description validated                          
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201801.nc Done!                                                                                    
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201802.nc Done!                                                                             
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201801.nc Done!                                                                                                       
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201802.nc Done!                                                                                           
Creating standard deviation task task-sd-calc-00000                      
Creating standard deviation task task-sd-calc-00001
Creating standard deviation task task-sd-calc-00002                      
Creating standard deviation task task-sd-calc-00003
Creating similarity task task-sims-calc-00000                            
Creating similarity task task-sims-calc-00001
Creating similarity task task-sims-calc-00002
Creating similarity task task-sims-calc-00003
Creating analog selection task task-analog-select-00000
Creating analog selection task task-analog-select-00001          
Creating analog selection task task-analog-select-00002
Creating analog selection task task-analog-select-00003                         
2019-02-05 11:56:44,168: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Workflow assigned to Application Manager
Each stage only has one task. Run tasks in serial.                       
2019-02-05 11:56:46,774: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-02-05 11:56:46,775: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 31691/MainThread
2019-02-05 11:56:46,775: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : Created WFProcessor object: wfprocessor.0000
2019-02-05 11:56:46,858: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting resource request submission
2019-02-05 11:56:50,219: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
2019-02-05 11:56:50,220: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Resource request submission successful.. waiting for pilot to go Active
2019-02-05 11:56:50,222: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING
2019-02-05 11:56:54,355: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: FAILED
2019-02-05 11:56:54,356: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed
2019-02-05 11:56:54,427: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : INFO    : Pilot is now active
2019-02-05 11:56:54,428: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting synchronizer thread
2019-02-05 11:56:54,428: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : synchronizer thread started
2019-02-05 11:56:54,429: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting WFProcessor process from AppManager
2019-02-05 11:56:54,431: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : INFO    : Starting WFprocessor process
2019-02-05 11:56:54,441: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : WFprocessor started
2019-02-05 11:56:54,442: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Starting dequeue-thread
2019-02-05 11:56:54,442: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-02-05 11:56:54,443: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    :                      pid/tid: 31691/MainThread
2019-02-05 11:56:54,443: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Dequeue thread started
2019-02-05 11:56:54,443: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Starting enqueue-thread
2019-02-05 11:56:54,445: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : enqueue-thread started                                                                                                   
2019-02-05 11:56:55,584: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received pipeline.0000 with state SCHEDULING                                                    
2019-02-05 11:56:55,585: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found pipeline pipeline.0000, state SCHEDULING, completed False
2019-02-05 11:56:55,741: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of pipeline.0000 to new state SCHEDULING successful
2019-02-05 11:56:55,906: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Created task manager object: task_manager.0000
2019-02-05 11:56:55,906: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process from AppManager
2019-02-05 11:56:55,907: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process
2019-02-05 11:56:55,911: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting heartbeat thread
2019-02-05 11:56:55,914: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating WFprocessor
2019-02-05 11:56:55,916: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating enqueue-thread
2019-02-05 11:56:55,916: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Task Manager process started
2019-02-05 11:56:56,157: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0000 with state SCHEDULING
2019-02-05 11:56:56,157: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-02-05 11:56:56,313: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0000 to new state SCHEDULING successful
2019-02-05 11:56:56,636: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-02-05 11:56:56,729: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SCHEDULING
2019-02-05 11:56:56,885: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0000 to new state SCHEDULING successful                        
2019-02-05 11:56:57,220: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request                 
2019-02-05 11:56:57,221: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response                                                   
/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: h
ttp://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe                                                         
  "MongoClient opened before fork. Create MongoClient only "
2019-02-05 11:56:57,301: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SCHEDULED                   
2019-02-05 11:56:57,458: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of task.0000 to new state SCHEDULED successful
2019-02-05 11:56:57,873: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received stage.0000 with state SCHEDULED                 
2019-02-05 11:56:57,874: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: pipeline.0000
2019-02-05 11:56:58,030: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0000 to new state SCHEDULED successful                   
2019-02-05 11:56:58,030: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Enqueue thread terminated
2019-02-05 11:56:58,317: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating dequeue-thread
2019-02-05 11:56:58,398: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Terminated dequeue thread
2019-02-05 11:56:58,666: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating synchronizer thread
2019-02-05 11:56:58,730: radical.entk.appmanager.0000: MainProcess                     : synchronizer-thread: INFO    : Received task.0000 with state SUBMITTING
2019-02-05 11:56:58,952: radical.entk.task_manager.0000: task-manager                    : Thread-1       : INFO    : Transition of task.0000 to new state SUBMITTING successful
2019-02-05 11:56:59,159: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Synchronizer thread terminated
2019-02-05 11:56:59,159: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating task manager process
2019-02-05 11:57:26,780: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-02-05 11:57:26,781: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
^C2019-02-05 13:07:40,012: radical.entk.appmanager.0000: MainProcess                     : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
2019-02-05 13:07:40,017: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating WFprocessor
2019-02-05 13:07:40,017: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating task manager process 

This is the log file.

radical.log

Thank you

mturilli commented 5 years ago

Related to #83

Weiming-Hu commented 5 years ago

Taking this offline since we already have a working version from the commit f73b197.