Closed Weiming-Hu closed 5 years ago
I have observed that when a task takes a relatively long time to finish (10 ~ 15 minutes), the transition will fail. That's is why I'm creating increased the number of tasks trying to make each task smaller and faster, but it still fails during the second stage, which is similarity computation.
Hey Weiming, can you point me to the latest EnTK script that you are using?
This is the script I used.
I added a new argument called run-serially
in to configuration file. If this is used, only one task is added to a stage to assure that tasks are run serially. I didn't limit the threads because each task is a multi-threaded program. I though if I limit threads, programs are forced to use only one thread. Please let me know if this is a proper practice.
Thank you.
Thanks! I'm checking it out now. Can you add the entire verbose message log please? The source of the error might be before the messages that you have pasted above.
I don't have the error log anymore since I closed the terminal. Do you want me to run it again?
Yes, that would be helpful.
Can I direct the log into a file?
Here is the output.
(venv) wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node$ python runme.py
2019-01-25 10:17:54,917: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:17:54,917: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : pid/tid: 5518/MainThread
2019-01-25 10:17:54,918: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Application Manager initialized
2019-01-25 10:17:55,221: radical.entk.task_processor: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:17:55,221: radical.entk.task_processor: MainProcess : MainThread : INFO : pid/tid: 5518/MainThread
2019-01-25 10:17:55,225: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:17:55,225: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : pid/tid: 5518/MainThread
2019-01-25 10:17:55,226: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Created resource manager object: resource_manager.0000
2019-01-25 10:17:55,226: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Resource description validated
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201801.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201802.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201801.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201802.nc Done!
Creating standard deviation task task-sd-calc-00000
Creating standard deviation task task-sd-calc-00001
Creating standard deviation task task-sd-calc-00002
Creating standard deviation task task-sd-calc-00003
Creating similarity task task-sims-calc-00000
Creating similarity task task-sims-calc-00001
Creating similarity task task-sims-calc-00002
Creating similarity task task-sims-calc-00003
Creating analog selection task task-analog-select-00000
Creating analog selection task task-analog-select-00001
Creating analog selection task task-analog-select-00002
Creating analog selection task task-analog-select-00003
2019-01-25 10:17:55,645: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Workflow assigned to Application Manager
Each stage only has one task. Run tasks in serial.
2019-01-25 10:17:57,652: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:17:57,652: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : pid/tid: 5518/MainThread
2019-01-25 10:17:57,653: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : Created WFProcessor object: wfprocessor.0000
2019-01-25 10:17:57,769: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting resource request submission
2019-01-25 10:18:00,795: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
2019-01-25 10:18:00,796: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Resource request submission successful.. waiting for pilot to go Active
2019-01-25 10:18:00,798: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_LAUNCHING
2019-01-25 10:18:07,444: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_ACTIVE_PENDING
2019-01-25 10:18:30,919: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.idler._state_pull_cb: INFO : Pilot pilot.0000 state: PMGR_ACTIVE
2019-01-25 10:18:30,948: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Pilot is now active
2019-01-25 10:18:30,949: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting synchronizer thread
2019-01-25 10:18:30,949: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : synchronizer thread started
2019-01-25 10:18:30,950: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting WFProcessor process from AppManager
2019-01-25 10:18:30,952: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : Starting WFprocessor process
2019-01-25 10:18:30,962: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : WFprocessor started
2019-01-25 10:18:30,963: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Starting dequeue-thread
2019-01-25 10:18:30,964: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Starting enqueue-thread
2019-01-25 10:18:30,964: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-25 10:18:30,964: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Dequeue thread started
2019-01-25 10:18:30,964: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : pid/tid: 5518/MainThread
2019-01-25 10:18:30,964: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : enqueue-thread started
2019-01-25 10:18:32,013: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received pipeline.0000 with state SCHEDULING
2019-01-25 10:18:32,014: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found pipeline pipeline.0000, state SCHEDULING, completed False
2019-01-25 10:18:32,259: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of pipeline.0000 to new state SCHEDULING successful
2019-01-25 10:18:32,539: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0000 with state SCHEDULING
2019-01-25 10:18:32,540: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-01-25 10:18:32,547: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Created task manager object: task_manager.0000
2019-01-25 10:18:32,547: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting task manager process from AppManager
2019-01-25 10:18:32,548: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Starting task manager process
2019-01-25 10:18:32,552: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Starting heartbeat thread
2019-01-25 10:18:32,557: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Task Manager process started
2019-01-25 10:18:32,689: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of stage.0000 to new state SCHEDULING successful
2019-01-25 10:18:33,068: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SCHEDULING
2019-01-25 10:18:33,220: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:18:33,261: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of task.0000 to new state SCHEDULING successful
2019-01-25 10:18:33,595: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SCHEDULED
2019-01-25 10:18:33,834: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of task.0000 to new state SCHEDULED successful
2019-01-25 10:18:33,865: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:18:33,865: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: h
ttp://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
"MongoClient opened before fork. Create MongoClient only "
2019-01-25 10:18:34,121: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0000 with state SCHEDULED
2019-01-25 10:18:34,122: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-01-25 10:18:34,406: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of stage.0000 to new state SCHEDULED successful
2019-01-25 10:18:35,436: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SUBMITTING
2019-01-25 10:18:35,582: radical.entk.task_manager.0000: task-manager : Thread-1 : INFO : Transition of task.0000 to new state SUBMITTING successful
2019-01-25 10:18:35,963: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SUBMITTED
2019-01-25 10:18:36,226: radical.entk.task_manager.0000: task-manager : Thread-1 : INFO : Transition of task.0000 to new state SUBMITTED successful
2019-01-25 10:19:03,352: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:19:03,353: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:19:03,585: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:19:03,586: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:19:33,486: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:19:33,487: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:19:33,880: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:19:33,881: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:20:03,620: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:20:03,621: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:20:03,903: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:20:03,904: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:20:33,753: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:20:33,755: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:20:33,903: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:20:33,904: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:21:03,888: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:21:03,890: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:21:04,198: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:21:04,199: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:21:34,023: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:21:34,023: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:21:34,212: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:21:34,213: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:21:46,317: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state EXECUTED
2019-01-25 10:21:46,543: radical.entk.task_manager.0000: task-manager : umgr.0000.subscriber._state_sub_cb: INFO : Transition of task.0000 to new state EXECUTED successful
2019-01-25 10:21:46,544: radical.entk.task_manager.0000: task-manager : umgr.0000.subscriber._state_sub_cb: INFO : Pushed task task.0000 with state EXECUTED to completed queue re.session.sapphire.geog.psu.edu.wuh20.017921.0003-completedq-1
2019-01-25 10:21:46,763: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Got finished task task.0000 from queue
2019-01-25 10:21:47,106: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state DEQUEUEING
2019-01-25 10:21:47,335: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Transition of task.0000 to new state DEQUEUEING successful
2019-01-25 10:21:47,632: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state DEQUEUED
2019-01-25 10:21:47,907: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Transition of task.0000 to new state DEQUEUED successful
2019-01-25 10:21:48,160: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state DONE
2019-01-25 10:21:48,337: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Transition of task.0000 to new state DONE successful
2019-01-25 10:21:48,687: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0000 with state DONE
2019-01-25 10:21:48,687: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-01-25 10:21:48,908: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Transition of stage.0000 to new state DONE successful
2019-01-25 10:21:49,082: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0001 with state SCHEDULING
2019-01-25 10:21:49,082: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-01-25 10:21:49,339: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of stage.0001 to new state SCHEDULING successful
2019-01-25 10:21:49,609: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0001 with state SCHEDULING
2019-01-25 10:21:49,768: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of task.0001 to new state SCHEDULING successful
2019-01-25 10:21:50,109: radical.entk.task_manager.0000: task-manager : Thread-1 : ERROR : Transition of task.0001 to SUBMITTING state failed, error: (-1, "error(104, 'Connection reset by peer')")
Traceback (most recent call last):
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/init_transition.py", line 27, in transition
local_prof=profiler)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/sync_initiator.py", line 24, in sync_with_master
properties=pika.BasicProperties(correlation_id=corr_id)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2120, in basic_publish
mandatory, immediate)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2207, in publish
self._flush_output()
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
*waiters)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 477, in _flush_output
result.reason_text)
ConnectionClosed: (-1, "error(104, 'Connection reset by peer')")
Traceback (most recent call last):
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 303, in _process_tasks
logger=logger)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/utils/init_transition.py", line 34, in transition
obj.state = old_state
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/task/task.py", line 450, in state
raise TypeError(expected_type=str, actual_type=type(val))
NameError: global name 'val' is not defined
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 323, in _process_tasks
raise EnTKError(ex)
EnTKError: global name 'val' is not defined
2019-01-25 10:21:50,135: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0001 with state SCHEDULED
2019-01-25 10:21:50,341: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of task.0001 to new state SCHEDULED successful
2019-01-25 10:21:50,662: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0001 with state SCHEDULED
2019-01-25 10:21:50,663: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-01-25 10:21:50,914: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of stage.0001 to new state SCHEDULED successful
2019-01-25 10:22:04,156: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:22:04,157: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:22:04,483: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:22:04,483: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:22:34,291: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:22:34,292: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:22:34,473: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:22:34,474: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:23:04,425: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:23:04,426: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:23:04,753: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:23:04,753: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:23:34,560: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:23:34,560: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:23:34,724: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:23:34,724: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:24:04,694: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:24:04,695: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:24:04,980: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:24:04,980: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:24:34,828: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:24:34,829: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:24:35,250: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:24:35,251: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:25:04,962: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:25:04,964: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:25:05,209: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:25:05,210: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:25:35,099: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:25:35,101: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:25:35,471: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:25:35,471: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:26:05,234: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:26:05,235: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:26:05,460: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:26:05,460: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:26:35,367: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:26:35,368: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:26:35,738: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:26:35,739: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:27:05,501: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:27:05,502: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:27:05,708: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:27:05,708: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:27:35,637: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:27:35,638: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:27:35,980: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-25 10:27:35,981: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
2019-01-25 10:28:05,772: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-25 10:28:05,774: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-25 10:28:05,965: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
^C2019-01-25 10:52:50,250: radical.entk.wfprocessor.0000: wfprocessor : MainThread : ERROR : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel wfprocessor process gracefully...
2019-01-25 10:52:50,250: radical.entk.appmanager.0000: MainProcess : MainThread : ERROR : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
2019-01-25 10:52:50,250: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating WFprocessor
2019-01-25 10:52:50,250: radical.entk.task_manager.0000: task-manager : MainThread : ERROR : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel tmgr process gracefully...
2019-01-25 10:52:50,250: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Terminating enqueue-thread
2019-01-25 10:52:50,251: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Enqueue thread terminated
2019-01-25 10:52:50,538: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Terminating dequeue-thread
2019-01-25 10:52:50,593: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Terminated dequeue thread
2019-01-25 10:52:50,880: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : WFprocessor process terminated
Process wfprocessor:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 555, in _wfp
raise KeyboardInterrupt
KeyboardInterrupt
^CTraceback (most recent call last):
File "runme.py", line 533, in <module>
amgr.run()
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 456, in run
self._wfp.terminate_processor()
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 631, in terminate_processor
'Attempting to end WFprocessor... event: %s' % self._wfp_terminate.is_set())
File "/usr/lib/python2.7/multiprocessing/synchronize.py", line 306, in is_set
self._cond.acquire()
KeyboardInterrupt
Thanks Weiming! I'll take a look at all of this and get back to you.
New error at the saga level:
19-01-31 14:26:28,536: pmgr.0000.launching.0.child: pmgr.0000.launching.0 : MainThread : ERROR : bulk launc│···········································································································
h failed │···········································································································
Traceback (most recent call last): │···········································································································
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/radical/pilot/pmgr/launching/default.py", line 488, in work │···········································································································
self._start_pilot_bulk(resource, schema, pilots) │···········································································································
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/radical/pilot/pmgr/launching/default.py", line 697, in _start_pilot_bulk │···········································································································
if j.state == rs.FAILED: │···········································································································
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/saga/job/job.py", line 623, in get_state │···········································································································
return self._adaptor.get_state (ttype=ttype) │···········································································································
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function │···········································································································
return sync_function (self, *args, **kwargs) │···········································································································
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/saga/adaptors/shell/shell_job.py", line 1411, in get_state │···········································································································
stats = self.js._job_get_stats (self._id) │···········································································································
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-pa│···········································································································
ckages/saga/adaptors/shell/shell_job.py", line 760, in _job_get_stats │···········································································································
% (id, ret, out)) │···········································································································
NoSuccess: failed to get job stats for '[fork://localhost/]-[30438.0]': (127)(STATS 30438.0 │···········································································································
/bin/sh: 5: STATS: not found
Hey @andre-merzky , do you have any idea about this error? We encountered this error when executing locally.
Full logs: re.session.sapphire.geog.psu.edu.wuh20.017927.0002.zip
The error was persistent after removing $HOME/.radical and $HOME/.saga as well.
Hey @vivek-bala : can you please send me the stack info? Is this reproducible on another machine? Please rerun it with RADICAL_SAGA_PTY_VERBOSE=DEBUG
and attach the resulting log file. Thanks!
Hey @Weiming-Hu , can you give this a try please. You can get the stack info by executing radical-stack
.
How would I get the stack info?
By running radical-stack
in the command line after sourcing your virtual environment.
I got the following error messages.
$ export RADICAL_SAGA_PTY_VERBOSE=DEBUG
$ python runme.py
2019-01-31 15:17:17,454: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:17,454: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : pid/tid: 6271/MainThread
2019-01-31 15:17:17,454: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Application Manager initialized
2019-01-31 15:17:17,646: radical.entk.task_processor: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:17,646: radical.entk.task_processor: MainProcess : MainThread : INFO : pid/tid: 6271/MainThread
2019-01-31 15:17:17,650: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:17,650: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : pid/tid: 6271/MainThread
2019-01-31 15:17:17,650: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Created resource manager object: resource_manager.0000
2019-01-31 15:17:17,650: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Resource description validated
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201801.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201802.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201801.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201802.nc Done!
Creating standard deviation task task-sd-calc-00000
Creating standard deviation task task-sd-calc-00001
Creating standard deviation task task-sd-calc-00002
Creating standard deviation task task-sd-calc-00003
Creating similarity task task-sims-calc-00000
Creating similarity task task-sims-calc-00001
Creating similarity task task-sims-calc-00002
Creating similarity task task-sims-calc-00003
Creating analog selection task task-analog-select-00000
Creating analog selection task task-analog-select-00001
Creating analog selection task task-analog-select-00002
Creating analog selection task task-analog-select-00003
2019-01-31 15:17:17,659: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Workflow assigned to Application Manager
Each stage only has one task. Run tasks in serial.
2019-01-31 15:17:19,661: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:19,662: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : pid/tid: 6271/MainThread
2019-01-31 15:17:19,662: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : Created WFProcessor object: wfprocessor.0000
2019-01-31 15:17:19,774: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting resource request submission
2019-01-31 15:17:22,206: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
2019-01-31 15:17:22,207: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Resource request submission successful.. waiting for pilot to go Active
2019-01-31 15:17:22,209: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_LAUNCHING
2019-01-31 15:17:26,792: radical.saga.cpi : pmgr.0000.launching.0 : MainThread : ERROR : flushing failed
Traceback (most recent call last):
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 174, in flush
tmp = self.read (timeout=0.1)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 679, in read
% (e, self.tail))
NoSuccess: read from process failed '[Errno 5] Input/output error' : (wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node$ $$PROMPT-0->PROMPT-0->PROMPT-0->PID: 6587
PROMPT-0->
RUNNING
OK
6696.0
PROMPT-0->
OK
BULK COMPLETED
PROMPT-0->
trapped EXIT
cmd_quit called (1)PROMPT-1->
) (/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py +679 (read) : % (e, self.tail)))
2019-01-31 15:17:26,795: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: FAILED
2019-01-31 15:17:26,795: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has failed
2019-01-31 15:17:26,816: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Pilot is now active
2019-01-31 15:17:26,816: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting synchronizer thread
2019-01-31 15:17:26,817: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : synchronizer thread started
2019-01-31 15:17:26,817: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting WFProcessor process from AppManager
2019-01-31 15:17:26,819: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : Starting WFprocessor process
2019-01-31 15:17:26,831: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : WFprocessor started
2019-01-31 15:17:26,832: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Starting dequeue-thread
2019-01-31 15:17:26,832: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Dequeue thread started
2019-01-31 15:17:26,833: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-01-31 15:17:26,833: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Starting enqueue-thread
2019-01-31 15:17:26,833: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : pid/tid: 6271/MainThread
2019-01-31 15:17:26,835: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : enqueue-thread started
2019-01-31 15:17:27,884: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received pipeline.0000 with state SCHEDULING
2019-01-31 15:17:27,885: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found pipeline pipeline.0000, state SCHEDULING, completed False
2019-01-31 15:17:28,025: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of pipeline.0000 to new state SCHEDULING successful
2019-01-31 15:17:28,412: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0000 with state SCHEDULING
2019-01-31 15:17:28,412: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-01-31 15:17:28,419: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Created task manager object: task_manager.0000
2019-01-31 15:17:28,419: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting task manager process from AppManager
2019-01-31 15:17:28,420: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Starting task manager process
2019-01-31 15:17:28,424: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Starting heartbeat thread
2019-01-31 15:17:28,427: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating WFprocessor
2019-01-31 15:17:28,428: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Terminating enqueue-thread
2019-01-31 15:17:28,430: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Task Manager process started
2019-01-31 15:17:28,550: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of stage.0000 to new state SCHEDULING successful
2019-01-31 15:17:28,940: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SCHEDULING
2019-01-31 15:17:29,075: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of task.0000 to new state SCHEDULING successful
2019-01-31 15:17:29,089: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-01-31 15:17:29,468: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SCHEDULED
2019-01-31 15:17:29,629: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-01-31 15:17:29,638: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
"MongoClient opened before fork. Create MongoClient only "
2019-01-31 15:17:29,730: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of task.0000 to new state SCHEDULED successful
2019-01-31 15:17:29,995: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0000 with state SCHEDULED
2019-01-31 15:17:29,995: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-01-31 15:17:30,254: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of stage.0000 to new state SCHEDULED successful
2019-01-31 15:17:30,255: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Enqueue thread terminated
2019-01-31 15:17:30,518: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Terminating dequeue-thread
2019-01-31 15:17:30,559: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Terminated dequeue thread
2019-01-31 15:17:30,850: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating synchronizer thread
2019-01-31 15:17:31,178: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Synchronizer thread terminated
2019-01-31 15:17:31,178: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating task manager process
2019-01-31 15:17:59,221: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-01-31 15:17:59,222: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
^C2019-01-31 15:18:15,825: radical.entk.appmanager.0000: MainProcess : MainThread : ERROR : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
2019-01-31 15:18:15,825: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating WFprocessor
2019-01-31 15:18:15,825: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating task manager process
2019-01-31 15:18:16,144: radical.saga.cpi : pmgr.0000.launching.0 : MainThread : ERROR : flushing failed
Traceback (most recent call last):
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 174, in flush
tmp = self.read (timeout=0.1)
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py", line 679, in read
% (e, self.tail))
NoSuccess: read from process failed '[Errno 5] Input/output error' : (wuh20@sapphire:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node$ $$PROMPT-0->PROMPT-0->PROMPT-0->PID: 6587
PROMPT-0->
RUNNING
OK
6696.0
PROMPT-0->
OK
BULK COMPLETED
PROMPT-0->
trapped EXIT
cmd_quit called (1)PROMPT-1->
) (/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/saga/utils/pty_process.py +679 (read) : % (e, self.tail)))
^CTraceback (most recent call last):
File "runme.py", line 533, in <module>
amgr.run()
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 460, in run
self._task_manager.terminate_manager()
File "/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 300, in terminate_manager
self._tmgr_process.join()
File "/usr/lib/python2.7/multiprocessing/process.py", line 148, in join
res = self._popen.wait(timeout)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 154, in wait
return self.poll(0)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 135, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Thank you.
$ radical-stack
python : 2.7.15rc1
pythonpath :
virtualenv : /home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv
radical.entk : 0.7.11
radical.pilot : 0.50.21
radical.utils : 0.50.3
saga : 0.50.5
Thanks for the stack. I'm afraid I still don't see the log messages I am looking for. Could you please also set RADICAL_LOG_TGT=radical.log
to make sure those are actually recorded, and attach that file then, too? Thanks!
Hi. Sorry for my late update.
Below is the error message from console std:
$ python runme.py
2019-02-05 11:56:43,555: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-02-05 11:56:43,555: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : pid/tid: 31691/MainThread
2019-02-05 11:56:43,555: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Application Manager initialized
2019-02-05 11:56:43,862: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-02-05 11:56:43,863: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : pid/tid: 31691/MainThread
2019-02-05 11:56:43,863: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Created resource manager object: resource_manager.0000
2019-02-05 11:56:43,863: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Resource description validated
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201801.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/forecasts/201802.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201801.nc Done!
Processing /home/graduate/wuh20/exfat-hu/Data/2019_Hu_AnEn-bias-correction/observations/201802.nc Done!
Creating standard deviation task task-sd-calc-00000
Creating standard deviation task task-sd-calc-00001
Creating standard deviation task task-sd-calc-00002
Creating standard deviation task task-sd-calc-00003
Creating similarity task task-sims-calc-00000
Creating similarity task task-sims-calc-00001
Creating similarity task task-sims-calc-00002
Creating similarity task task-sims-calc-00003
Creating analog selection task task-analog-select-00000
Creating analog selection task task-analog-select-00001
Creating analog selection task task-analog-select-00002
Creating analog selection task task-analog-select-00003
2019-02-05 11:56:44,168: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Workflow assigned to Application Manager
Each stage only has one task. Run tasks in serial.
2019-02-05 11:56:46,774: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-02-05 11:56:46,775: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : pid/tid: 31691/MainThread
2019-02-05 11:56:46,775: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : Created WFProcessor object: wfprocessor.0000
2019-02-05 11:56:46,858: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting resource request submission
2019-02-05 11:56:50,219: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
2019-02-05 11:56:50,220: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Resource request submission successful.. waiting for pilot to go Active
2019-02-05 11:56:50,222: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_LAUNCHING
2019-02-05 11:56:54,355: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: FAILED
2019-02-05 11:56:54,356: radical.entk.resource_manager.0000: MainProcess : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has failed
2019-02-05 11:56:54,427: radical.entk.resource_manager.0000: MainProcess : MainThread : INFO : Pilot is now active
2019-02-05 11:56:54,428: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting synchronizer thread
2019-02-05 11:56:54,428: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : synchronizer thread started
2019-02-05 11:56:54,429: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting WFProcessor process from AppManager
2019-02-05 11:56:54,431: radical.entk.wfprocessor.0000: MainProcess : MainThread : INFO : Starting WFprocessor process
2019-02-05 11:56:54,441: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : WFprocessor started
2019-02-05 11:56:54,442: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Starting dequeue-thread
2019-02-05 11:56:54,442: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : python.interpreter version: 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
2019-02-05 11:56:54,443: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : pid/tid: 31691/MainThread
2019-02-05 11:56:54,443: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Dequeue thread started
2019-02-05 11:56:54,443: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Starting enqueue-thread
2019-02-05 11:56:54,445: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : enqueue-thread started
2019-02-05 11:56:55,584: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received pipeline.0000 with state SCHEDULING
2019-02-05 11:56:55,585: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found pipeline pipeline.0000, state SCHEDULING, completed False
2019-02-05 11:56:55,741: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of pipeline.0000 to new state SCHEDULING successful
2019-02-05 11:56:55,906: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Created task manager object: task_manager.0000
2019-02-05 11:56:55,906: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Starting task manager process from AppManager
2019-02-05 11:56:55,907: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Starting task manager process
2019-02-05 11:56:55,911: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Starting heartbeat thread
2019-02-05 11:56:55,914: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating WFprocessor
2019-02-05 11:56:55,916: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Terminating enqueue-thread
2019-02-05 11:56:55,916: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Task Manager process started
2019-02-05 11:56:56,157: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0000 with state SCHEDULING
2019-02-05 11:56:56,157: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-02-05 11:56:56,313: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of stage.0000 to new state SCHEDULING successful
2019-02-05 11:56:56,636: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
2019-02-05 11:56:56,729: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SCHEDULING
2019-02-05 11:56:56,885: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of task.0000 to new state SCHEDULING successful
2019-02-05 11:56:57,220: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Received heartbeat request
2019-02-05 11:56:57,221: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Sent heartbeat response
/home/graduate/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/local/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: h
ttp://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
"MongoClient opened before fork. Create MongoClient only "
2019-02-05 11:56:57,301: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SCHEDULED
2019-02-05 11:56:57,458: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of task.0000 to new state SCHEDULED successful
2019-02-05 11:56:57,873: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received stage.0000 with state SCHEDULED
2019-02-05 11:56:57,874: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Found parent pipeline: pipeline.0000
2019-02-05 11:56:58,030: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Transition of stage.0000 to new state SCHEDULED successful
2019-02-05 11:56:58,030: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Enqueue thread terminated
2019-02-05 11:56:58,317: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Terminating dequeue-thread
2019-02-05 11:56:58,398: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Terminated dequeue thread
2019-02-05 11:56:58,666: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating synchronizer thread
2019-02-05 11:56:58,730: radical.entk.appmanager.0000: MainProcess : synchronizer-thread: INFO : Received task.0000 with state SUBMITTING
2019-02-05 11:56:58,952: radical.entk.task_manager.0000: task-manager : Thread-1 : INFO : Transition of task.0000 to new state SUBMITTING successful
2019-02-05 11:56:59,159: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Synchronizer thread terminated
2019-02-05 11:56:59,159: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating task manager process
2019-02-05 11:57:26,780: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Received heartbeat response
2019-02-05 11:57:26,781: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
^C2019-02-05 13:07:40,012: radical.entk.appmanager.0000: MainProcess : MainThread : ERROR : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
2019-02-05 13:07:40,017: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating WFprocessor
2019-02-05 13:07:40,017: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating task manager process
This is the log file.
Thank you
Related to #83
Taking this offline since we already have a working version from the commit f73b197.
I've observed consistent failures of EnTK on local machine. Here is the error message.
Since the sandbox files are large than 10 MB and I can't upload them here, I have sent the files to @vivek-bala's email.
Thank you very much!
Weiming