Closed euhruska closed 5 years ago
also, how do I fix this?
I am not sure where exactly the error is coming from. Can you link the script, specifically where you try to delete the record?
I deleted that directory manually. I wouldn't have done that if I had known it was important. The script is exactly the same as before. I just updated the rp environment. Should I create an empty record directory? I just don't want to lose the previous entk run with all the iterations already done.
Hmmm. This is coming from RP and I am not sure why that json file is required. Can you give me the version and/or branch of RP that causes this issue?
radical-stack
python : 2.7.14
pythonpath :
virtualenv : extasy11
radical.analytics : v0.45.2-102-gaec2e1d@devel
radical.entk : 0.7.6-0.7.6-1-g9a95cf49@fix-issue_extasy_96
radical.pilot : 0.50.8-v0.50.8-6-g6e36785f@devel
radical.utils : 0.50.1-v0.50.1-1-gce60836@devel
saga : 0.50.0-v0.50.0@devel
I tried to revert entk fix-issue_extasy_96, but that led only to:
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 557, in _wfp
raise EnTKError(text=ex)
TypeError: __init__() got an unexpected keyword argument 'text'
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 503, in _wfp
self._enqueue_thread.join()
AttributeError: 'NoneType' object has no attribute 'join'
and
2018-10-05 18:56:50,703: radical.entk.task_manager.0000: MainProcess : heartbeat : ERROR : Heartbeat failed with error: unsupported operand type(s) for +: 'float' and 'str'
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 145, in _heartbeat
mq_connection.sleep(self._hb_interval)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 771, in sleep
deadline = time.time() + duration
TypeError: unsupported operand type(s) for +: 'float' and 'str'
radical-stack
python : 2.7.14
pythonpath :
virtualenv : extasy11
radical.analytics : v0.45.2-102-gaec2e1d@devel
radical.entk : 0.7.4-0.7.4@HEAD-detached-at-af273638
radical.pilot : 0.50.8-v0.50.8-6-g6e36785f@devel
radical.utils : 0.50.1-v0.50.1-1-gce60836@devel
saga : 0.50.0-v0.50.0@devel
Currently I can't run anything because of this
Please unset RADICAL_PILOT_RECORD and RADICAL_RECORD.
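For a run launched from a Python script, the advice above can be sketched as follows. This is a minimal illustration, not RP's own API: the helper name clear_record_vars is hypothetical, and the variable list contains only the names mentioned in this thread (RADICAL_PILOT_RECORD_SESSION is the one the reporter later says was set). Whether RP reads other recording-related variables is not established here.

```python
import os

# Names taken from this thread only; this list is an assumption, not an
# authoritative inventory of what RADICAL-Pilot reads.
RECORD_VARS = (
    "RADICAL_PILOT_RECORD",
    "RADICAL_RECORD",
    "RADICAL_PILOT_RECORD_SESSION",
)


def clear_record_vars(environ=None):
    """Remove the recording-related variables and return what was removed."""
    environ = os.environ if environ is None else environ
    removed = {}
    for name in RECORD_VARS:
        if name in environ:
            removed[name] = environ.pop(name)
    return removed
```

Calling clear_record_vars() changes only the current process and the children it starts afterwards, so it would have to run before the session is created. In an interactive shell, the equivalent is to unset the same variables before starting the script.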
On Wed, Oct 3, 2018, 23:35 Eugen Hruska notifications@github.com wrote:
I deleted a directory called record and this error happened, is the record directory important?
2018-10-03 15:28:55,767: radical.entk.resource_manager.0000: MainProcess : MainThread : ERROR : Resource request submission failed
2018-10-03 15:28:55,767: radical.entk.appmanager.0000: MainProcess : MainThread : ERROR : Error in AppManager: [Errno 2] No such file or directory: 'record/pilot.0000.batch.000.json'
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 310, in run
self._resource_manager._submit_resource_request()
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 155, in _submit_resource_request
self._pilot = self._pmgr.submit_pilots(pdesc)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 514, in submit_pilots
% (self._session._rec, pilot.uid, self._rec_id))
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/utils/read_json.py", line 70, in write_json
with open (filename, 'w') as f :
IOError: [Errno 2] No such file or directory: 'record/pilot.0000.batch.000.json'
2018-10-03 15:28:55,769: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating WFprocessor
2018-10-03 15:28:55,770: radical.entk.wfprocessor.0000: MainProcess : MainThread : DEBUG : WFprocessor process already terminated
Error: [Errno 2] No such file or directory: 'record/pilot.0000.batch.000.json'
Traceback (most recent call last):
File "extasy_tica3.py", line 322, in <module>
appman.run()
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 310, in run
self._resource_manager._submit_resource_request()
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 155, in _submit_resource_request
self._pilot = self._pmgr.submit_pilots(pdesc)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 514, in submit_pilots
% (self._session._rec, pilot.uid, self._rec_id))
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/utils/read_json.py", line 70, in write_json
with open (filename, 'w') as f :
IOError: [Errno 2] No such file or directory: 'record/pilot.0000.batch.000.json'
View it on GitHub: https://github.com/radical-collaboration/extasy-grlsd/issues/98
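The IOError in the traceback above is what open(filename, 'w') raises when the parent directory of the target file does not exist. Opening for writing creates the file but not the directory. The sketch below reproduces that mechanism in a temporary directory. The helper write_text is hypothetical and is not how radical.utils writes JSON; it only illustrates why a deleted record directory leads to Errno 2. The advice given in this thread is to unset the recording variables, not to recreate the directory.

```python
import errno
import os
import tempfile


def write_text(path, text):
    """Write text to path, creating the parent directory if it is missing."""
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    with open(path, "w") as f:
        f.write(text)


# Work in a scratch directory so nothing real is touched.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "record", "pilot.0000.batch.000.json")

try:
    open(target, "w")  # the parent directory "record" does not exist yet
    failed = False
except IOError as e:  # IOError is an alias of OSError on Python 3
    failed = (e.errno == errno.ENOENT)

# With the parent directory created first, the same write succeeds.
write_text(target, "{}")
```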
Eugen, I am not sure why you have to revert EnTK. Can you elaborate on that, please?
TypeError: unsupported operand type(s) for +: 'float' and 'str'
This error is the one fixed in fix/issue_extasy_96 in EnTK.
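The heartbeat traceback fails at time.time() + duration, which means the interval reached pika's sleep() as a string instead of a number. The thread does not show where the string came from or what the change in fix/issue_extasy_96 looks like, so the following is only a sketch of the failure and of one common remedy (casting the interval to float before use). The function deadline_after is a hypothetical name.

```python
import time


def deadline_after(duration):
    """Return an absolute deadline; accepts a number or a numeric string."""
    return time.time() + float(duration)


# Same operand types as in the traceback: float + str raises TypeError.
try:
    time.time() + "10"
    raised = False
except TypeError:
    raised = True

# After the cast, the addition works as intended.
deadline = deadline_after("10")
```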
I reverted EnTK only because I didn't know how else to fix the record issue. I have now unset RADICAL_PILOT_RECORD_SESSION (I had never set RADICAL_PILOT_RECORD or RADICAL_RECORD). That helped, but I now get an error in the remote pilot.0000/agent_0.err:
2018-10-07 16:43:59,120: radical.saga.cpi : MainProcess : MainThread : INFO : pid/tid: 4901/MainThread
2018-10-07 16:43:59,121: radical.saga.api : MainProcess : MainThread : INFO : default ssh key at /u/sciteam/hruska/.ssh/id_rsa
2018-10-07 16:43:59,144: radical.saga.cpi : MainProcess : MainThread : INFO : init SSH context for key at '/u/sciteam/hruska/.ssh/id_rsa' done
2018-10-07 16:43:59,145: radical.saga : MainProcess : MainThread : DEBUG : default context [saga.adaptor.ssh ] : {'LifeTime' : '-1', 'Type' : 'ssh', 'UserCert' : '/u/sciteam/hruska/.ssh/id_rsa.pub', 'UserKey' : '/u/sciteam/hruska/.ssh/id_rsa'}
2018-10-07 16:43:59,153: radical.saga : MainProcess : MainThread : WARNING : cannot initialize context - no session: {'LifeTime' : '-1', 'Type' : 'ssh', 'UserCert' : '/u/sciteam/hruska/.ssh/id_rsa.pub', 'UserKey' : '/u/sciteam/hruska/.ssh/id_rsa'}
Resource temporarily unavailable (src/thread.cpp:191)
Traceback (most recent call last):
File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/bin/radical-pilot-agent", line 71, in <module>
bootstrap_3(sys.argv[1])
File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/bin/radical-pilot-agent", line 42, in bootstrap_3
if agent_name == 'agent_0': agent = rpa.Agent_0(agent_name)
File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/agent_0.py", line 95, in __init__
session = rp_Session(cfg=session_cfg, uid=self._sid)
File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/session.py", line 271, in __init__
self._bridges = ruc.start_bridges (self._cfg, self, self._log)
File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 155, in start_bridges
bridge = rpu_Pubsub(session, bname, rpu_PUBSUB_BRIDGE, bcfg_clone)
File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/pubsub.py", line 139, in __init__
self.start()
File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/process.py", line 504, in start
(self._ru_childname, msg))
RuntimeError: child state.pubsub.bridge.0000.child failed to come up [s]
I have now unset RADICAL_PILOT_RECORD_SESSION, I haven't set before RADICAL_PILOT_RECORD and RADICAL_RECORD
I can't really tell how the recording code paths have been triggered without those variables being set, sorry. If you have a full set of log files for a session with that problem, I would like to have a look. Either way, I am glad this seems resolved.
that helped but I get an error in remote pilot.0000/agent_0.err:
What are the client and target machines for this run? Can you please provide a set of log files? Thanks!
The client is leonardo at Rice and target is bluewaters. I tried fetching the logfiles, but I got:
radical-pilot-fetch-logfiles re.session.leonardo.rice.edu.eh22.017811.0002/
2018-10-08 09:27:29,804: radical.pilot.utils : MainProcess : MainThread : INFO : python.interpreter version: 2.7.14 | packaged by conda-forge | (default, Mar 30 2018, 18:16:04) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
2018-10-08 09:27:29,804: radical.pilot.utils : MainProcess : MainThread : INFO : radical.pilot.utils version: 0.50.8-v0.50.8-6-g6e36785f@devel
2018-10-08 09:27:29,804: radical.pilot.utils : MainProcess : MainThread : INFO : pid/tid: 13106/MainThread
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy11/bin/radical-pilot-fetch-logfiles", line 88, in <module>
rpu.fetch_logfiles(sid=sid, dburl=dburl, src=src, tgt=tgt, access=access, skip_existing=skip)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/utils/session.py", line 254, in fetch_logfiles
json_docs = get_session_docs(db, sid)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/utils/db_utils.py", line 88, in get_session_docs
raise ValueError ('no session %s in db (was `cleanup` disabled on `session.close()`?)' % sid)
ValueError: no session re.session.leonardo.rice.edu.eh22.017811.0002/ in db (was `cleanup` disabled on `session.close()`?)
Here is the remote directory: re.session.leonardo.rice.edu.eh22.017811.0002.zip
Can you ensure that the RADICAL_PILOT_DBURL env variable was set in the same session where you are executing radical-pilot-fetch-logfiles?
Yes, $RADICAL_PILOT_DBURL was set.
It works now, though I'm not sure what changed: re.session.leonardo.rice.edu.eh22.017811.0002-client.zip
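Two things can be checked before calling radical-pilot-fetch-logfiles. One is that RADICAL_PILOT_DBURL is set. The other is visible in the failing command above: the session ID was passed with a trailing slash, and the ValueError reports it in that form. The thread does not establish that the slash caused the failed lookup, so treat that as an assumption. Both helper names below are hypothetical and are not part of RP.

```python
import os


def normalize_sid(arg):
    """Strip trailing slashes, e.g. those added by shell tab completion."""
    return arg.rstrip("/")


def dburl_is_set(environ=None):
    """Report whether RADICAL_PILOT_DBURL is present and non-empty."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("RADICAL_PILOT_DBURL"))
```

For example, normalize_sid("re.session.leonardo.rice.edu.eh22.017811.0002/") returns the ID without the slash, which can then be passed to the fetch tool.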
I am working on the child startup error on BW - this problem came up for multiple users, and is likely related to a change in system configuration (see https://github.com/radical-collaboration/extasy-bpti/issues/9)
Any update?
Unfortunately, only that fixing this will take a while, but I hope to get BW working again this week.
Can you please give the RP release 0.50.12 a try? Thanks
Installed on the client, no problem.
On the remote I installed it with: sh radical-pilot-create-static-ve ve.ncsa.bw_aprun.0.50.12 bw
But agent_0.log got stuck at:
2018-10-12 17:12:54,287: agent_0 : MainProcess : MainThread : INFO : get msg: alive
2018-10-12 17:12:54,287: agent_0 : MainProcess : MainThread : DEBUG : ru_initialize_common (NOOP)
2018-10-12 17:12:54,287: agent_0 : MainProcess : MainThread : DEBUG : ru_initialize_parent (NOOP)
2018-10-12 17:12:54,287: agent_0 : MainProcess : MainThread : DEBUG : child thread agent_0.idler._check_units_cb started
2018-10-12 17:12:54,288: agent_0 : MainProcess : MainThread : DEBUG : agent_0 registered idler agent_0.idler._check_units_cb
2018-10-12 17:12:54,288: agent_0 : MainProcess : agent_0.idler._check_units_cb: DEBUG : alive check in term
2018-10-12 17:12:54,288: agent_0 : MainProcess : MainThread : DEBUG : process class started (no child)
2018-10-12 17:12:54,315: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-12 17:12:55,368: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-12 17:12:56,396: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-12 17:12:57,424: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-12 17:12:58,454: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-12 17:12:59,482: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
2018-10-12 17:13:00,510: agent_0 : MainProcess : agent_0.idler._check_units_cb: INFO : units pulled: 0
Here are the remote logs: re.session.leonardo.rice.edu.eh22.017816.0002.zip
units pulled: 0
Indeed, the pilot is not able to pull any units. I suspect that either the units fail on the client side, or EnTK is not submitting any?
I have rerun the command several times: 3 times I got units pulled: 0 and 3 times everything worked (units were pulled and executed correctly). I've seen failures even after the first success. I literally didn't change anything, so that's weird. At least I got some more iterations.
On the client side I see units pulled: 0 in umgr.0000.log
client logs:
re.session.leonardo.rice.edu.eh22.017816.0002-client.zip
If the run worked, you will see at least one log entry with units pulled: n where n > 0, but there will still be many pulls which do not get new units. The umgr entries show the units being pulled back from the agent to the client side.
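The check described above can be automated by scanning a log for "units pulled: N" entries and testing whether any N is greater than zero. The sketch below assumes the log line format shown earlier in this thread. The function names are hypothetical.

```python
import re

# Matches the counter at the end of lines such as
# "... agent_0.idler._check_units_cb: INFO : units pulled: 0"
PULLED = re.compile(r"units pulled:\s*(\d+)")


def pulled_counts(lines):
    """Return the list of 'units pulled' counters found in the given lines."""
    counts = []
    for line in lines:
        match = PULLED.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def any_units_pulled(lines):
    """True if at least one pull returned one or more units."""
    return any(n > 0 for n in pulled_counts(lines))
```

Because open file objects iterate line by line, a log file can be passed directly, for example any_units_pulled(open("agent_0.log")).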
What I could not parse from your last message: is the error completely gone right now?
No, the error still happens about 50% of the time. I'm not sure why it fails in some cases and works in others.
Can you please capture log from client and pilot side for one of the failing runs? Thanks!
the two above logs are from the same failing run
Ah, right - but re.session.leonardo.rice.edu.eh22.017816.0002-client.zip contains the pilot logs, not the client logs (despite the name - sorry I missed the name the first time...)
should be the correct one: re.session.leonardo.rice.edu.eh22.017816.0002-client2.zip
I see no units being submitted for this run, at all. I'm afraid I can't judge what causes this. @vivek-bala : any feedback?
Currently, I don't see any repeat of this "no units submitted" issue. If it happens again I will open a new issue.
I deleted a directory called record and this error happened, is the record directory important?