radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work

record directory error #98

Closed euhruska closed 5 years ago

euhruska commented 5 years ago

I deleted a directory called record, and then this error occurred. Is the record directory important?

2018-10-03 15:28:55,767: radical.entk.resource_manager.0000: MainProcess                     : MainThread     : ERROR   : Resource request submission failed
2018-10-03 15:28:55,767: radical.entk.appmanager.0000: MainProcess                     : MainThread     : ERROR   : Error in AppManager: [Errno 2] No such file or directory: 'record/pilot.0000.batch.000.json'
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 310, in run
    self._resource_manager._submit_resource_request()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 155, in _submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 514, in submit_pilots
    % (self._session._rec, pilot.uid, self._rec_id))
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/utils/read_json.py", line 70, in write_json
    with open (filename, 'w') as f :
IOError: [Errno 2] No such file or directory: 'record/pilot.0000.batch.000.json'
2018-10-03 15:28:55,769: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating WFprocessor
2018-10-03 15:28:55,770: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : DEBUG   : WFprocessor process already terminated
Error: [Errno 2] No such file or directory: 'record/pilot.0000.batch.000.json'
Traceback (most recent call last):
  File "extasy_tica3.py", line 322, in <module>
    appman.run()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 310, in run
    self._resource_manager._submit_resource_request()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 155, in _submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 514, in submit_pilots
    % (self._session._rec, pilot.uid, self._rec_id))
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/utils/read_json.py", line 70, in write_json
    with open (filename, 'w') as f :
IOError: [Errno 2] No such file or directory: 'record/pilot.0000.batch.000.json'
euhruska commented 5 years ago

Also, how do I fix this?

vivek-bala commented 5 years ago

I am not sure where exactly the error is coming from. Can you link the script, specifically where you try to delete the record?

euhruska commented 5 years ago

I deleted that directory manually. I wouldn't have done that if I had known it was important. The script is exactly the same as before; I just updated the RP environment. Should I create an empty record directory? I just don't want to lose the previous EnTK run with all the iterations already done.
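For context, the traceback above fails in a plain open(filename, 'w') call, and Python's open() does not create missing parent directories. The sketch below reproduces that failure in a temporary directory and shows a hypothetical helper, write_text, that creates the parent directory first. It only illustrates the Python behaviour; whether recreating record is the right fix for RP is not established by this example.

```python
import errno
import os
import tempfile


def write_text(path, text):
    """Hypothetical helper: create the parent directory, then write the file."""
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    with open(path, "w") as f:
        f.write(text)


# Work in a scratch directory so nothing real is touched.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "record", "pilot.0000.batch.000.json")

# Opening for writing fails because the 'record' directory does not exist.
try:
    with open(target, "w"):
        pass
    missing_parent_error = None
except IOError as exc:
    missing_parent_error = exc.errno  # errno.ENOENT, as in the traceback

# With the parent created first, the same write succeeds.
write_text(target, "{}")
```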

vivek-bala commented 5 years ago

Hmmm. This is coming from RP and I am not sure why that json file is required. Can you give me the version and/or branch of RP that causes this issue?

euhruska commented 5 years ago
radical-stack
  python               : 2.7.14
  pythonpath           :
  virtualenv           : extasy11

  radical.analytics    : v0.45.2-102-gaec2e1d@devel
  radical.entk         : 0.7.6-0.7.6-1-g9a95cf49@fix-issue_extasy_96
  radical.pilot        : 0.50.8-v0.50.8-6-g6e36785f@devel
  radical.utils        : 0.50.1-v0.50.1-1-gce60836@devel
  saga                 : 0.50.0-v0.50.0@devel
euhruska commented 5 years ago

I tried to revert EnTK from fix-issue_extasy_96, but that only led to:

Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 557, in _wfp
    raise EnTKError(text=ex)
TypeError: __init__() got an unexpected keyword argument 'text'
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 503, in _wfp
    self._enqueue_thread.join()
AttributeError: 'NoneType' object has no attribute 'join'
euhruska commented 5 years ago

and

2018-10-05 18:56:50,703: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : ERROR   : Heartbeat failed with error: unsupported operand type(s) for +: 'float' and 'str'
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 145, in _heartbeat
    mq_connection.sleep(self._hb_interval)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 771, in sleep
    deadline = time.time() + duration
TypeError: unsupported operand type(s) for +: 'float' and 'str'

radical-stack

  python               : 2.7.14
  pythonpath           :
  virtualenv           : extasy11

  radical.analytics    : v0.45.2-102-gaec2e1d@devel
  radical.entk         : 0.7.4-0.7.4@HEAD-detached-at-af273638
  radical.pilot        : 0.50.8-v0.50.8-6-g6e36785f@devel
  radical.utils        : 0.50.1-v0.50.1-1-gce60836@devel
  saga                 : 0.50.0-v0.50.0@devel

Currently I can't run anything because of this.

andre-merzky commented 5 years ago

Please unset RADICAL_PILOT_RECORD and RADICAL_RECORD.
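A minimal sketch of doing this from Python rather than the shell, for example at the top of the driver script. The variable names are the ones mentioned in this thread; clear_record_vars is a hypothetical helper, and which of these variables a given RP version actually reads is an assumption.

```python
import os

# Names taken from this thread; treat the list as an assumption, not an
# authoritative inventory of what RP inspects.
RECORD_VARS = (
    "RADICAL_PILOT_RECORD",
    "RADICAL_RECORD",
    "RADICAL_PILOT_RECORD_SESSION",
)


def clear_record_vars(environ=os.environ):
    """Remove the recording-related variables and return the names that were set."""
    removed = []
    for name in RECORD_VARS:
        if name in environ:
            del environ[name]
            removed.append(name)
    return removed
```

Calling clear_record_vars() before any radical module is imported affects only the current process and its children; the shell that launched the script keeps its own environment.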

vivek-bala commented 5 years ago

Eugen, I am not sure why you had to revert EnTK. Can you elaborate on that, please?

TypeError: unsupported operand type(s) for +: 'float' and 'str'

This error is the one fixed in fix/issue_extasy_96 in EnTK.
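To illustrate the failure mode only: the traceback shows time.time() + duration raising because duration is a string. The sketch below reproduces that and shows one way to guard against it by converting to float first. The helper name deadline_after is hypothetical, and how the EnTK branch actually fixes the problem is not shown in this thread.

```python
import time


def deadline_after(duration):
    """Compute a deadline, accepting a number or a numeric string (assumed input)."""
    return time.time() + float(duration)


# Reproduce the error from the traceback: adding a str to a float.
try:
    time.time() + "10"
    message = None
except TypeError as exc:
    message = str(exc)

# Converting first avoids the TypeError, e.g. when the interval was read
# from an environment variable or config file as text.
deadline = deadline_after("10")
```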

euhruska commented 5 years ago

I reverted EnTK only because I didn't know how else to fix the record issue. I have now unset RADICAL_PILOT_RECORD_SESSION (I had not set RADICAL_PILOT_RECORD or RADICAL_RECORD before). That helped, but now I get an error in the remote pilot.0000/agent_0.err:

2018-10-07 16:43:59,120: radical.saga.cpi    : MainProcess                     : MainThread     : INFO    :                      pid/tid: 4901/MainThread
2018-10-07 16:43:59,121: radical.saga.api    : MainProcess                     : MainThread     : INFO    : default ssh key at /u/sciteam/hruska/.ssh/id_rsa
2018-10-07 16:43:59,144: radical.saga.cpi    : MainProcess                     : MainThread     : INFO    : init SSH context for key  at '/u/sciteam/hruska/.ssh/id_rsa' done
2018-10-07 16:43:59,145: radical.saga        : MainProcess                     : MainThread     : DEBUG   : default context [saga.adaptor.ssh    ] : {'LifeTime' : '-1', 'Type' : 'ssh', 'UserCert' : '/u/sciteam/hruska/.ssh/id_rsa.pub', 'UserKey' : '/u/sciteam/hruska/.ssh/id_rsa'}
2018-10-07 16:43:59,153: radical.saga        : MainProcess                     : MainThread     : WARNING : cannot initialize context - no session: {'LifeTime' : '-1', 'Type' : 'ssh', 'UserCert' : '/u/sciteam/hruska/.ssh/id_rsa.pub', 'UserKey' : '/u/sciteam/hruska/.ssh/id_rsa'}
Resource temporarily unavailable (src/thread.cpp:191)
Traceback (most recent call last):
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/bin/radical-pilot-agent", line 71, in <module>
    bootstrap_3(sys.argv[1])
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/bin/radical-pilot-agent", line 42, in bootstrap_3
    if agent_name == 'agent_0': agent = rpa.Agent_0(agent_name)
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/agent_0.py", line 95, in __init__
    session = rp_Session(cfg=session_cfg, uid=self._sid)
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/session.py", line 271, in __init__
    self._bridges    = ruc.start_bridges   (self._cfg, self, self._log)
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 155, in start_bridges
    bridge = rpu_Pubsub(session, bname, rpu_PUBSUB_BRIDGE, bcfg_clone)
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/pubsub.py", line 139, in __init__
    self.start()
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/re.session.leonardo.rice.edu.eh22.017811.0000/pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/process.py", line 504, in start
    (self._ru_childname, msg))
RuntimeError: child state.pubsub.bridge.0000.child failed to come up [s]
andre-merzky commented 5 years ago

I have now unset RADICAL_PILOT_RECORD_SESSION, I haven't set before RADICAL_PILOT_RECORD and RADICAL_RECORD

I can't really tell how the recording code paths were triggered without those variables being set, sorry. If you have a full set of log files for a session with that problem, I would like to have a look. Either way, I am glad this seems resolved.

that helped but I get an error in remote pilot.0000/agent_0.err:

What are the client and target machines for this run? Can you please provide a set of log files? Thanks!

euhruska commented 5 years ago

The client is leonardo at Rice and the target is Blue Waters. I tried fetching the log files, but I got:

radical-pilot-fetch-logfiles re.session.leonardo.rice.edu.eh22.017811.0002/
2018-10-08 09:27:29,804: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.14 | packaged by conda-forge | (default, Mar 30 2018, 18:16:04) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
2018-10-08 09:27:29,804: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : radical.pilot.utils  version: 0.50.8-v0.50.8-6-g6e36785f@devel
2018-10-08 09:27:29,804: radical.pilot.utils : MainProcess                     : MainThread     : INFO    :                      pid/tid: 13106/MainThread
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/bin/radical-pilot-fetch-logfiles", line 88, in <module>
    rpu.fetch_logfiles(sid=sid, dburl=dburl, src=src, tgt=tgt, access=access, skip_existing=skip)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/utils/session.py", line 254, in fetch_logfiles
    json_docs = get_session_docs(db, sid)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/utils/db_utils.py", line 88, in get_session_docs
    raise ValueError ('no session %s in db (was `cleanup` disabled on `session.close()`?)' % sid)
ValueError: no session re.session.leonardo.rice.edu.eh22.017811.0002/ in db (was `cleanup` disabled on `session.close()`?)

Here is the remote directory: re.session.leonardo.rice.edu.eh22.017811.0002.zip

vivek-bala commented 5 years ago

Can you ensure that the RADICAL_PILOT_DBURL env variable was set in the same session where you are executing radical-pilot-fetch-logfiles?
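A small pre-flight check along these lines might help, as a sketch only. It verifies that RADICAL_PILOT_DBURL is set in the current environment and also strips a trailing slash from the session ID, since the ValueError above reports the ID as ending in "/". Whether that slash contributed to the failed lookup is a guess; preflight is a hypothetical helper, not part of RP.

```python
import os


def preflight(sid, environ=os.environ):
    """Return (clean_sid, problems) to inspect before fetching log files."""
    problems = []
    if not environ.get("RADICAL_PILOT_DBURL"):
        problems.append("RADICAL_PILOT_DBURL is not set")
    # Shell tab completion on a session directory appends '/', which is
    # then passed along as part of the session ID.
    clean_sid = sid.rstrip("/")
    if clean_sid != sid:
        problems.append("session ID had a trailing '/'")
    return clean_sid, problems
```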

euhruska commented 5 years ago

yes, $RADICAL_PILOT_DBURL was set

euhruska commented 5 years ago

But now it worked, and I'm not sure what changed. Attached: re.session.leonardo.rice.edu.eh22.017811.0002-client.zip

andre-merzky commented 5 years ago

I am working on the child startup error on BW - this problem came up for multiple users, and is likely related to a change in system configuration (see https://github.com/radical-collaboration/extasy-bpti/issues/9)
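One thing that can be checked from a compute or login node, as a sketch and without any claim that it is the cause here: the "Resource temporarily unavailable" message in agent_0.err is the text for EAGAIN, which thread or process creation can return when a per-user limit is reached. The snippet below prints the relevant limits using Python's standard resource module; limit_report is a hypothetical helper name.

```python
import resource


def limit_report():
    """Return {limit name: (soft, hard)} for limits that exist on this platform."""
    report = {}
    for name in ("RLIMIT_NPROC", "RLIMIT_NOFILE"):
        const = getattr(resource, name, None)
        if const is None:
            continue  # not every platform defines every limit
        report[name] = resource.getrlimit(const)
    return report


for name, (soft, hard) in sorted(limit_report().items()):
    # resource.RLIM_INFINITY means "unlimited"
    print(name, "soft:", soft, "hard:", hard)
```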

euhruska commented 5 years ago

Any update?

andre-merzky commented 5 years ago

Unfortunately only that fixing this will take a while - but I hope to get BW back to working during this week.

andre-merzky commented 5 years ago

Can you please give the RP release 0.50.12 a try? Thanks

euhruska commented 5 years ago

Installed on the client, no problem. On the remote I installed it with sh radical-pilot-create-static-ve ve.ncsa.bw_aprun.0.50.12 bw, but agent_0.log got stuck at:

2018-10-12 17:12:54,287: agent_0             : MainProcess                     : MainThread     : INFO    : get msg: alive
2018-10-12 17:12:54,287: agent_0             : MainProcess                     : MainThread     : DEBUG   : ru_initialize_common (NOOP)
2018-10-12 17:12:54,287: agent_0             : MainProcess                     : MainThread     : DEBUG   : ru_initialize_parent (NOOP)
2018-10-12 17:12:54,287: agent_0             : MainProcess                     : MainThread     : DEBUG   : child thread agent_0.idler._check_units_cb started
2018-10-12 17:12:54,288: agent_0             : MainProcess                     : MainThread     : DEBUG   : agent_0 registered idler agent_0.idler._check_units_cb
2018-10-12 17:12:54,288: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: DEBUG   : alive check in term
2018-10-12 17:12:54,288: agent_0             : MainProcess                     : MainThread     : DEBUG   : process class started (no child)
2018-10-12 17:12:54,315: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-12 17:12:55,368: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-12 17:12:56,396: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-12 17:12:57,424: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-12 17:12:58,454: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-12 17:12:59,482: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
2018-10-12 17:13:00,510: agent_0             : MainProcess                     : agent_0.idler._check_units_cb: INFO    : units pulled:    0
euhruska commented 5 years ago

Here are the remote logs: re.session.leonardo.rice.edu.eh22.017816.0002.zip

andre-merzky commented 5 years ago

units pulled: 0

Indeed, the pilot is not able to pull any units. I suspect that either the units fail on the client side, or EnTK is not submitting any?

euhruska commented 5 years ago

I reran the command several times: three times I got units pulled: 0, and three times everything worked (units were pulled and executed correctly; I've seen failures even after the first success). I literally didn't change anything, so that's weird. At least I got some more iterations.

On the client side I see units pulled: 0 in umgr.0000.log. Client logs: re.session.leonardo.rice.edu.eh22.017816.0002-client.zip

andre-merzky commented 5 years ago

If the run worked, you will see at least one log entry with units pulled: n (with n greater than 0) - but there will still be many pulls which do not get new units. The umgr entries show the units being pulled back from the agent to the client side.
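A quick way to check this in a log file, as a sketch: scan for the "units pulled:" entries and count how many polls returned at least one unit. The regular expression is based on the log lines quoted in this thread; pull_summary is a hypothetical helper.

```python
import re

# Matches e.g. "... : INFO    : units pulled:    0"
PULLED = re.compile(r"units pulled:\s*(\d+)")


def pull_summary(lines):
    """Summarise 'units pulled' entries from an iterable of log lines."""
    counts = []
    for line in lines:
        match = PULLED.search(line)
        if match:
            counts.append(int(match.group(1)))
    return {
        "polls": len(counts),                         # total pull attempts logged
        "nonempty": sum(1 for c in counts if c > 0),  # pulls that returned units
        "units": sum(counts),                         # total units pulled
    }
```

Usage would be something like pull_summary(open("agent_0.log")); a result with nonempty equal to 0 corresponds to the stuck runs described above.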

What I could not parse from your last message: is the error completely gone right now?

euhruska commented 5 years ago

No, the error still happens about 50% of the time. I'm not sure why it fails in some cases and works in others.

andre-merzky commented 5 years ago

Can you please capture logs from the client and pilot side for one of the failing runs? Thanks!

euhruska commented 5 years ago

The two logs above are from the same failing run.

andre-merzky commented 5 years ago

Ah, right - but re.session.leonardo.rice.edu.eh22.017816.0002-client.zip contains the pilot logs, not the client logs (despite the name - sorry I missed the name the first time...)

euhruska commented 5 years ago

This should be the correct one: re.session.leonardo.rice.edu.eh22.017816.0002-client2.zip

andre-merzky commented 5 years ago

I see no units being submitted for this run, at all. I'm afraid I can't judge what causes this. @vivek-bala : any feedback?

euhruska commented 5 years ago

Currently, I don't see any repeat of this "no units submitted" issue. If it happens again, I will open a new issue.