radical-cybertools / radical.entk

The RADICAL Ensemble Toolkit

Fatal error #134

Closed antonst closed 8 years ago

antonst commented 8 years ago

I tried to run get_started.py locally on a workflow machine with:

RADICAL_ENTK_VERBOSE=REPORT python get_started.py

and got this error:

Job waiting on queue...2016-05-13 15:51:50,560: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: ERROR   : Using bootstrapper /home/antontre/myenv/lib/python2.7/site-packages/radical/pilot/bootstrapper/bootstrap_1.sh
Copying bootstrapper 'file://localhost/home/antontre/myenv/lib/python2.7/site-packages/radical/pilot/bootstrapper/bootstrap_1.sh' to agent sandbox (<saga.filesystem.directory.Directory object at 0x7f2a7eef94d0>).
Copying sdist 'file://localhost/home/antontre/myenv/lib/python2.7/site-packages/radical/utils/radical.utils-0.40.tar.gz' to sandbox (file://localhost/home/antontre/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.antontre.016934.0000-pilot.0000/).
Copying sdist 'file://localhost/home/antontre/myenv/lib/python2.7/site-packages/saga/saga-python-0.40.2.tar.gz' to sandbox (file://localhost/home/antontre/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.antontre.016934.0000-pilot.0000/).
Copying sdist 'file://localhost/home/antontre/myenv/lib/python2.7/site-packages/radical/pilot/controller/..//radical.pilot-0.40.2.tar.gz' to sandbox (file://localhost/home/antontre/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.antontre.016934.0000-pilot.0000/).
Writing agent configuration to file '/tmp/rp_agent_cfg_dirooWNz5/agent_0.cfg'.
Copying agent configuration file 'file://localhost/tmp/rp_agent_cfg_dirooWNz5/agent_0.cfg' to sandbox (file://localhost/home/antontre/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.antontre.016934.0000-pilot.0000/).
Pilot launching failed! (failed to run bootstrap: (127)(/bin/sh: /home/antontre/.saga/adaptors/shell_job/wrapper.sh: No such file or directory
) (/home/antontre/myenv/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py +605 (initialize)  :  raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out))))
Traceback (most recent call last):
  File "/home/antontre/myenv/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 712, in run
    js = saga.job.Service(js_url, session=self._session)
  File "/home/antontre/myenv/lib/python2.7/site-packages/saga/job/service.py", line 115, in __init__
    url, session, ttype=_ttype)
  File "/home/antontre/myenv/lib/python2.7/site-packages/saga/base.py", line 101, in __init__
    self._init_task = self._adaptor.init_instance (adaptor_state, *args, **kwargs)
  File "/home/antontre/myenv/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/antontre/myenv/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py", line 507, in init_instance
    self.initialize ()
  File "/home/antontre/myenv/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py", line 605, in initialize
    raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out))
NoSuccess: failed to run bootstrap: (127)(/bin/sh: /home/antontre/.saga/adaptors/shell_job/wrapper.sh: No such file or directory
) (/home/antontre/myenv/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py +605 (initialize)  :  raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out)))
2016-05-13 15:51:50,837: radical.entk.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Resource error: 
2016-05-13 15:51:50,838: radical.entk.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Pattern execution FAILED.
2016-05-13 15:51:50,838: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : sys.exit from callback
Traceback (most recent call last):
  File "/home/antontre/myenv/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/home/antontre/myenv/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 168, in pilot_state_cb
SystemExit: 1
2016-05-13 15:51:51,323: radical.entk.SingleClusterEnvironment: MainProcess                     : MainThread     : ERROR   : Fatal error during execution: .
Fatal error during execution: .
Starting Deallocation..
2016-05-13 15:51:51,324: radical.entk.SingleClusterEnvironment: MainProcess                     : MainThread     : ERROR   : Fatal error during execution: .
Fatal error: .  File "/home/antontre/myenv/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 312, in run
    plugin.execute_pattern(pattern, self)
  File "/home/antontre/myenv/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/bag_of_tasks/static.py", line 100, in execute_pattern
  File "/home/antontre/myenv/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 532, in wait_pilots

antonst commented 8 years ago

btw, the second attempt succeeded

antonst commented 8 years ago

slightly off topic, but is it possible to get rid of this message:

2016-05-13 16:03:09,738: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : Couldn't call manager callback (no pilot instance)

When it pops up in the middle of the printed messages, it is quite confusing for the user:

 EnsembleMD (0.4-RC0)                                                           

Starting Allocation2016-05-13 16:03:09,738: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : Couldn't call manager callback (no pilot instance)
Verifying pattern                                                             ok
Starting pattern execution                                                    ok
Executing 1 instances of 1 stages on 1 allocated core(s) on 'xsede.stampede'    

Job waiting on queue...
Job is now running !
Waiting for stage_1 to complete.                                            done
Pattern execution successfully finished                                         

Starting Deallocation..
Resource allocation cancelled.                                              done 

vivek-bala commented 8 years ago

This looks like it's coming from the Pilot layer (or possibly SAGA):

Copying agent configuration file 'file://localhost/tmp/rp_agent_cfg_dirooWNz5/agent_0.cfg' to sandbox (file://localhost/home/antontre/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.antontre.016934.0000-pilot.0000/).
failed to run bootstrap: (127)(/bin/sh: /home/antontre/.saga/adaptors/shell_job/wrapper.sh: No such file or directory

I'll ping Andre regarding this once he is online.

vivek-bala commented 8 years ago

That message shouldn't actually appear if you are setting verbosity to REPORT. Wait, that's an RP message. Do you have any value set for RADICAL_PILOT_VERBOSE?

antonst commented 8 years ago

Do you have any value set for RADICAL_PILOT_VERBOSE ?

no, I don't, RADICAL_PILOT_VERBOSE is set to None

andre-merzky commented 8 years ago

re wrapper problem: yeah, that appeared a couple of times already. Seems like a saga change is not as backward compatible as we thought. What helps is to run rm -rf ~/.saga on the target machine.
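
For anyone who prefers to script the cleanup, here is a minimal Python sketch of the same workaround. The helper name `clear_saga_cache` and its `home` parameter are made up for illustration and are not part of saga or RP. It assumes the stale wrapper lives under `~/.saga`, as the traceback above suggests, and it has to run on the target machine.

```python
import shutil
from pathlib import Path


def clear_saga_cache(home=None):
    """Remove <home>/.saga if it exists.

    Hypothetical helper (not a saga API): mirrors `rm -rf ~/.saga`.
    Returns True if a directory was removed, False if there was nothing to do.
    """
    base = Path(home) if home is not None else Path.home()
    cache_dir = base / ".saga"
    if not cache_dir.is_dir():
        return False
    # Deletes the whole tree, including adaptors/shell_job/wrapper.sh
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_saga_cache()` with no argument targets the current user's home directory. Presumably saga recreates the directory on the next run, which would fit the report that the second attempt succeeded, but that is an assumption.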

error log: the reporter is letting all messages with ERROR level through. The only fix would thus be to lower the log level on that message. I'll do that if we happen to push a new release, otherwise that's something we'll have to live with...
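
To illustrate the point about log levels, here is a small sketch using only Python's standard `logging` module. It is not the actual reporter code: the logger name `demo.reporter` and the helper `make_error_only_logger` are hypothetical stand-ins for an output channel that lets ERROR and above through.

```python
import io
import logging


def make_error_only_logger(stream):
    """Build a logger whose handler passes only ERROR and above.

    Stand-in for the reporter's behaviour; the name is hypothetical.
    """
    logger = logging.getLogger("demo.reporter")
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.ERROR)  # threshold: ERROR and above get through
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger


MESSAGE = "Couldn't call manager callback (no pilot instance)"

stream = io.StringIO()
log = make_error_only_logger(stream)
log.error(MESSAGE)    # logged at ERROR: passes the threshold and is shown
log.warning(MESSAGE)  # same text demoted to WARNING: filtered out
output = stream.getvalue()
```

With the threshold at ERROR, only the first call ends up in `output`. Demoting the message to WARNING is enough to keep it out of the report.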

vivek-bala commented 8 years ago

Thanks Andre. I'm closing the ticket since it's a known issue. Please let me know if there isn't a ticket for this in saga/rp, I'll create one.