radical-cybertools / radical.saga

A Light-Weight Access Layer for Distributed Computing Infrastructure and Reference Implementation of the SAGA Python Language Bindings.
http://radical-cybertools.github.io/saga-python/

failed to run bootstrap #472

Open Francis-Liu opened 9 years ago

Francis-Liu commented 9 years ago

This is the error:

Traceback (most recent call last):
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/aimes/bundle/agent/bundle_agent.py", line 826, in create
    dbs=dbs)
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/aimes/bundle/agent/bundle_agent.py", line 544, in __init__
    super(SlurmAgent, self).__init__(resource_config, dbs)
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/aimes/bundle/agent/bundle_agent.py", line 36, in __init__
    self.start_bw_server()
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/aimes/bundle/agent/bundle_agent.py", line 174, in start_bw_server
    js = saga.job.Service(REMOTE_JOB_ENDPOINT, session=session)
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/saga/job/service.py", line 115, in __init__
    url, session, ttype=_ttype)
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/saga/base.py", line 101, in __init__
    self._init_task = self._adaptor.init_instance (adaptor_state, *args, **kwargs)
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py", line 469, in init_instance
    self.initialize ()
  File "/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py", line 567, in initialize
    raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out))
NoSuccess: failed to run bootstrap: (127)(/bin/sh: /home1/02768/liux2102/.saga/adaptors/shell_job/wrapper.sh: No such file or directory
) (/home/grad03/fengl/mypyenv2.sc2014.demo/local/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py +567 (initialize)  :  raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out)))

This is my code:

    def start_bw_server(self):
        """Run an iperf program in server mode.
        """
        REMOTE_JOB_ENDPOINT  = "ssh://"  + self._login_server
        REMOTE_DIR = "sftp://" + self._login_server + "/tmp/aimes.bundle/iperf/"

        ctx = saga.Context("ssh")
        ctx.user_id = self._username

        session = saga.Session()
        session.add_context(ctx)

        workdir   = saga.filesystem.Directory(REMOTE_DIR, saga.filesystem.CREATE_PARENTS, session=session)
        mbwrapper = saga.filesystem.File(
                'file://localhost/%s/start-iperf-server-daemon.sh' % os.path.dirname(__file__))
        mbwrapper.copy(workdir.get_url())
        mbexe     = saga.filesystem.File(
                'file://localhost/%s/../third_party/iperf-3.0.11-source.tar.gz' % os.path.dirname(__file__))
        mbexe.copy(workdir.get_url())

        js = saga.job.Service(REMOTE_JOB_ENDPOINT, session=session)
        jd = saga.job.Description()

        jd.executable        = "./start-iperf-server-daemon.sh"
        iperf_local_port     = 55201
        jd.arguments         = [iperf_local_port]

        print "not returning?"
        myjob = js.create_job(jd)
        print "not returning?"
        myjob.run()

Francis-Liu commented 9 years ago

I was running this code to launch a script on Stampede.

andre-merzky commented 8 years ago

Hi Francis,

Sorry for the late reply on this. If the problem still persists, please rerun with SAGA_VERBOSE=DEBUG and post the resulting output (or feel free to send it by mail, it's probably long...).

Thanks!
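
A minimal sketch of the rerun requested above, assuming the failing script can be started as a child process and that Python 3 is available for the launcher. `run_with_saga_debug` and the log file name are hypothetical; only the `SAGA_VERBOSE=DEBUG` setting comes from this thread.

```python
import os
import subprocess
import sys


def run_with_saga_debug(cmd, log_path):
    """Run `cmd` with SAGA_VERBOSE=DEBUG and write stdout+stderr to `log_path`.

    Hypothetical helper, not part of SAGA. Returns the child's exit code.
    """
    env = dict(os.environ)          # copy, so the parent environment is untouched
    env["SAGA_VERBOSE"] = "DEBUG"   # the setting asked for in the comment above
    with open(log_path, "wb") as log:
        # Merge stderr into stdout so log lines and the traceback stay in order.
        return subprocess.call(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
```

For example, `run_with_saga_debug([sys.executable, "my_script.py"], "saga_debug.log")` would leave one file that can be attached to the issue or mailed; `my_script.py` is a placeholder for whatever script triggers the error.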

jcohen02 commented 8 years ago

Just to confirm that I'm also intermittently seeing this exact same issue (running v0.40.1).

  File "/.../my_class.py", line 89, in _create_service
    service = Service(service_url, session=session)
  File "/.../saga/job/service.py", line 115, in __init__
    url, session, ttype=_ttype)
  File "/.../saga/base.py", line 101, in __init__
    self._init_task = self._adaptor.init_instance (adaptor_state, *args, **kwargs)
  File "/.../saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/.../saga/adaptors/shell/shell_job.py", line 510, in init_instance
    self.initialize ()
  File "/.../saga/adaptors/shell/shell_job.py", line 608, in initialize
    raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out))
NoSuccess: failed to run bootstrap: (127)(/bin/sh: /.../.saga/adaptors/shell_job/wrapper.sh: No such file or directory
) (/.../saga/adaptors/shell/shell_job.py +608 (initialize)  :  raise saga.NoSuccess ("failed to run bootstrap: (%s)(%s)" % (ret, out)))

It happens infrequently when creating an instance of saga.job.Service. I can't find a way to reproduce it at present, and unfortunately I didn't have debug switched on when I got this error a short while ago, but I'll update this if I can provide further info.

andre-merzky commented 8 years ago

Hey - a debug log would be great indeed. I have not yet seen this issue popping up.

If it happens again, would you please also include the output of ls -la $HOME/.saga/adaptors/shell_job/ (taken before any other job service instance is created)? Is your code using one or more job services concurrently?

Thanks, Andre.
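
One way to capture that listing from inside the failing program, sketched under the assumption that the code runs on the machine where the directory lives (for a remote target, that is the remote host, not the submitting one). `snapshot_dir` is a hypothetical helper written for Python 3; it approximates `ls -la` and is not a SAGA function.

```python
import os
import stat
import time


def snapshot_dir(path):
    """Return a listing of `path` roughly like `ls -la`, one line per entry.

    Hypothetical helper. If `path` is missing, a single explanatory line
    is returned instead of raising, since "missing" is itself useful evidence.
    """
    path = os.path.expanduser(path)
    if not os.path.isdir(path):
        return ["%s: not a directory or missing" % path]
    lines = []
    for name in sorted(os.listdir(path)):
        st = os.lstat(os.path.join(path, name))   # lstat: do not follow symlinks
        mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
        lines.append("%s %8d %s %s" % (stat.filemode(st.st_mode), st.st_size, mtime, name))
    return lines
```

Calling `print("\n".join(snapshot_dir("~/.saga/adaptors/shell_job")))` in the exception handler, before any other job service is created, would record whether `wrapper.sh` was present at the moment of failure.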

andre-merzky commented 8 years ago

Ah, I forgot to ask: when you say "exact same issue", does that also mean toward Stampede, or are you targeting a different resource? Thanks!

jcohen02 commented 8 years ago

When I said the "exact same issue" I meant the same error occurring at the same point in the session initialisation code. Aside from the different line numbers, presumably due to other changes in the more recent release that I'm using, the stack trace appears to be the same.

The specific error log posted above was from attempting to run a test job on localhost via SSH.

andre-merzky commented 8 years ago

thanks, got it!

vivek-bala commented 5 years ago

I think we occasionally still face this issue. The suggestion so far has been to remove $HOME/.saga, right? @andre-merzky are there any ideas on how this can be addressed in the current version or v2?

andre-merzky commented 5 years ago

Removing that dir gets you running again, indeed. This needs a proper investigation and fix though. I did not manage to find the underlying cause. It's likely a time-consuming effort to fix, but it needs to be done eventually IMHO.
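
A minimal sketch of that workaround. In the first traceback the missing `wrapper.sh` sits under a different home directory than the local Python environment, which suggests (as an inference, not something confirmed in this thread) that the directory to clear is the one on the target host. `reset_saga_dir` is a hypothetical helper; keeping a backup instead of deleting outright is a variation added here so the old contents remain available for debugging.

```python
import os
import shutil
import time


def reset_saga_dir(saga_dir="~/.saga", keep_backup=True):
    """Clear the SAGA state directory so the next job service starts fresh.

    Hypothetical helper, not part of SAGA. Returns the backup path, or None
    if the directory did not exist or was deleted without a backup.
    """
    saga_dir = os.path.expanduser(saga_dir)
    if not os.path.isdir(saga_dir):
        return None                      # nothing to reset
    if keep_backup:
        # Rename aside with a timestamp suffix rather than deleting.
        backup = "%s.bak.%d" % (saga_dir, int(time.time()))
        shutil.move(saga_dir, backup)
        return backup
    shutil.rmtree(saga_dir)              # plain removal, as described in the thread
    return None
```

This only works around the symptom; it does not address whatever leaves the directory without `wrapper.sh` in the first place.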

jcohen02 commented 5 years ago

I took another look at this and have managed to recreate the problem (although not consistently) with DEBUG output enabled. I've provided @andre-merzky with some data to see whether the cause can be identified.

vivek-bala commented 5 years ago

Thanks @jcohen02 !

andre-merzky commented 5 years ago

Thanks a lot for the provided log files and analysis, Jeremy. I'll dig through this, let's see if we can nail this one down after all that time :-)