radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

experienced hanging #132

Closed andre-merzky closed 10 years ago

andre-merzky commented 10 years ago

The code (a simple test, but it never got far in the first place, so the script won't matter) hung for about 5 minutes. After aborting it manually, I saw the following stack trace:

2014:05:05 12:46:59 MainThread   saga.DefaultSession   : [DEBUG   ] default context [saga.adaptor.ssh    ] : {'Type' : 'ssh', 'UserCert' : '/home/merzky/.ssh/id_rsa.pub', 'UserKey' : '/home/merzky/.ssh/id_rsa'}

^CTraceback (most recent call last):
  File "shared_input_minimal.py", line 45, in <module>
    pilot = pmgr.submit_pilots(pdesc)
  File "/home/merzky/saga/saga-python/ve/local/lib/python2.7/site-packages/radical.pilot-0.11-py2.7.egg/radical/pilot/pilot_manager.py", line 314, in submit_pilots
Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
    session=self._session)
Traceback (most recent call last):
  File "/home/merzky/saga/saga-python/ve/local/lib/python2.7/site-packages/radical.pilot-0.11-py2.7.egg/radical/pilot/controller/pilot_manager_controller.py", line 313, in register_start_pilot_request
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    out, err = p.communicate()
  File "/usr/lib/python2.7/subprocess.py", line 798, in communicate
    return self._communicate(input)
  File "/usr/lib/python2.7/subprocess.py", line 1400, in _communicate
    stdout, stderr = self._communicate_with_poll(input)
  File "/usr/lib/python2.7/subprocess.py", line 1454, in _communicate_with_poll
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.run()
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    ready = poller.poll()
KeyboardInterrupt
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
    return recv()
KeyboardInterrupt
    racquire()
KeyboardInterrupt

This was a one-off; subsequent runs worked as expected, so I am not sure what to make of it...
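
For reference, the main-process part of the trace ends in a blocking `p.communicate()` call that was waiting in `poller.poll()` when interrupted. Below is a minimal sketch of how such a wait could be bounded, so that a stuck child process surfaces as an error instead of a hang. It assumes Python 3.3 or later (the run above used Python 2.7, where `communicate()` has no `timeout` parameter), and `run_with_timeout` is a hypothetical helper name, not part of radical.pilot:

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stdout, stderr).

    Hypothetical helper for illustration only; not radical.pilot code.
    Raises subprocess.TimeoutExpired if cmd runs longer than timeout_s seconds.
    """
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        # Unlike the Python 2.7 call in the trace, this wait is bounded.
        out, err = p.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        p.kill()         # stop the stuck child process
        p.communicate()  # reap it and drain its pipes
        raise
    return p.returncode, out, err


if __name__ == "__main__":
    # Run a trivial child process with a generous 10 second limit.
    rc, out, err = run_with_timeout([sys.executable, "-c", "print('ok')"], 10)
    print(rc, out)
```

This would not explain why the child process got stuck, but it would turn a silent hang into a traceback that points at the command that did not return.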

oweidner commented 10 years ago

Hi Andre, can you add your test script as tests/issue_132.py please? Thanks!

andre-merzky commented 10 years ago

The script ran until here:

        DBURL = os.getenv("RADICALPILOT_DBURL")
        RCONF = ["https://raw.github.com/radical-cybertools/radical.pilot/master/configs/xsede.json",
                 "https://raw.github.com/radical-cybertools/radical.pilot/master/configs/futuregrid.json"]

        session = radical.pilot.Session(database_url=DBURL)
        pmgr = radical.pilot.PilotManager(session=session, resource_configurations=RCONF)
        pdesc = radical.pilot.ComputePilotDescription()
        pdesc.resource = REMOTE_HOST
        pdesc.runtime = 15 # minutes
        pdesc.cores = 8

The remainder of the script won't run as is, since it needs a different branch and setup than RP proper.

oleweidner commented 10 years ago

You are using an old release of radical.pilot ("/home/merzky/saga/saga-python/ve/local/lib/python2.7/site-packages/radical.pilot-0.11-py2.7.egg/radical/pilot/pilot_manager.py"). Can you please test with the latest devel branch?

andre-merzky commented 10 years ago

The ticket is months old, hence the old version. It was a one-off I could not reproduce, so testing again will not help. I posted it in case you see something in the stack trace that I don't...

oleweidner commented 10 years ago

Doesn't seem to pop up in the 'devel' branch anymore.