hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

No active DRMAA session in segway #65

Open EricR86 opened 8 years ago

EricR86 commented 8 years ago

Original report (BitBucket issue) by Rachel Chan (Bitbucket: rcwchan).


I received the following error while trying to run segway in recover mode (note: this happened in every thread):

#!python

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/mnt/work1/software/python/2.7/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 440, in run
    self.result = self.runner.run_train_instance()
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2111, in run_train_instance
    round_index, kwargs)
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2118, in progress_train_instance
    self.run_train_round(self.instance_index, round_index, **kwargs)
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2070, in run_train_round
    restartable_jobs.wait()
  File "/mnt/work1/users/home2/rachelc/segway/segway/cluster/__init__.py", line 284, in wait
    job_info = session.wait(jobid, session.TIMEOUT_NO_WAIT)
  File "/mnt/work1/users/home2/rachelc/.local/lib/python2.7/site-packages/drmaa/session.py", line 471, in wait
    rusage)
  File "/mnt/work1/users/home2/rachelc/.local/lib/python2.7/site-packages/drmaa/helpers.py", line 299, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/work1/users/home2/rachelc/.local/lib/python2.7/site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
NoActiveSessionException: code 5: No active session

I also received this error:

#!python

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/mnt/work1/software/python/2.7/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 440, in run
    self.result = self.runner.run_train_instance()
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2093, in run_train_instance
    self.make_instance_initial_results()
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2415, in make_instance_initial_results
    log_likelihood)
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2396, in recover_train_instance
    path(old_curr_params_filename).copy2(new_curr_params_filename)
  File "/mnt/work1/software/python/2.7/lib/python2.7/shutil.py", line 128, in copy2
    copyfile(src, dst)
  File "/mnt/work1/software/python/2.7/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: path('/mnt/work1/users/hoffmangroup/rachelc/2016/semisupervised_tests/20160505_1409/results/20160513-1706/K562_5_Track.partial.traindir/params/params.0.params.79')

I checked, and params.0.params.79 did not originally exist in partial.traindir/. The initial job was killed using qdel (since it was qsub'd) and was probably in the midst of round 79 or so at the time.
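
For illustration, here is a minimal sketch of how a recover step could look for the newest parameter file that actually exists instead of assuming the last round finished. It is not segway's code: `find_last_params_file` is a hypothetical helper, and the only thing taken from the report is the `params.<instance>.params.<round>` naming seen in the traceback.

```python
import os
import re


def find_last_params_file(params_dir, instance_index):
    """Return (round_index, path) for the newest existing params file.

    Hypothetical helper, not part of segway. Assumes files are named
    params.<instance>.params.<round>, as in the traceback above.
    Returns None when no matching file exists.
    """
    pattern = re.compile(r"^params\.%d\.params\.(\d+)$" % instance_index)
    found = []
    for name in os.listdir(params_dir):
        match = pattern.match(name)
        if match:
            # Compare rounds as integers so that round 10 sorts after round 9.
            found.append((int(match.group(1)), os.path.join(params_dir, name)))
    return max(found) if found else None
```

A caller could copy the returned path rather than a hard-coded round number, and treat a `None` result as "nothing to recover" instead of failing with an `IOError`.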

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


EricR86 commented 8 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


My current theory is that this error originates in run_train_multithread:

#!python

        with Session() as session:
            try:
                for instance_index, instance_features in enumerator:
            ...
                    thread = TrainThread(self, session, instance_index,
                                         num_seg)
                    thread.start()
                    threads.append(thread)
            ...

            except KeyboardInterrupt:
                ... 
                for thread in threads:
                    thread.join()

                raise

It seems that one of the instances/threads raised an uncaught exception because it could not find a file. That would exit the context manager, invalidate the shared session object, and skip the graceful joining of the remaining threads.

Is there a reason why each thread doesn't have its own session object? Is that unsafe, perhaps?

It looks like the error is caused by a failed recover when trying to open a file that doesn't exist.
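
The theory above can be sketched without DRMAA at all. The example below is a minimal simulation, not segway's code: `FakeSession` is a hypothetical stand-in for a shared session that becomes unusable once its `with` block exits, and `run_workers` is an assumed name. It joins the worker threads in a `finally` clause, so the session stays open until every thread has finished, whatever exception caused the exit.

```python
import threading


class FakeSession(object):
    """Hypothetical stand-in for a shared cluster session (not the drmaa API)."""

    def __init__(self):
        self.active = False

    def __enter__(self):
        self.active = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.active = False  # the session is unusable after the block exits
        return False  # do not swallow exceptions

    def wait(self):
        if not self.active:
            raise RuntimeError("No active session")
        return "done"


def run_workers(results, num_workers, fail_in_main):
    """Run workers that share one session and join them before it closes."""
    release = threading.Event()

    def worker():
        release.wait()  # hold workers until the main thread releases them
        try:
            results.append(session.wait())
        except RuntimeError as error:
            results.append(str(error))

    with FakeSession() as session:
        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        try:
            for thread in threads:
                thread.start()
            if fail_in_main:
                raise IOError("simulated failure while the workers are running")
        finally:
            # Joining here, rather than only on KeyboardInterrupt, means the
            # `with` block cannot exit while a worker still needs the session.
            release.set()
            for thread in threads:
                thread.join()
```

With `fail_in_main=True` the `IOError` still propagates to the caller, but every worker completes its `wait()` call first. If the join is moved into an `except KeyboardInterrupt` clause only, the same failure closes the session while the workers are still blocked, and each of them then reports "No active session".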