hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Upon bundling, segway tries to reference an accumulator file for a run that was never queued in minibatch #70

Closed EricR86 closed 8 years ago

EricR86 commented 8 years ago

Original report (BitBucket issue) by Rachel Chan (Bitbucket: rcwchan).


When bundling in minibatch mode, segway tries to read an accumulator file for a job that was never queued. I have reproduced this twice now. In both instances, segway looked for an accumulator file that did not exist and then failed with the following error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/mnt/work1/software/python/2.7/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 442, in run
    self.result = self.runner.run_train_instance()
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2113, in run_train_instance
    round_index, kwargs)
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2120, in progress_train_instance
    self.run_train_round(self.instance_index, round_index, **kwargs)
  File "/mnt/work1/users/home2/rachelc/segway/segway/run.py", line 2079, in run_train_round
    restartable_jobs.wait()
  File "/mnt/work1/users/home2/rachelc/segway/segway/cluster/__init__.py", line 289, in wait
    self.process_job(jobid, job_info)
  File "/mnt/work1/users/home2/rachelc/segway/segway/cluster/__init__.py", line 244, in process_job
    (jobid, job_name, error_filename))
RuntimeError: 
Submitted Job (5700692) failed. Failed Job: emt0.0.bundle.K562_5_Track.traindir.d112e1a6280111e6b7f15254004fdc09.
For details, check error messages in /mnt/work1/users/hoffmangroup/rachelc/2016/semisupervised_tests/20160530_1458/results/20160601-1004/K562_5_Track.traindir/output/e/0/emt0.0.bundle.K562_5_Track.traindir.d112e1a6280111e6b7f15254004fdc09.
See the Troubleshooting section of the Segway documentation.
Error: Can't open file (/mnt/work1/users/hoffmangroup/rachelc/2016/semisupervised_tests/20160530_1458/results/20160601-1004/K562_5_Track.traindir/accumulators/acc.0.755.bin) for reading.
Error: Can't open file (/mnt/work1/users/hoffmangroup/rachelc/2016/semisupervised_tests/20160530_1458/results/20160601-1004/K562_5_Track.traindir/accumulators/acc.0.755.bin) for reading.

acc.0.755.bin does not exist in the accumulators folder. According to jobs.tab, the job for window 755 was never run, and according to the train log, it was never queued (which means minibatch did not choose it).

The current theory is that minibatch does not choose this window, but for some reason the window is requested when the bundling job runs. Segway then cannot find the accumulator file and errors out.
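
For anyone repeating this check, here is a minimal sketch of comparing the windows a bundle job is asked to load against the accumulator files actually on disk. The helper name `missing_accumulators` is hypothetical, and the `acc.<instance>.<window>.bin` naming pattern is an assumption inferred from the single filename in the error above, not taken from the segway source.

```python
import os


def missing_accumulators(acc_dir, instance_index, window_indices):
    """Return expected accumulator paths that are absent from acc_dir.

    Assumes files are named acc.<instance>.<window>.bin, as suggested by
    the acc.0.755.bin path in the error message (an inference only).
    """
    expected = (
        os.path.join(acc_dir, "acc.%d.%d.bin" % (instance_index, window))
        for window in window_indices
    )
    return [path for path in expected if not os.path.isfile(path)]
```

Passing the window list from the bundle job's arguments (for example, as recorded in run.sh) would list any window the bundle expects but that no training job produced.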

EricR86 commented 8 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


The arguments for loadAccRange are a comma-separated list and are not sorted for minibatch jobs. It looks like segway stops reading the list partway through for some reason, even though the correct arguments are written to run.sh and details.sh. The issue is not argument length, as much longer arguments have been passed in other runs. This appears to be the cause of a Range::Parse error I have been trying to solve, and it may be causing this error as well.

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


Just confirmed that they are the same issue, as I suspected. The parameter passed to SGE is cut off partway through the list of windows. The number of window indices that survive is not constant, but the number of characters, not counting spaces, appears to be constant at 1024. Cutting off in the middle of the window list, say "12,13,14", can cause either the Range::Parse error (e.g., "12,13,") or this missing accumulator error (e.g., "12,13,1", where 1 is not a valid window).
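
The two failure modes can be illustrated with a small sketch. This only simulates cutting a comma-separated window list at a fixed character count; it is not segway or SGE code, and the function names `truncate_argument` and `classify_truncation` are hypothetical. The 1024 limit is the value observed above.

```python
LIMIT = 1024  # cutoff observed in this thread; where it is enforced is not shown here


def truncate_argument(windows, limit=LIMIT):
    """Join window indices with commas and keep only the first `limit` characters."""
    return ",".join(str(w) for w in windows)[:limit]


def classify_truncation(windows, limit=LIMIT):
    """Describe how a cutoff at `limit` characters would affect the window list."""
    full = ",".join(str(w) for w in windows)
    if len(full) <= limit:
        return "intact"
    cut = full[:limit]
    if cut.endswith(","):
        return "trailing comma"  # like "12,13,": the range fails to parse
    if full[limit] == ",":
        return "clean cut"  # last index survives whole; later windows are dropped
    return "partial index"  # like "12,13,1": a different window is requested
```

For example, `classify_truncation([12, 13, 14], limit=6)` returns `"trailing comma"` and `classify_truncation([12, 13, 14], limit=7)` returns `"partial index"`. Because the list is unsorted and the indices vary in width, which case occurs depends on where the cutoff character happens to fall.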

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


This is caused by issue 70 (#72) and resolved in pull request #55.