Segway attempts to recover from last round run, rather than the one previous

hoffmangroup / segway

Application for semi-automated genomic annotation.

http://segway.hoffmanlab.org/

GNU General Public License v2.0


Segway attempts to recover from last round run, rather than the one previous #68

Open EricR86 opened 8 years ago

EricR86 commented 8 years ago

Original report (BitBucket issue) by Rachel Chan (Bitbucket: rcwchan).

After qdel'ing a Segway job and then recovering it, I found that Segway attempted to recover from the last round run, rather than the one before it. As a result, it looked for params files that did not yet exist. For instance, my job was qdel'ed at round 79, and Segway attempted to recover directly from round 79 when it should have recovered from round 78. It looked for params.0.params.79 in the train directory it was recovering from, but that file had not been written before the qdel.

I experimentally changed final_round_index in recover_train_instance in run.py to be len(log_likelihoods)-1 and tried again, and it recovered fine after that.
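A minimal sketch of the off-by-one described above, not Segway's actual code. Only the names `final_round_index` and `log_likelihoods` come from the report; the helper name `choose_recovery_round` is hypothetical, and the sketch assumes that the last log-likelihood entry may belong to a round whose params file was never written.

```python
def choose_recovery_round(log_likelihoods):
    """Return the round index to resume from (hypothetical helper).

    Assumption: log_likelihoods holds one entry per recorded round, and
    the final entry may correspond to a round that was interrupted
    before its params file was written.
    """
    # Step back one from the entry count, as in the experimental change,
    # and clamp at 0 so an empty list does not yield a negative index.
    final_round_index = max(len(log_likelihoods) - 1, 0)
    return final_round_index
```

With 79 recorded entries, this returns 78, which matches the round that recovered successfully in the report.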

EricR86 commented 8 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).

You qdeled the Segway runner rather than the sub-jobs?

How is log_likelihood written for a round that never completed?

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).

Edited issue description

EricR86 commented 8 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).

We weren't sure which of the many child processes belonged to which Segway run (I had 4 running at the same time), so we qdel'ed the parent and let the children finish. As far as I know, there is no way to send a SIGTERM specifically to a qsub'd job in SGE; it has to be qdel (the documentation is unclear on whether qdel sends a SIGTERM or a SIGKILL).

I am not sure how the log_likelihoods file could be written for a round that never completed; I would have to take a closer look at the code to figure out why. It is possible that the round was about to finish when the job was qdel'ed, so the children technically finished it, but the main Segway process did not realise. Upon recovery, it then assumed that all files for the entire round had already been written when they had not. Again, I would need to look at the code to see when and where these files are actually written to know for certain.
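One defensive way to handle the scenario described above would be to check which params files actually exist before choosing a round, instead of relying on the log-likelihood count alone. The sketch below is illustrative and is not taken from Segway: the function `last_round_with_params` is hypothetical, the file name pattern is inferred from the single example `params.0.params.79`, and treating the `0` as an instance index is an assumption.

```python
from pathlib import Path


def last_round_with_params(params_dir, candidate_round, instance_index=0):
    """Walk back from candidate_round to the newest round whose params
    file exists in params_dir (hypothetical helper).

    Returns the round index, or None if no params file is found.
    """
    for round_index in range(candidate_round, -1, -1):
        # Assumed naming scheme, based on "params.0.params.79".
        filename = f"params.{instance_index}.params.{round_index}"
        if (Path(params_dir) / filename).is_file():
            return round_index
    return None
```

If round 79 was interrupted before its params file was written, calling this with `candidate_round=79` would fall back to round 78, provided that file is present.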

EricR86 commented 8 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).

changed priority from "major" to "minor"

This might occur only under rare circumstances. Let's leave this open but probably not prioritize it for now.