Open EricR86 opened 8 years ago
Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).
You qdel'ed the Segway runner rather than the sub-jobs?
How is log_likelihood written for a round that never completed?
Original comment by Rachel Chan (Bitbucket: rcwchan).
We weren't sure which of the many child processes belonged to which Segway run (I had 4 going at the same time), so we qdel'ed the parent and let the children finish. As far as I know, there isn't a way to send a SIGTERM specifically to a qsub'd job in SGE; it has to be qdel (the documentation is unclear on whether that sends a SIGTERM or a SIGKILL).
I am not sure how the log_likelihoods file could be written for a round that never completed, and I would have to take a closer look at the code to figure out why. It is possible the round was about to finish when the job was qdel'ed, so the children technically finished it off without the main Segway process realising. Upon recovery, Segway may then have assumed that all files for the entire round had already been written when they had not. But again, I would have to look at the code to see when and where these files are actually written to know better.
Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).
This might occur only under rare circumstances. Let's leave this open but probably not prioritize it for now.
Original report (BitBucket issue) by Rachel Chan (Bitbucket: rcwchan).
After qdel'ing a Segway job and then recovering it, I found that Segway attempted to recover from the last round run rather than the one before it. As a result, it looked for params files that did not yet exist. For instance, my job was qdel'd at round 79, and Segway attempted to recover directly from round 79 when it should have recovered from round 78. It was looking for params.0.params.79 in the train directory it was recovering from, but that file had not been written before the qdel.
I experimentally changed final_round_index in recover_train_instance in run.py to be len(log_likelihoods)-1 and tried again, and it recovered fine after that.
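
For illustration only, here is a rough sketch of a more defensive way to pick the round to recover from. It is not Segway's actual code: the helper name last_recoverable_round and its arguments are hypothetical, and it assumes params files are named params.<instance>.params.<round> as in the example above. Rather than trusting the number of log likelihood entries, it walks backwards until it finds a params file that actually exists.

```python
import os


def last_recoverable_round(train_dir, num_log_likelihoods, instance_index=0):
    """Return the highest round index whose params file exists.

    Hypothetical helper, not part of Segway. Starts at
    num_log_likelihoods and walks down to 0, returning the first round
    for which params.<instance_index>.params.<round> is present in
    train_dir. Returns None if no such file is found.
    """
    for round_index in range(num_log_likelihoods, -1, -1):
        # Assumed naming scheme, taken from the file mentioned in the report
        filename = "params.%d.params.%d" % (instance_index, round_index)
        if os.path.exists(os.path.join(train_dir, filename)):
            return round_index
    return None
```

Under these assumptions, a train directory that contains params.0.params.78 but not params.0.params.79, with 79 log likelihood entries, would give 78, which matches the result of the len(log_likelihoods)-1 change in this case.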