Open EricR86 opened 8 years ago
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
My current theory is that this error comes from run_train_multithread:
    #!python
    with Session() as session:
        try:
            for instance_index, instance_features in enumerator:
                ...
                thread = TrainThread(self, session, instance_index,
                                     num_seg)
                thread.start()
                threads.append(thread)
                ...
        except KeyboardInterrupt:
            ...
            for thread in threads:
                thread.join()
            raise
It seems that one of the instances/threads hit an error where it couldn't find a file, and the resulting exception was not caught. That would exit the context manager, which essentially invalidates the shared session object, and skip the graceful joining of the remaining threads.
Is there a reason why each thread doesn't contain its own session object? Is that unsafe, perhaps?
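To illustrate the theory, here is a minimal sketch (Python 3) of joining the threads for any exception, not just KeyboardInterrupt, before the context manager exits. FakeSession, worker, and run_all are hypothetical stand-ins made up for this example; this is not the real Session class or segway's code, only the shape of the problem.

```python
import threading
import time


class FakeSession:
    """Hypothetical stand-in for the shared session; not the real session API."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the with block invalidates the session for every thread.
        self.closed = True
        return False  # do not swallow the exception


def worker(session, results, index):
    time.sleep(0.05)  # simulate work that outlives the spawning loop
    # Record whether the session was still usable when the worker needed it.
    results[index] = not session.closed


def run_all(num_instances, fail_at=None):
    results = {}
    threads = []
    try:
        with FakeSession() as session:
            try:
                for index in range(num_instances):
                    if index == fail_at:
                        # Simulates e.g. a missing file during recovery.
                        raise FileNotFoundError("missing file for instance %d" % index)
                    thread = threading.Thread(target=worker,
                                              args=(session, results, index))
                    thread.start()
                    threads.append(thread)
            finally:
                # Runs for any exception and before __exit__ closes the session.
                for thread in threads:
                    thread.join()
    except FileNotFoundError:
        pass  # the sketch only cares about what the workers observed
    return results
```

With the join in a finally block, the threads that were already started finish while the session is still open, even when a later instance fails. Whether that is the right fix here depends on how the real threads use the session.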
It looks like the error is caused by a failed recover when trying to open a file that doesn't exist.
Original report (BitBucket issue) by Rachel Chan (Bitbucket: rcwchan).
I received the following error while trying to run segway on recover (note: this happened in every thread):
Also received this error:
I checked, and params.0.params.79 did not originally exist in partial.traindir/. The initial job was killed using qdel (since it was qsub'd) and was probably in the midst of round 79 or so at the time.
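For anyone wanting to check which rounds actually left a params file behind, here is a rough sketch. It assumes the naming pattern params.<instance>.params.<round> seen in the file name above; last_existing_round is a hypothetical helper written for this example and is not part of segway, nor a description of how its recovery works.

```python
import os
import re

# Assumed pattern, inferred from the file name in the report:
# params.<instance>.params.<round>
PARAMS_RE = re.compile(r"^params\.(\d+)\.params\.(\d+)$")


def last_existing_round(dirname, instance_index):
    """Return the highest round with a params file for this instance, or None."""
    rounds = []
    for filename in os.listdir(dirname):
        match = PARAMS_RE.match(filename)
        if match and int(match.group(1)) == instance_index:
            rounds.append(int(match.group(2)))
    return max(rounds) if rounds else None
```

Pass it whichever directory holds the params files and the instance index, e.g. last_existing_round(some_dir, 0).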