hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0
13 stars 7 forks source link

Jobs.tab occassionally missing lines - line Buffering for jobs.tab #93

Open EricR86 opened 7 years ago

EricR86 commented 7 years ago

Original report (BitBucket issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Currently jobs.tab is opened with a system default buffering scheme. On most systems this means that the file is fully buffered. In segway there are even calls to "flush" in hopes to force a write out of the current submitted job (in cluster/init.py). "flush" does not guarentee a write out and a os.fsync(f.fileno()) is necessary to guarentee it on synchronous filesystems.

Since each job is a tab-seperated line it would make sense if the jobs.tab file was opened with line buffering enabled. This might also help alleviate issues where sometimes there are missing newlines in the jobs.tab.

Another note that the same file handle is shared between all threads. Regular os.writes are not thread safe as they all advance the file pointer. There is potential cases where the jobs.tab has it's file pointer advancing by newer jobs and an older job still has it's file pointer to a previous location potentially overwriting a line.

Another potential solution is to have a separate jobs.tab per instance. This would isolate each jobs.tab on a per thread basis.

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


EricR86 commented 7 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


If even line-buffering won't make this thread-safe it sounds like having a separate file for each instance would be the simplest solution.

EricR86 commented 7 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


It sounds like you are proposing breaking up the most useful intermediate file in a segway run into multiple (possibly hundreds of?) individual files? Maybe the simplest solution isn't the best one in this case ...

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


@jayhesselberth if you're running hundreds of instances I'd imagine the probability of having race conditions in the jobs.tab is significantly worse. In this case, the jobs.tab produced has a much higher chance of just being incorrect and I'd imagine would have a much worse case for missing lines.

I'm totally up for suggestions. I don't currently see a really good solution to this.

EricR86 commented 7 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


maybe you can merge individiual files into a single jobs.tab after each bundle operation? At least that would keep them together in one file.

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


That could work. How would you merge them? Or would you just append the instances in order?

EricR86 commented 7 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


I'd probably just create individual files for each instance in the current training round, then append the instances in order after the bundle completes (and delete the individual training files).

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


@jayhesselberth just to be clear, you don't like the idea of having a separate jobs.tab per instance (and not per submitted job) correct? It'll be tied to the number --num-instances. Practically in our runs this number is usually around 10 so we don't see as an issue to go from 1 to 10 files.

EricR86 commented 7 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


I suppose it makes sense to have 1 jobs.tab file per instance. Thanks for figuring this out.