hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Stale file handle error #128

Open EricR86 opened 6 years ago

EricR86 commented 6 years ago

Original report (BitBucket issue) by Mickaël Mendez (Bitbucket: Mickael Mendez).


While running Segway 2.0.2 in reverse mode, I ran into a "Stale file handle" error. Below are the command and the logs of the job that failed.

segway command

segway \
    --num-labels=10 \
    --resolution=2 \
    --ruler-scale=2 \
    --num-instances=10 \
    --reverse-world=1 \
    --max-train-rounds=50 \
    --seg-table=seg_table.tab \
    --minibatch-fraction=0.01 \
    --tracks-from=tracks.csv \
    --mem-usage=2,4,8,16 \
    train ...

Segway output

Traceback (most recent call last):
  File "~/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 181, in run
    new_deps = self._run_get_new_deps()
  File "~/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 119, in _run_get_new_deps
    task_gen = self.task.run()
  File "segway_workflow.py", line 141, in run
    segway_run.main(segway_cmd)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3787, in main
    return runner()
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3552, in __call__
    self.run(*args, **kwargs)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3519, in run
    self.run_train()
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 2947, in run_train
    instance_params = run_train_func(self.num_segs_range)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3030, in run_train_multithread
    to find the winning instance anyway.""" % thread.instance_index)
AttributeError: Training instance 0 failed. See previously printed error for reason.
Final params file will not be written. Rerun the instance or use segway-winner
to find the winning instance anyway.

EM training error

The source of the error seems to be the job: emt0.19.1233.train.637ed75e7b0f11e8975fbd311cda90ee

This job has two unexpected behaviors:

EricR86 commented 6 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


This is most likely due to configuration issues on the cluster rather than anything in Segway's code, and there is probably nothing we can do about it by changing Segway.

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


I believe there may be a race condition here in the way observation files are managed in minibatch mode. Currently, observation filenames are not instance-specific. It seems possible, for example, that two instances could simultaneously be attempting to delete or open the same observation file for reading/writing.

Section A.10 of the NFS FAQ also mentions ESTALE errors being reported when referring to items that may have been deleted. The tofile call is a convenience wrapper for an open/write operation, and I'd imagine the file could have been deleted by another instance, leaving a stale handle by the time the write happens. This could possibly occur on a non-NFS filesystem too, but it seems far more likely on NFS.
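
For illustration only, here is a minimal Python sketch of the instance-specific naming idea discussed here; the helper names (observation_path, write_observation) and the filename pattern are assumptions, not Segway's actual API:

import os
import uuid

import numpy as np


def observation_path(dirname, window_index, instance_index):
    # Hypothetical helper: build an observation filename that is unique per
    # training instance. Including the instance index (plus a random suffix)
    # keeps concurrent instances from deleting or overwriting each other's
    # files on a shared (e.g. NFS) filesystem.
    basename = "window%d.instance%d.%s.float32" % (
        window_index, instance_index, uuid.uuid4().hex)
    return os.path.join(dirname, basename)


def write_observation(dirname, window_index, instance_index, data):
    # ndarray.tofile() opens and writes in one call; writing to a
    # per-instance path sidesteps the shared-file race described above.
    filename = observation_path(dirname, window_index, instance_index)
    np.asarray(data, dtype=np.float32).tofile(filename)
    return filename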

EricR86 commented 6 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


  1. Don't set TMPDIR to a networked filesystem. This will be documented as a requirement. Since we only run on Linux, we could have segway-task print a warning if stat --file-system --format=%T "$TMPDIR" reports nfs (see the sketch after this list). But the user would only see that if they looked at the error files. Maybe it wouldn't be such a bad idea for the console to print any errors that occur as it marks a job complete, though this is all a bit complex.
  2. Eric will introduce a fix for this problem regardless, by ensuring that observation filenames are unique per instance.
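
As a rough sketch of the warning idea in item 1, assuming GNU coreutils stat on Linux as suggested above (the function name and warning text are illustrative only):

import os
import subprocess
import sys


def warn_if_tmpdir_on_nfs():
    # Hypothetical check: query the filesystem type of $TMPDIR with
    # GNU coreutils stat, as proposed above, and warn on stderr if it
    # looks like NFS ("nfs" or "nfs4").
    tmpdir = os.environ.get("TMPDIR", "/tmp")
    try:
        fs_type = subprocess.check_output(
            ["stat", "--file-system", "--format=%T", tmpdir]
        ).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        return  # stat missing or failed; skip the check silently

    if fs_type.startswith("nfs"):
        sys.stderr.write(
            "warning: TMPDIR (%s) is on an NFS filesystem (%s); "
            "stale file handle errors are possible\n" % (tmpdir, fs_type))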