hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Stale file handle error #128

Open EricR86 opened 6 years ago

EricR86 commented 6 years ago

Original report (BitBucket issue) by Mickaël Mendez (Bitbucket: Mickael Mendez).


While running Segway 2.0.2 in reverse mode, I ran into a "Stale file handle" error. Below are the command and the logs of the job that failed.

segway command

segway \
    --num-labels=10 \
    --resolution=2 \
    --ruler-scale=2 \
    --num-instances=10 \
    --reverse-world=1 \
    --max-train-rounds=50 \
    --seg-table=seg_table.tab \
    --minibatch-fraction=0.01 \
    --tracks-from=tracks.csv \
    --mem-usage=2,4,8,16 \
    train ...

Segway output

Traceback (most recent call last):
  File "~/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 181, in run
    new_deps = self._run_get_new_deps()
  File "~/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 119, in _run_get_new_deps
    task_gen = self.task.run()
  File "segway_workflow.py", line 141, in run
    segway_run.main(segway_cmd)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3787, in main
    return runner()
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3552, in __call__
    self.run(*args, **kwargs)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3519, in run
    self.run_train()
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 2947, in run_train
    instance_params = run_train_func(self.num_segs_range)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3030, in run_train_multithread
    to find the winning instance anyway.""" % thread.instance_index)
AttributeError: Training instance 0 failed. See previously printed error for reason.
Final params file will not be written. Rerun the instance or use segway-winner
to find the winning instance anyway.

EM training error

The source of the error seems to be the job: emt0.19.1233.train.637ed75e7b0f11e8975fbd311cda90ee

This job has two unexpected behaviors:

EricR86 commented 6 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


This is most likely due to configuration issues on the cluster rather than anything in Segway's code, and there is probably nothing we can do about it by changing Segway.

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


I believe there may be a race condition here in the way observation files are managed in minibatch mode. Currently, observation filenames are not instance-specific. It seems possible, for example, that two instances could simultaneously be attempting to delete or open the same observation file for reading/writing.

Section A.10 of the NFS FAQ also mentions ESTALE errors being reported when referring to items that may have been deleted. The tofile call is a convenience wrapper for an open/write operation, and I'd imagine the file could have been deleted by another instance, leaving a stale handle by the time the write happens. This could possibly occur on a non-NFS filesystem too, but it seems far more likely on NFS.
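
For illustration only, here is a minimal Python sketch of the instance-specific naming idea discussed here; the helper names (observation_path, write_observation) and the filename pattern are assumptions, not Segway's actual API:

import os
import uuid

import numpy as np


def observation_path(dirname, window_index, instance_index):
    # Hypothetical helper: build an observation filename that is unique per
    # training instance. Including the instance index (plus a random suffix)
    # keeps concurrent instances from deleting or overwriting each other's
    # files on a shared (e.g. NFS) filesystem.
    basename = "window%d.instance%d.%s.float32" % (
        window_index, instance_index, uuid.uuid4().hex)
    return os.path.join(dirname, basename)


def write_observation(dirname, window_index, instance_index, data):
    # ndarray.tofile() opens and writes in one call; writing to a
    # per-instance path sidesteps the shared-file race described above.
    filename = observation_path(dirname, window_index, instance_index)
    np.asarray(data, dtype=np.float32).tofile(filename)
    return filename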

EricR86 commented 6 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


  1. Don't set TMPDIR to a networked filesystem. This will be documented as a requirement. Since we only run on Linux, we could have segway-task print a warning if stat --file-system --format=%T "$TMPDIR" reports nfs (see the sketch after this list). But the user would only see that if they looked at the error files. Maybe it wouldn't be such a bad idea for the console to print any errors that occur as it marks a job complete, though this is all a bit complex.
  2. Eric will introduce a fix for this problem regardless, by ensuring that observation filenames are unique per instance.
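
As a rough sketch of the warning idea in item 1, assuming GNU coreutils stat on Linux as suggested above (the function name and warning text are illustrative only):

import os
import subprocess
import sys


def warn_if_tmpdir_on_nfs():
    # Hypothetical check: query the filesystem type of $TMPDIR with
    # GNU coreutils stat, as proposed above, and warn on stderr if it
    # looks like NFS ("nfs" or "nfs4").
    tmpdir = os.environ.get("TMPDIR", "/tmp")
    try:
        fs_type = subprocess.check_output(
            ["stat", "--file-system", "--format=%T", tmpdir]
        ).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        return  # stat missing or failed; skip the check silently

    if fs_type.startswith("nfs"):
        sys.stderr.write(
            "warning: TMPDIR (%s) is on an NFS filesystem (%s); "
            "stale file handle errors are possible\n" % (tmpdir, fs_type))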