Closed wfvining closed 7 years ago
By marking the object stream as closed and with OSF_ERRORS if ne_close() fails, we can prevent pftool (and FUSE) from trying again to close the handle. I have this implemented in issue-176. Needs further testing before being merged.
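The idea of marking the stream closed on a failed close can be sketched as follows. This is only an illustration, not the code in issue-176: the `ObjectStream` struct, the `OSF_CLOSED` flag, the flag values, `stream_close()`, and the `fake_ne_close()` stub are all hypothetical stand-ins; only the name `OSF_ERRORS` and the call `ne_close()` come from the discussion above.

```c
#include <stddef.h>

/* Hypothetical flag values; only the name OSF_ERRORS appears in the issue. */
enum { OSF_CLOSED = 1 << 0, OSF_ERRORS = 1 << 1 };

/* Hypothetical stand-in for the marfs object stream. */
typedef struct {
    int   flags;
    void *handle;   /* stand-in for the ne_handle */
} ObjectStream;

/* Counts how often the underlying close is attempted. */
static int close_calls = 0;

/* Stub standing in for ne_close(): always fails, to model the lost pod. */
static int fake_ne_close(void *handle) {
    (void)handle;
    close_calls++;
    return -1;
}

/* Close the stream at most once. A failed close still marks the stream
 * closed (plus OSF_ERRORS), so a retry never reaches the handle again. */
int stream_close(ObjectStream *os) {
    if (os->flags & OSF_CLOSED)
        return (os->flags & OSF_ERRORS) ? -1 : 0;

    int rc = fake_ne_close(os->handle);
    os->handle = NULL;            /* assume the handle is unusable after close */
    os->flags |= OSF_CLOSED;
    if (rc != 0)
        os->flags |= OSF_ERRORS;
    return (rc != 0) ? -1 : 0;
}
```

With this guard, a second `stream_close()` on the same stream reports the earlier failure without calling the underlying close a second time.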
Resolved by ef845d06
Moving this from mar-file-system/erasureUtils#4, as it looks like a marfs problem caused by a double ne_close() when writing an object with exactly the same size as the repo chunksize.

Setup
I have been simulating "pod-level" failure conditions for marfs MC storage by creating and destroying symlinks that point to the same storage, for example:
With this setup, if I delete the /repo/pod1 symlink during an ne_write to an object stored at /repo/pod1/.../foo, the writes should continue to succeed, but on close, when libne goes to rename the block files, the rename should fail with ENOENT since the symlink is gone. However, ne_write sometimes fails with SIGSEGV and crashes the calling program.

Error Description
There appears to be a race condition that leads to one of the ne_write calls failing with SIGSEGV if the symlink is deleted at exactly the right moment. In testing with pftool writing into marfs, the error occurs here:

The ne_handle was opened before the symlink was deleted (all file descriptors in handle->FDArray are non-negative). The segfault comes from the handle->buffer field, which has taken on an illegal value at some point between the last call to ne_write and the call from write_recovery_info:

It seems likely that this can occur at other calls to ne_write, not just the one from write_recovery_info. I just happened to trigger the race at the call from write_recovery_info.

Reproducer
Running a pftool write using four processes, alongside a loop that deletes and recreates the symlinks (with a two-second sleep between deletion and recreation), appears to reliably trigger the race.
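A minimal sketch of such a delete-and-recreate loop is shown below. It is not the script used in testing: the function names, the `up_seconds` pause between iterations, and the iteration count are assumptions made for illustration; only the two-second gap between deletion and recreation comes from the description above.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Remove the symlink at link_path, leave it missing for down_seconds,
 * then recreate it pointing at target. Returns 0 on success, -1 on error. */
int toggle_symlink(const char *target, const char *link_path,
                   unsigned down_seconds) {
    /* A link that is already missing is not treated as an error. */
    if (unlink(link_path) != 0 && errno != ENOENT) {
        perror("unlink");
        return -1;
    }
    sleep(down_seconds);   /* window in which the "pod" is unavailable */
    if (symlink(target, link_path) != 0) {
        perror("symlink");
        return -1;
    }
    return 0;
}

/* Repeat the toggle, pausing up_seconds (an assumed parameter) between
 * rounds so that writers get some time with the link in place. */
int toggle_loop(const char *target, const char *link_path,
                unsigned down_seconds, unsigned up_seconds, int iterations) {
    for (int i = 0; i < iterations; i++) {
        if (toggle_symlink(target, link_path, down_seconds) != 0)
            return -1;
        sleep(up_seconds);
    }
    return 0;
}
```

To approximate the reproducer, `toggle_loop()` would be called with `down_seconds` set to 2 and `link_path` set to the pod symlink (for example /repo/pod1) while the pftool write is running; the target path and the remaining parameters depend on the local setup.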