mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.

Under certain failure conditions it is possible for marfs_write to fail with SIGSEGV when writing to a MC repo #176

Closed wfvining closed 7 years ago

wfvining commented 7 years ago

Moving this from mar-file-system/erasureUtils#4, as it now looks like a marfs problem caused by a double ne_close() when writing an object whose size is exactly equal to the repo chunksize.

Setup

I have been simulating "pod-level" failure conditions for marfs MC storage by creating and destroying symlinks that point to the same storage, for example:

/repo/pod0 -> /main-repo/pod0
/repo/pod1 -> /main-repo/pod0
/repo/pod2 -> /main-repo/pod0
...

With this setup, if I delete the /repo/pod1 symlink during a ne_write to an object stored at /repo/pod1/.../foo, the writes should continue to succeed; on close, when libne goes to rename the block files, the rename should fail with ENOENT since the symlink is gone. Sometimes, however, ne_write itself fails with SIGSEGV and crashes the calling program.

Error Description

There appears to be a race condition that leads to one of the ne_write calls failing with SIGSEGV if the symlink is deleted at exactly the right moment. In testing with pftool writing into marfs, the error occurs here:

#0  memcpy () at ../sysdeps/x86_64/memcpy.S:267
#1  0x00000000004468e5 in ne_write (handle=0x13e8d60, buffer=0x7ffe85087430, nbytes=2943) at erasure.c:965
#2  0x0000000000436aab in mc_put (ctx=0x13e3030, buf=<value optimized out>, size=<value optimized out>) at fuse/src/dal.c:1005
#3  0x000000000042e145 in write_recoveryinfo (os=<value optimized out>, info=<value optimized out>, fh=0x13e20b8) at fuse/src/common.c:2565
#4  0x0000000000432c03 in marfs_release (path=<value optimized out>, fh=0x13e20b8) at fuse/src/marfs_ops.c:2022
#5  0x0000000000422734 in MARFS_Path::close (this=0x13e2080) at Path.h:2465
...

The ne_handle was opened before the symlink was deleted (all file descriptors in handle->FDArray are non-negative). The segfault comes from the handle->buffer field, which took on an illegal value at some point between the last call to ne_write and the call from write_recoveryinfo:

buffer = 0x2b8a426ad040, buffs = {
    0x2b8a426ad040 <Address 0x2b8a426ad040 out of bounds>, 0x2b8a427ad03c <Address 0x2b8a427ad03c out of bounds>, 
    0x2b8a428ad038 <Address 0x2b8a428ad038 out of bounds>, 0x2b8a429ad034 <Address 0x2b8a429ad034 out of bounds>, 
    0x2b8a42aad030 <Address 0x2b8a42aad030 out of bounds>, 0x2b8a42bad02c <Address 0x2b8a42bad02c out of bounds>, 
    0x2b8a42cad028 <Address 0x2b8a42cad028 out of bounds>, 0x2b8a42dad024 <Address 0x2b8a42dad024 out of bounds>, 
    0x2b8a42ead020 <Address 0x2b8a42ead020 out of bounds>, 0x2b8a42fad01c <Address 0x2b8a42fad01c out of bounds>, 
    0x2b8a430ad018 <Address 0x2b8a430ad018 out of bounds>, 0x2b8a431ad018 <Address 0x2b8a431ad018 out of bounds>, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
  buff_rem = 0, buff_offset = 0, FDArray = {21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 0, 0, 0, 0, 0, 0, 0, 0}

It seems likely that this can occur at other calls to ne_write, not just the one from write_recoveryinfo; I just happened to trigger the race there.

Reproducer

Running a pftool write with four processes alongside a loop that deletes and recreates the symlinks, with a two-second sleep between deletion and recreation, appears to reliably trigger the race.

wfvining commented 7 years ago

By marking the object stream as closed (with OSF_ERRORS) if ne_close() fails, we can prevent pftool (and FUSE) from trying to close the handle again. I have this implemented on the issue-176 branch; it needs further testing before being merged.

wfvining commented 7 years ago

Resolved by commit ef845d06.