mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.

Creating multi-files immediately after 200 uni-files leads to some uni files being invalid #163

Closed atorrez closed 7 years ago

atorrez commented 8 years ago

In testing the DAL implementation in garbage collection, I have seen, in two consecutive move-to-trash operations (rm via the fuse mount), that files are missing from the gpfs trash directory tree. I know this because I run quotas before and after moving to trash. I was able to identify the missing files by looking at the garbage collection log (from running marfs_gc with the no-delete option, -n). I then searched for the files with a find command, with no luck.

wfvining commented 8 years ago

Alfred, what kind of file were you trying to delete (uni/multi/packed)?

atorrez commented 8 years ago

I was deleting a directory that had uni, multi, and packed files. The first time I lost 3 uni files when moving to trash. The second time I lost 1 uni file.


jti-lanl commented 8 years ago

As I understood it, the reproducer is something like this:

$ touch foo
$ for i in $(seq 0 99); do echo test > /marfs/ns/file$i; done
$ rm -rf /marfs/ns/file*
$ find /path/to/trash -newer foo -type f | wc -l

The find might return fewer than 200 files (2 entries for each file).

Tried it just now and it returned 200, but Alfred has seen it fail.


wfvining commented 8 years ago

My initial thought is that if you created a uni file through fuse and deleted it immediately this behavior could arise. In that case, because FUSE calls release() asynchronously, trash_unlink() may not see the pre xattr (since it is not written until marfs_release()) and will just unlink the file without moving it to the trash. That said, I can't seem to cause this race by methods similar to the ones Jeff described above. I am running a bigger test now, but it will take some time to complete.

Given that the original problem appeared when deleting the files after running quotas, it seems extra unlikely that this is caused by the asynchronous release.

wfvining commented 8 years ago

This may not be the same bug as above, but I have found a reliable reproducer for the race on fuse_release() (see issue #140) which leads to the same result: files that should be in the trash are missing.

for i in `seq 0 1999`
do
    echo "this is uni number $i" >/marfs/ns/wfv.uni.1
done

This should lead to 1999 trash files, as each time wfv.uni.1 is truncated the old version is moved to the trash. When I check the trash, however, there are far fewer files:

$ find /path/to/trash/ -name "wfv.uni.1*" -print | wc -l
474

which corresponds to 237 files in the trash.
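The arithmetic can be sketched as follows (a hedged check; the two-entries-per-trashed-file layout is an assumption based on the "2 for each file" note earlier in the thread):

```shell
# Hedged sanity check: assuming the trash holds two entries per trashed
# file, halving the find count gives the number of trashed versions.
entries=474      # the find | wc -l result above
expected=1999    # one trashed version per truncating write after the first
actual=$((entries / 2))
echo "trashed: $actual of $expected"
```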

brettkettering commented 8 years ago

Let’s have George add this test to the suite executed by Jenkins.

Thanks, Brett


wfvining commented 8 years ago

Alfred and I spent some time trying to break down what is going on. Under my user with my environment and as root on the node I typically run on (02) we still cannot reproduce the issue. Under Alfred's environment and on a different node (01) we can recreate the bug reliably as follows:

# ./make_uni_multi /marfs/ns/dir/d1

which creates 200 uni files and two multi files.

After the script is done, 3-5 of the unis have a restart xattr (even several minutes after the script completes, which seems to indicate that this is unrelated to the asynchronous behavior of fuse_release()).

Because they have the restart xattr (and have apparently never been marfs_release()d), they are simply unlinked instead of being moved to the trash. This is apparently not a bug in the trash code; the trash problems are just a side effect.

If we remove the part of the script that creates multis, then all the uni files are created successfully.

We have found that attempting to cat one of the multis will cause fuse to crash.

# cat ./mul.001 >/dev/null
cat: mul.001: Software caused connection abort

After remounting fuse, we can delete the files, but they are in a peculiar state that causes a run of the garbage collector to segfault. If we delete the files without attempting to read them first (and thus crashing fuse) then GC works correctly.

wfvining commented 8 years ago

Adding a ten second sleep between creation of uni files and multi files prevents the error from occurring. Of the three FTAs this is reproducible on 01 and 03. I have not successfully reproduced it on 02.

The error arose initially in the gc-dal branch, but appears to be present in master as well.

wfvining commented 8 years ago

I cannot reproduce this by doing copies with PFTool that attempt to simulate the work load. It appears that this issue only arises when writing via fuse.

wfvining commented 8 years ago

I have successfully reproduced the issue on fta02. It seems not to be dependent on the node after all.

wfvining commented 8 years ago

This appears to be caused by an internal server error in the object store. Logs indicate that marfs_release() is called for the invalid files, but an error occurs which is ignored (since fuse ignores errors in release()).

< HTTP/1.1 500 Internal Server Error
< Date: Thu, 22 Sep 2016 20:50:04 GMT
< Server: Apache/2.2.26 (Red Hat Enterprise Web Server)
< X-Scal-Ring-Key: 61179C8B09B5932929EC2A5CF905FAC19F10B130
< X-Scal-Ring-Status: SCAL_RING_FAILURE
< Authentication-Info: [...]
< Content-Length: 180
< Connection: close
< Content-Type: text/html; charset=UTF-8
< 
* Closing connection #0
DBG: Return Code: 0 
DBG: aws_curl_exit: 'aws4c.c', line 2270

obj_stream dc200700 object_stream.c:325    s3_op_internal        | #ERR CURL ERROR: 2b4c9c03e4a0 500 '500 Internal Server Error'
obj_stream dc200700 object_stream.c:337    s3_op                 | #ERR failed (bparc/ver.001_004/ns.admins/F___/inode.0000184529/md_ctime.20160922_144852-0600_1/obj_ctime.20160922_144852-0600_1/unq.0/chnksz.40000000/chnkno.0) -1 (Unknown error) '500 Internal Server Error'
obj_stream c8401700 object_stream.c:217    stream_wait           | pthread_timedjoin_np returned 0 (errno=0)
obj_stream c8401700 object_stream.c:218    stream_wait           | thread retval (via os->op_rc) -1
obj_stream c8401700 object_stream.c:1062   stream_sync           | resetting connections in curl-handle
obj_stream c8401700 object_stream.c:1075   stream_sync           | op-thread returned -1
obj_stream c8401700 object_stream.c:1189   stream_close          | entry
obj_stream c8401700 object_stream.c:1229   stream_close          | done (returning -1)

It appears that the failure occurs in stream_sync(). The sequence of events from the perspective of the fuse thread doing the release() looks like this (there is overlap between this log snippet and the one above):

marfs_fuse.dbg c8401700 main.c:555             fuse_release          | -> fuse_release
marfs_fuse.dbg c8401700 main.c:555             fuse_release          | TRY0: push_user4(&ctx, fuse_get_context()->uid, fuse_get_context()->gid, (0))
marfs_fuse.dbg c8401700 push_user.c:231        push_user4            | gid 0(0) -> (0)
marfs_fuse.dbg c8401700 push_user.c:255        push_user4            | uid 0(0) -> (0)
marfs_fuse.dbg c8401700 marfs_ops.c:2000       marfs_release         | 
marfs_fuse.dbg c8401700 marfs_ops.c:2000       marfs_release         | -> marfs_release
marfs_fuse.dbg c8401700 marfs_ops.c:2001       marfs_release         | Releasing /admins/wfvining/test02/d1/uni.155
marfs_fuse.dbg c8401700 marfs_ops.c:2042       marfs_release         | TRY_GE0(write_recoveryinfo(os, info, fh))
marfs_fuse.dbg c8401700 common.c:2291          write_recoveryinfo    | TRY0(pre_2_str(&rec[idx], remain, &info->pre))
marfs_fuse.dbg c8401700 marfs_base.c:161       epoch_to_str          |  epoch_to_str epoch:            0000000057e443b4
marfs_fuse.dbg c8401700 marfs_base.c:161       epoch_to_str          |  epoch_to_str epoch:            0000000057e443b4
marfs_fuse.dbg c8401700 common.c:2302          write_recoveryinfo    | TRY0(post_2_str(&rec[idx], remain, &info->post, info->pre.repo, 0))
marfs_fuse.dbg c8401700 common.c:2383          write_recoveryinfo    | TRY_GE0(DAL_OP(put, fh, rec, MARFS_REC_UNI_SIZE))
obj_stream c8401700 object_stream.c:425    stream_put            | (2b4c9c03e4a0) entry
obj_stream c8401700 object_stream.c:475    stream_put            | (2b4c9c03e4a0) installed buffer (2943 bytes) for readfn
obj_stream c8401700 object_stream.c:482    stream_put            | (2b4c9c03e4a0) waiting 20s for IOBuf
obj_stream c8401700 object_stream.c:488    stream_put            | (2b4c9c03e4a0) buffer done
obj_stream c8401700 object_stream.c:1006   stream_sync           | (wr) sending empty buffer (flags=0x0003)
obj_stream c8401700 object_stream.c:425    stream_put            | (2b4c9c03e4a0) entry
obj_stream c8401700 object_stream.c:475    stream_put            | (2b4c9c03e4a0) installed buffer (0 bytes) for readfn
obj_stream c8401700 object_stream.c:482    stream_put            | (2b4c9c03e4a0) waiting 20s for IOBuf
obj_stream c8401700 object_stream.c:488    stream_put            | (2b4c9c03e4a0) buffer done
obj_stream c8401700 object_stream.c:1045   stream_sync           | waiting for op-thread
obj_stream c8401700 object_stream.c:217    stream_wait           | pthread_timedjoin_np returned 0 (errno=0)
obj_stream c8401700 object_stream.c:218    stream_wait           | thread retval (via os->op_rc) -1
obj_stream c8401700 object_stream.c:1062   stream_sync           | resetting connections in curl-handle
obj_stream c8401700 object_stream.c:1075   stream_sync           | op-thread returned -1
obj_stream c8401700 object_stream.c:1189   stream_close          | entry
obj_stream c8401700 object_stream.c:1229   stream_close          | done (returning -1)
marfs_fuse.dbg c8401700 marfs_ops.c:2106       marfs_release         | <- marfs_release
marfs_fuse.dbg c8401700 main.c:567             fuse_release          | TRY0: pop_user4(&ctx)
marfs_fuse.dbg c8401700 push_user.c:317        pop_user4             | uid (0) <- 0(0)
marfs_fuse.dbg c8401700 push_user.c:342        pop_user4             | gid (0) <- 0(0)
marfs_fuse.dbg c8401700 main.c:567             fuse_release          | <- fuse_release

Need to investigate why the internal server error occurred and what we should do to cope with such an error. There is no way to force fuse to return an error to whatever process closed the file in this case because release() is called asynchronously, and its return value is ignored. The obvious solution would be to move the stream_sync() call into marfs_flush(), but (as documented in issue #140) it is not possible to know that any given call to marfs_flush() will be the last call.

wfvining commented 8 years ago

Chris Hoffman reports that the RING for the ftas is healthy, so it seems that this is not just something that happens when the object store is in a bad state. I will work on moving the functionality from marfs_release() into marfs_flush() to at least see if that fixes the problem. If it does then we can work on a solution to the issue of dup()ed fds.

wfvining commented 8 years ago

I have implemented a fix in the issue-163 branch. Needs more testing and work to make it performant. Also currently assumes we never have duped file descriptors in fuse (for writing).

wfvining commented 8 years ago

Now fails test number 200 in Jenkins (creating symlinks in fuse). It appears that all other tests pass.

wfvining commented 8 years ago

Oops. We can't avoid dup'd fds, even for the time being, because that is how shells set up input/output redirections (duh). I still believe this fixes the issue, but it currently means that the only thing ever written to the files is recovery info when one of the dup'd fds is closed before the shell calls exec() (this behavior might depend on the implementation of the shell, but we can expect it to be roughly the same across all shells).
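The redirection behavior described above can be illustrated with plain fd manipulation (a generic sketch, not marfs-specific): duplicating a descriptor means the kernel sees two fds for the same open file, and a FUSE filesystem receives a flush() for each close.

```shell
# Two shell fds referring to the same open file: on a FUSE mount, closing
# each one triggers a separate flush(), even though only one write happened.
f=$(mktemp)
exec 3>"$f"      # open the file on fd 3
exec 4>&3        # dup it: fd 4 shares the same open file description
echo data >&3
exec 3>&-        # first close -> a flush() with data already written
exec 4>&-        # second close -> another flush() on the same file
cat "$f"
rm -f "$f"
```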

Perhaps the easiest way to address this would be to ignore calls to marfs_flush() if nothing has been written on the object stream.

Alternatively, we can do what we have already discussed and defer the opening of the OS until write() is actually called. That would mean we can call marfs_flush() all we want, up until there is a write, and the flush will effectively be a no-op (this might require some minor changes to marfs_flush()).

Neither of these solutions deals with fork(), where one process might close the fd after the other has done some writes. The only solution I can think of for this is to have marfs_write() record the pid (get_fuse_context()->pid) of the calling process and only honor the flush() that is also from that process. For all other processes marfs_flush() becomes a no-op. Unfortunately this means we are restricting users of marfs to only one writer per file through fuse (we can still allow N:1 writes via pftool, as the release/flush follows a different code path in that case).
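The fork() case can likewise be sketched generically (not marfs-specific): a child inherits the open descriptor, so a flush() can arrive from a pid other than the one that did the writes.

```shell
# A forked child inherits fd 3; its exit closes its copy of the fd,
# which on a FUSE mount triggers a flush() from the child's pid even
# though the parent writes afterward and closes last.
f=$(mktemp)
exec 3>"$f"
( echo "child line" >&3 )   # subshell = fork; closes its fd copy on exit
echo "parent line" >&3
exec 3>&-                   # the parent's close is the final one
cat "$f"
rm -f "$f"
```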

wfvining commented 8 years ago

I have added deferred opens to the OBJ DAL and moved the functionality from marfs_release() to marfs_flush(). This new code passes our integration tests and fixes the issue described here.

Still need to address the problem with fork() described above. A similar problem arises for programs that use dup'd fds where both fds might be closed after one or more write()s to either descriptor.

Putting all of the functionality from release into flush also leads to a pretty severe increase in the latency of close() because we have to wait for the TCP connection to the object store to be completely torn down which takes on the order of one second. I will investigate making the connection teardown code part of the ->destroy() DAL method and leaving the call to destroy until marfs_release(). Then we still guarantee that the connection is cleaned up, but it will happen asynchronously.

wfvining commented 7 years ago

I am making a different attempt to address this bug now that the lazy-open branch implements deferred opens for all MarFS DALs. The issue-163 branch will be abandoned in favor of `flush`.

I have the solution implemented, but need further integration testing.

brettkettering commented 7 years ago

Testing this has been lower priority while trying to run down the MC performance issues (e.g. ZFS context switching bug).

brettkettering commented 7 years ago

Will:

What's our readiness to ensure this is working as we expect and close this issue?

Thanks, Brett

wfvining commented 7 years ago

We now have multiple successful runs of the full test suite against the flush branch. I will make sure everything is ready to merge and hopefully can close the issue today.

jti-lanl commented 7 years ago

One of the earlier messages mentions fuse crashes. Does that still happen sometimes?

Thanks.


wfvining commented 7 years ago

The fuse crashes were a symptom of the original bug. They have also been resolved.