mar-file-system / erasureUtils

Erasure coding utilities intended for the marfs multicomponent DAL. These service the creation, retrieval, and maintenance of erasure coded data stripes spread across multiple files.

crash during server disruption #16

Closed jti-lanl closed 5 years ago

jti-lanl commented 7 years ago

Still digging into exactly what happened here. Basically, the servers were being hit hard (and/or disk writes to ZFS were performing poorly, forming a bottleneck) such that pftool encountered a bumpy patch with many calls to ne_open() failing. Somewhere in that stretch one of the bq_writer threads (serving ne_write()) got a segfault.

The BufferQueue struct (BQ) shows flags (BQ_ERROR | BQ_ABORT), and the BQ.buffers are all NULL (the proximate cause of the segfault), suggesting (I think) that this handle should already have been abandoned. Yet we are still calling bq_enqueue() on it.

[This is in the rmda_fs_impl branch, btw. I'd guess the problem is not particular to that branch.]

#0 0x00002af6172ff060 in ?? ()
#1
#2 0x00002af60e3a5504 in __memcpy_ssse3_back () from /lib64/libc.so.6
#3 0x000000000043b901 in bq_enqueue (size=1048572, buf=0x2af618261040, bq=0x2af7bbfa39c8) at erasure.c:699
#4 ne_write (handle=0x2af7bbf9fae0, buffer=0x2af618261040, nbytes=1048576) at erasure.c:1934
#5 0x000000000042f064 in mc_put (ctx=, buf=, size=) at fuse/src/dal.c:1180
#6 0x000000000042b0ab in marfs_write (path=, buf=, size=1048576, offset=, fh=0x91b338) at fuse/src/marfs_ops.c:3110
#7 0x0000000000418f50 in MARFS_Path::write (this=0x91b300, buf=0x2af618261040 "", count=1048576, offset=298499408918) at Path.h:2663
#8 0x00000000004168cc in copy_file (p_src=std::tr1::shared_ptr (count 2, weak 0) 0x8f9370, p_dest=std::tr1::shared_ptr (count 2, weak 0) 0x91b300, blocksize=1048576, rank=rank@entry=4, o=...) at pfutils.cpp:1019
#9 0x000000000041f282 in worker_copylist (rank=4, sending_rank=, base_path=0x7fff9b9b9660 "/dev/shm", dest_node=0x7fff9b9ba680, o=...) at pftool.cpp:3024
#10 0x0000000000421585 in worker (rank=4, o=...) at pftool.cpp:1504
#11 0x000000000040dd11 in main (argc=10, argv=0x7fff9b9c1ab8) at pftool.cpp:589
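
For anyone reading along: the failure mode described above (an enqueue reaching memcpy() on a queue whose error/abort flags are set and whose buffers are NULL) could in principle be caught by a defensive check at the top of the enqueue path. The following is only a rough sketch under that assumption. The struct layout, the SKETCH_* flags, and sketch_enqueue() are hypothetical stand-ins, not the real BufferQueue or bq_enqueue() from erasure.c.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; NOT the real BufferQueue from erasure.c. */
#define SKETCH_BQ_ERROR 0x1
#define SKETCH_BQ_ABORT 0x2
#define SKETCH_SLOTS    2

typedef struct {
    int    flags;                   /* SKETCH_BQ_ERROR / SKETCH_BQ_ABORT bits */
    size_t buf_size;                /* capacity of each slot */
    size_t offset;                  /* bytes already stored in slot `tail` */
    int    tail;                    /* index of the slot currently being filled */
    char  *buffers[SKETCH_SLOTS];   /* NULL once the queue has been torn down */
} sketch_bq;

/* Copy `size` bytes into the current slot.
 * Returns 0 on success, or -1 with errno set if the queue is unusable. */
int sketch_enqueue(sketch_bq *bq, const void *buf, size_t size) {
    if (bq == NULL || buf == NULL) {
        errno = EINVAL;
        return -1;
    }
    /* Refuse to touch a queue that has already been flagged as failed. */
    if (bq->flags & (SKETCH_BQ_ERROR | SKETCH_BQ_ABORT)) {
        errno = EIO;
        return -1;
    }
    /* A NULL slot means the buffers were released; never memcpy into it. */
    char *dst = bq->buffers[bq->tail];
    if (dst == NULL) {
        errno = EIO;
        return -1;
    }
    /* Reject writes that would overrun the slot. */
    if (bq->offset > bq->buf_size || size > bq->buf_size - bq->offset) {
        errno = ENOSPC;
        return -1;
    }
    memcpy(dst + bq->offset, buf, size);
    bq->offset += size;
    return 0;
}
```

A check like this would only turn the crash into an error return; it would not explain why a flagged handle was still in use in the first place.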

gransom commented 5 years ago

This reminds me of a couple of issues that I encountered when implementing threaded reads.

The first was that there was an oversight in ne_open() which would allow a ne_handle structure to be returned even after the handle had been declared 'unsafe' (too many threads failed to open their destination block) and all write threads had been killed. This was corrected by commit ec0fe79d79492ccb8a68f522b3d13e9eb177c0a1 (see the line commented with 'don't hand out a dead handle!'), and I strongly suspect was the source of the segfault detailed here.
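
To illustrate the shape of that first fix (this is not the code from the commit, just a minimal sketch): an open routine can count how many blocks failed to open and, once the count exceeds what the erasure scheme tolerates, tear everything down and return NULL instead of the handle. All names below (demo_handle, demo_open(), demo_destroy(), the DEMO_* flags) are hypothetical, and the real ne_open() manages threads rather than plain buffers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; NOT the erasureUtils API. */
#define DEMO_ERROR 0x1
#define DEMO_ABORT 0x2

typedef struct {
    int   flags;    /* DEMO_ERROR | DEMO_ABORT once the block is abandoned */
    void *buffer;   /* stays NULL for a block that failed to open */
} demo_block;

typedef struct {
    int         nblocks;
    int         max_failures;   /* failures the stripe can survive */
    demo_block *blocks;
} demo_handle;

/* Release everything owned by the handle, then the handle itself. */
void demo_destroy(demo_handle *h) {
    if (h == NULL) return;
    for (int i = 0; i < h->nblocks; i++) free(h->blocks[i].buffer);
    free(h->blocks);
    free(h);
}

/* open_ok[i] != 0 simulates block i opening successfully.
 * Returns a usable handle, or NULL if too many blocks failed. */
demo_handle *demo_open(int nblocks, int max_failures,
                       const int *open_ok, size_t bufsz) {
    demo_handle *h = calloc(1, sizeof(*h));
    if (h == NULL) return NULL;
    h->blocks = calloc((size_t)nblocks, sizeof(*h->blocks));
    if (h->blocks == NULL) { free(h); return NULL; }
    h->nblocks = nblocks;
    h->max_failures = max_failures;

    int failures = 0;
    for (int i = 0; i < nblocks; i++) {
        if (open_ok[i]) h->blocks[i].buffer = malloc(bufsz);
        if (h->blocks[i].buffer == NULL) {
            /* Either the simulated open or the allocation failed. */
            h->blocks[i].flags = DEMO_ERROR | DEMO_ABORT;
            failures++;
        }
    }

    if (failures > max_failures) {
        /* The handle is unsafe: clean up and refuse to hand it out. */
        demo_destroy(h);
        return NULL;
    }
    return h;
}
```

The point of the sketch is that the caller gets either a handle it can safely write through or NULL, and never a pointer to a handle whose workers are already gone.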

Another related issue, occurring during the same UNSAFE check of a write handle, was a potential race condition during thread cleanup which could result in threads accessing memory which had previously been freed, resulting in a segfault within bq_writer(). This was corrected by commit 83b811c800cb5fd7c60b73c433294c5a6a3d31c3. I 'think' this second issue was a possibility at the time this issue was created, but perhaps not. Regardless, I'm pretty sure the first bug was the culprit in the above case.
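
For the second issue, the general remedy for this class of race is an ordering rule: signal the worker to stop, join it, and only then free the memory it uses. The sketch below shows that ordering with plain pthreads. It is an assumption about the general pattern, not a copy of the fix in the commit above, and sketch_worker, sketch_worker_start(), and sketch_worker_stop() are hypothetical names.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names; a sketch of the ordering only, NOT the bq_writer() code. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    int             abort_flag;   /* set by the owner to request shutdown */
    int             wakeups;      /* written by the worker while it runs */
    pthread_t       thread;
} sketch_worker;

/* Worker loop: sleeps on the condition variable until asked to abort. */
static void *sketch_worker_main(void *arg) {
    sketch_worker *w = arg;
    pthread_mutex_lock(&w->lock);
    while (!w->abort_flag) {
        pthread_cond_wait(&w->wake, &w->lock);
        w->wakeups++;
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Allocate the shared state and start the worker; NULL on failure. */
sketch_worker *sketch_worker_start(void) {
    sketch_worker *w = calloc(1, sizeof(*w));
    if (w == NULL) return NULL;
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    if (pthread_create(&w->thread, NULL, sketch_worker_main, w) != 0) {
        pthread_cond_destroy(&w->wake);
        pthread_mutex_destroy(&w->lock);
        free(w);
        return NULL;
    }
    return w;
}

/* Stop the worker and release its state. Returns 0 on success. */
int sketch_worker_stop(sketch_worker *w) {
    /* 1. Ask the worker to exit. */
    pthread_mutex_lock(&w->lock);
    w->abort_flag = 1;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);

    /* 2. Wait until it has actually exited. */
    int rc = pthread_join(w->thread, NULL);
    if (rc != 0) return rc;   /* leak rather than free under a live thread */

    /* 3. Only now is it safe to destroy and free the shared state. */
    pthread_cond_destroy(&w->wake);
    pthread_mutex_destroy(&w->lock);
    free(w);
    return 0;
}
```

Freeing before step 2 completes is what lets a still-running thread dereference released memory, which matches the symptom described above.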

Closing this issue in expectation of the problem now being fixed.