Closed jti-lanl closed 5 years ago
This reminds me of a couple of issues that I encountered when implementing threaded reads.
The first was an oversight in ne_open() that allowed a ne_handle structure to be returned even after the handle had been declared 'unsafe' (too many threads had failed to open their destination blocks) and all write threads had been killed. This was corrected by commit ec0fe79d79492ccb8a68f522b3d13e9eb177c0a1 (see the line commented with 'don't hand out a dead handle!'), and I strongly suspect it was the source of the segfault detailed here.
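For illustration, the "don't hand out a dead handle" pattern can be sketched roughly as below. This is not the libne code: `demo_handle`, `demo_open()`, and the `block_ok` array are hypothetical stand-ins, and the real ne_open() also has to stop and join its worker threads before giving up.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for ne_handle; the real struct has many more fields. */
typedef struct {
    int n_blocks;    /* total number of blocks the handle writes to */
    int max_errors;  /* how many failed blocks can be tolerated */
    int n_errors;    /* blocks that failed to open */
} demo_handle;

/* Hypothetical open routine: block_ok[i] != 0 means block i opened cleanly.
 * Returns NULL rather than a handle that has already been declared unsafe. */
demo_handle *demo_open(const int *block_ok, int n_blocks, int max_errors)
{
    assert(block_ok != NULL);

    demo_handle *h = calloc(1, sizeof *h);
    if (h == NULL)
        return NULL;

    h->n_blocks = n_blocks;
    h->max_errors = max_errors;
    for (int i = 0; i < n_blocks; i++) {
        if (!block_ok[i])
            h->n_errors++;
    }

    if (h->n_errors > h->max_errors) {
        /* Unsafe: real code would stop and join its worker threads here. */
        free(h);
        return NULL; /* don't hand out a dead handle */
    }
    return h;
}
```

The caller then only needs a NULL check; there is no half-alive handle left to pass on to a later write call.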
Another related issue, occurring during the same UNSAFE check of a write handle, was a potential race condition during thread cleanup that could let threads access memory that had already been freed, resulting in a segfault within bq_writer(). This was corrected by commit 83b811c800cb5fd7c60b73c433294c5a6a3d31c3. I 'think' this second issue was a possibility at the time this issue was created, but perhaps not. Regardless, I'm pretty sure the first bug was the culprit in the above case.
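The general shape of a fix for this kind of cleanup race is to signal the workers, join every one of them, and only then free what they touch. The sketch below shows that ordering with plain pthreads; `demo_state`, `demo_worker()`, and `demo_teardown()` are hypothetical names, and this is not necessarily how the commit above does it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define DEMO_THREADS 4

/* Hypothetical shared state; stands in for a handle and its buffer queues. */
typedef struct {
    pthread_mutex_t lock;
    int abort_flag;         /* set by the owner to ask workers to exit */
    unsigned long *shared;  /* heap memory the workers touch */
} demo_state;

static void *demo_worker(void *arg)
{
    demo_state *s = arg;
    int done = 0;
    while (!done) {
        pthread_mutex_lock(&s->lock);
        if (s->abort_flag)
            done = 1;        /* owner asked us to stop */
        else
            s->shared[0]++;  /* only touched while not aborted */
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/* Returns 0 if all workers were started, stopped, and joined before the free. */
int demo_teardown(void)
{
    demo_state s;
    pthread_t tid[DEMO_THREADS];
    int started = 0;

    if (pthread_mutex_init(&s.lock, NULL) != 0)
        return -1;
    s.abort_flag = 0;
    s.shared = calloc(1, sizeof *s.shared);
    if (s.shared == NULL) {
        pthread_mutex_destroy(&s.lock);
        return -1;
    }

    while (started < DEMO_THREADS &&
           pthread_create(&tid[started], NULL, demo_worker, &s) == 0)
        started++;

    /* Step 1: tell the workers to stop. */
    pthread_mutex_lock(&s.lock);
    s.abort_flag = 1;
    pthread_mutex_unlock(&s.lock);

    /* Step 2: wait until every started worker has actually exited. */
    for (int i = 0; i < started; i++) {
        int rc = pthread_join(tid[i], NULL);
        assert(rc == 0);
        (void)rc;
    }

    /* Step 3: only now is it safe to release memory the workers used. */
    free(s.shared);
    pthread_mutex_destroy(&s.lock);
    return started == DEMO_THREADS ? 0 : -1;
}
```

Freeing before step 2 completes would reintroduce the window in which a worker still dereferences `shared`.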
Closing this issue in expectation of the problem now being fixed.
Still digging into exactly what happened here. Basically, the servers were being hit hard (and/or disk writes to ZFS were performing poorly, forming a bottleneck) such that pftool encountered a bumpy patch with many calls to ne_open() failing. Somewhere in that stretch one of the bq_writer threads (serving ne_write()) got a segfault.
The BufferQueue struct (BQ) shows flags (BQ_ERROR | BQ_ABORT), and the BQ.buffers are all NULL (the proximate cause of the segfault), suggesting (I think) that this handle should already have been abandoned. Yet we are still calling bq_enqueue().
[This is in the rmda_fs_impl branch, btw. I'd guess the problem is not particular to that branch.]
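As a defensive measure against the state described above (error/abort flags set, buffers NULL), an enqueue routine can refuse to touch a queue that has already been abandoned. The following is a minimal single-threaded sketch under assumed names: `demo_bq`, `demo_enqueue()`, and the flag values are hypothetical, and real code would hold the queue's lock while checking the flags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values; only the flag names echo the ones seen in the BQ. */
enum { DEMO_BQ_ERROR = 1 << 0, DEMO_BQ_ABORT = 1 << 1 };

#define DEMO_DEPTH 2   /* number of buffers in the queue */
#define DEMO_BUFSZ 8   /* size of each buffer in bytes */

typedef struct {
    unsigned flags;
    char *buffers[DEMO_DEPTH];
    size_t tail;       /* index of the next buffer to fill */
} demo_bq;

/* Copies len bytes into the next buffer; returns 0 on success, -1 on refusal. */
int demo_enqueue(demo_bq *bq, const char *src, size_t len)
{
    if (bq == NULL || src == NULL || len > DEMO_BUFSZ)
        return -1;

    /* A queue marked with an error or abort must not be written to. */
    if (bq->flags & (DEMO_BQ_ERROR | DEMO_BQ_ABORT))
        return -1;

    char *dst = bq->buffers[bq->tail];
    if (dst == NULL)
        return -1;     /* buffers already released: fail instead of crashing */

    memcpy(dst, src, len);
    bq->tail = (bq->tail + 1) % DEMO_DEPTH;
    return 0;
}
```

A guard like this turns the crash into an error return, but it only masks the symptom; the caller should not be holding an abandoned handle in the first place.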