utsaslab / WineFS

WineFS (SOSP 21): a huge-page aware file system for persistent memory
34 stars 2 forks source link

Possible crash consistency issue with truncate using multiple file descriptors #9

Open hayley-leblanc opened 2 years ago

hayley-leblanc commented 2 years ago

Hi Rohan,

I think I've found a new crash consistency issue in WineFS that occurs if multiple file descriptors are used to modify a file and we crash while increasing the file's size using truncate. The bug is a bit complicated and unfortunately I don't have a good way to replicate it outside of my testing infrastructure and don't fully understand the root cause yet, so let me know if anything doesn't make sense. I'm running WineFS in strict mode; I don't know if this occurs in relaxed mode as well.

Suppose we have a program that performs the following operations:

1. create file0
2. fallocate file0 in FALLOC_FL_KEEP_SIZE mode, offset 0x4ce0, len 0x1 
3. write 1 byte to file0
4. ftruncate file0 to 1 byte
5. write 1 byte to file0
6. ftruncate file0 to 64 bytes

The workload is a bit strange (it was generated by our fuzzer) but I've minimized it as much as I can without removing the ability to reproduce the issue.

If we open a file descriptor on line 1 and use it for the rest of the operations, everything works as expected. However, suppose we open a new file descriptor for file0 between lines 2 and 3 and use it for the write and ftruncate on lines 3 and 4, then open another one betwen lines 4 and 5 and use it for the final two operations (as in this program: contenthash-test.zip).

Now, let's say that we crash after the flush on line 2091 of inode.c (flushing the head of the truncate list in pmfs_truncate_add()) goes through and that data becomes durable when performing the second ftruncate(). When the file system is re-mounted and runs recovery, it will recover the truncate list and replay the truncate operation on the truncated inode. During normal/non-crashing execution, the truncate operation causes the extra bytes to be zeroed out, but it seems that this step is skipped here during recovery, and the recovered file can end up with some random garbage in it.

I haven't yet narrowed down a root cause for this; truncation during normal execution vs. recovery use mostly the same code, and I haven't had the opportunity to dig into where things differ. I suspect right now that the use of multiple file descriptors causes some weird regular-execution behavior where the persistent inode might not look the same as it would if we just used 1 file descriptor, so recovery goes a bit differently, but I'm not sure.

Let me know if this makes sense or you have any ideas about what might be going on. I'll keep looking into this issue as well.

paulwedeck commented 1 year ago

Hello, I further investigated this issue. I am not quite sure if this is really the bug you described but it can be reproduced very similarly. Instead of step 2, write two bytes to the file. If the following ftruncate crashes, the file content may be lost.

I think the bug works as follows:

While decreasing the tree height, the new height and new root block using one instruction. However, this operation is not guaranteed to be atomic. If only the new height is stored, the inode tree is in an invalid state. A read will not actually reach the leaves of the tree but only read from the inner nodes.

The underlying issue is that WineFS (like PMFS) uses cmpxchg16b to "atomically" store 16. However, it is not architecturally guaranteed that this is actually executed atomically. See this comment from Andy Rudoff or this Intel page:

In x86 processors, any store to memory has an atomicity guarantee of only eight bytes

Our testing tool splits this store into two 8 byte stores (first the height, then the root block). I think this is a reasonable approximation. Other splits are possible and may damage the file system in different ways.

One bug fix/workaround is to enclose this cmpxchg16b call in a transaction.