utsaslab / pebblesdb

The PebblesDB write-optimized key-value store (SOSP 17)
BSD 3-Clause "New" or "Revised" License

PebblesDB does not discard partially-flushed values #28

Open mj-ramos opened 1 year ago

mj-ramos commented 1 year ago

Verified in:

What happened: After a power failure while adding values to PebblesDB with the verify_checksums and paranoid_checks parameters set to true, the database becomes corrupted. After applying the recovery method suggested in https://github.com/google/leveldb/blob/main/doc/index.md (using RepairDB), a value that was only partially persisted is still present.

The root cause of the problem is that some writes to the log file exceed the typical size of a page in the page cache. This can result in a "torn write" scenario where only part of the write's payload is persisted while the rest is not, since the pages of the page cache can be flushed out of order. There are several references about this problem:

This problem was already reported in leveldb https://github.com/google/leveldb/issues/251 and does not exist in the latest release (1.23).
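To make the torn-write scenario concrete, here is a minimal sketch of how a per-record checksum can expose a write whose middle page never reached disk. It is an illustration under stated assumptions, not PebblesDB's or LevelDB's actual log format: the names `Checksum`, `TearWrite`, and `RecordIsIntact` are hypothetical, FNV-1a stands in for a real record checksum, and unpersisted pages are modeled as zero bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed page size for this sketch (matches the 4096-byte parts in the report).
constexpr std::size_t kPageSize = 4096;

// FNV-1a 32-bit hash, used here only as a stand-in for a real record checksum.
std::uint32_t Checksum(const std::string& data) {
  std::uint32_t h = 2166136261u;
  for (unsigned char c : data) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Models a crash in which only some page-sized parts of one write reach disk.
// persisted[i] says whether page i was flushed; unflushed pages read back as zeros.
std::string TearWrite(const std::string& payload, const std::vector<bool>& persisted) {
  assert(persisted.size() * kPageSize >= payload.size());
  std::string on_disk(payload.size(), '\0');
  for (std::size_t page = 0; page * kPageSize < payload.size(); ++page) {
    if (!persisted[page]) continue;
    const std::size_t off = page * kPageSize;
    const std::size_t len = std::min(kPageSize, payload.size() - off);
    on_disk.replace(off, len, payload, off, len);
  }
  return on_disk;
}

// A record is accepted only if the checksum stored with it matches the bytes read back.
bool RecordIsIntact(std::uint32_t stored_checksum, const std::string& on_disk) {
  return Checksum(on_disk) == stored_checksum;
}
```

With a 12288-byte payload, calling `TearWrite(payload, {true, false, true})` yields a buffer whose second page is zeroed, and `RecordIsIntact` then rejects it. A recovery path that skips this kind of check can surface the torn record as if it were valid.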

How to reproduce: This issue can be reproduced using LazyFS, a file system that can simulate power failures and the OS behavior described above, i.e., file system pages being persisted to disk out of order. The core of the problem is a 12288-byte write to the file 000003.log. LazyFS persists 4096-byte portions of this write out of order and then crashes, simulating a power failure. To reproduce the problem, follow these steps (the files mentioned below, such as write_test.cpp, are in the archive pebblesdb_test.zip):

  1. Mount LazyFS on the directory where PebblesDB data will be saved, with a specified root directory. Assuming the data path for PebblesDB is /home/pebblesdb/data and the root directory is /home/pebblesdb/data-r, add the following lines to the default configuration file (located at config/default.toml):

    [[injection]]
    type="split_write"
    file="/home/pebblesdb/data-r/000003.log"
    persist=[1,3]
    parts=3
    occurrence=4

    These lines define the fault to be injected. A power failure will be simulated after a write to the /home/pebblesdb/data-r/000003.log file. Since this write is large (12288 bytes), it is split into 3 parts (4096 bytes each), and only the first and third parts are persisted. The occurrence parameter specifies that the fault applies to the fourth write issued to this file.

  2. Start LazyFS with the following command: ./scripts/mount-lazyfs.sh -c config/default.toml -m /home/pebblesdb/data -r /home/pebblesdb/data-r -f

  3. Compile and execute the write_test.cpp file, which adds 4 key-value pairs to PebblesDB; the third pair is the only one that exceeds the size of a page in the page cache.

Immediately after this step, PebblesDB will shut down because LazyFS was unmounted, simulating the power failure. At this point, you can analyze the logs produced by LazyFS to see the system calls issued until the moment of the fault. Here is a simplified version of the log:

{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '262144', 'off': '0'}
{'syscall': 'read', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '131072', 'off': '0'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '4096', 'off': '0'}
{'syscall': 'fsync', 'path': '/home/pebblesdb/data-r/000003.log'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '4096', 'off': '0'}
{'syscall': 'fsync', 'path': '/home/pebblesdb/data-r/000003.log'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '12288', 'off': '0'}

  4. Remove the fault from the configuration file and unmount the filesystem with fusermount -uz /home/pebblesdb/data
  5. Mount LazyFS again with the previously provided command.
  6. Attempt to start PebblesDB (it fails).
  7. Compile and execute the repair.cpp file, which recovers the database.
  8. Compile and execute the read_test.cpp file, which reads and checks the values previously inserted. The value for the key k3 is only part of the initial value.
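As a quick check on the fault configured in step 1, the following sketch computes which byte ranges of the 12288-byte write survive when it is split into 3 parts and only parts 1 and 3 are persisted. It only mirrors the semantics of `parts` and `persist` as described in this report; `PersistedRanges` and `ByteRange` are hypothetical names, and this is not LazyFS code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open byte range [first, second) within a single write.
using ByteRange = std::pair<std::size_t, std::size_t>;

// Returns the byte ranges that reach disk when a write of write_size bytes is
// split into `parts` equal parts and only the 1-based indices in `persist`
// are persisted. Assumes write_size is a multiple of parts.
std::vector<ByteRange> PersistedRanges(std::size_t write_size, std::size_t parts,
                                       const std::vector<std::size_t>& persist) {
  assert(parts > 0 && write_size % parts == 0);
  const std::size_t part_size = write_size / parts;
  std::vector<ByteRange> ranges;
  for (std::size_t index : persist) {
    assert(index >= 1 && index <= parts);
    ranges.emplace_back((index - 1) * part_size, index * part_size);
  }
  return ranges;
}
```

For `PersistedRanges(12288, 3, {1, 3})` the result is [0, 4096) and [8192, 12288), so bytes 4096 to 8191 of the write are lost, which is the gap that leaves the value for k3 incomplete.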

Note that when paranoid_checks and verify_checksums are set to false, PebblesDB does not fail on restart and discards the partial value of the key k3 (it reports that this key does not exist).
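The two outcomes above (key missing versus key present with a partial value) can be told apart with a small check like the one below. This is a sketch of what such a check might look like, not the contents of read_test.cpp: `ReadOutcome` and `Classify` are hypothetical names, and treating a "partial" value as a strict prefix of the expected one is an assumption. It requires C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>

// Possible results of reading one key back after recovery.
enum class ReadOutcome { kIntact, kMissing, kPartial, kDifferent };

// Compares the value read back (std::nullopt if the key was not found)
// against the value that was originally written.
ReadOutcome Classify(const std::string& expected,
                     const std::optional<std::string>& actual) {
  if (!actual) return ReadOutcome::kMissing;
  if (*actual == expected) return ReadOutcome::kIntact;
  // Assumption: a partial value is a strictly shorter prefix of the original.
  if (actual->size() < expected.size() &&
      expected.compare(0, actual->size(), *actual) == 0) {
    return ReadOutcome::kPartial;
  }
  return ReadOutcome::kDifferent;
}
```

Under this classification, discarding the torn record corresponds to `kMissing` for k3, while the behavior reported after RepairDB corresponds to a non-intact value being returned.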