utsaslab / SplitFS

SplitFS: persistent-memory file system that reduces software overhead (SOSP 2019)
https://www.cs.utexas.edu/~vijay/papers/sosp19-splitfs.pdf

Append/writes recovery failure due to inconsistent inode numbers #66

Open OmSaran opened 2 years ago

OmSaran commented 2 years ago

The append recovery logic currently depends on the inode numbers of the file and of the staging file, both of which are stored in the append log. However, the inode number of the file can change during recovery in some scenarios.

Consider the following example happening in order (SplitFS Strict mode):

  1. file1 is created with size 0. Let its inode number be 1. A LOG_FILE_CREATE entry is written to the op log.
  2. An append is issued on file1. The contents are written to a staging file with, say, inode number 2. An append log entry is also created, storing the source (1) and destination (2) inode numbers.
  3. The system crashes (power failure) before any fsync. Let's assume the crash happens after the append/write call has returned to the application.

During recovery, the following happens:

  1. Op log recovery replays the LOG_FILE_CREATE entry from step 1 of the example and re-creates the file via ext4-dax (file1 was lost because it was never fsynced, so it can only be restored through log recovery). The new inode number is not guaranteed to be 1. Let's say it is now 3.
  2. Append log recovery attempts to relink the file using the logged inode number (1), which is no longer valid because the file now has inode number 3, and thus the append is lost.
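
The mismatch can be illustrated with a toy model in C. Everything here is hypothetical: the struct layout, field names, and helper functions are not taken from the SplitFS source. It only shows why an append log entry that stores raw inode numbers goes stale once op log recovery re-creates the file under a different inode, and it assumes the staging file itself survived the crash.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ino_t */

/* Hypothetical append log entry: raw inode numbers captured before the crash. */
struct append_entry {
    ino_t file_ino;      /* file being appended to (1 in the example) */
    ino_t staging_ino;   /* staging file holding the appended data (2) */
};

/* Return 1 if ino is among the inodes that exist after op log recovery. */
int ino_is_live(const ino_t *live, size_t n, ino_t ino)
{
    for (size_t i = 0; i < n; i++) {
        if (live[i] == ino)
            return 1;
    }
    return 0;
}

/* Return 1 if the entry can be relinked as logged, 0 if either inode is stale. */
int append_entry_usable(const struct append_entry *e,
                        const ino_t *live, size_t n)
{
    assert(e != NULL);
    return ino_is_live(live, n, e->file_ino) &&
           ino_is_live(live, n, e->staging_ino);
}
```

With the numbers from the example, the logged entry is {1, 2} while the inodes present after op log recovery are 2 and 3, so `append_entry_usable` returns 0 and the relink cannot proceed as logged.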

One possible fix is to track old and new inode numbers during op log recovery by building a mapping from each logged (old) inode number to the inode number of the re-created file. Append log recovery would then consult this mapping and use the new inode number in place of the old one.
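
A minimal sketch of such a mapping in C follows. The names `ino_remap`, `remap_add`, and `remap_lookup`, as well as the fixed-size linear table and its `REMAP_MAX` cap, are assumptions made for illustration and are not part of the SplitFS code; a real implementation would likely use whatever hash table the recovery path already has.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ino_t */

#define REMAP_MAX 1024   /* hypothetical cap on files re-created in one recovery */

/* Old-to-new inode table, filled in while the op log is replayed. */
struct ino_remap {
    ino_t  old_ino[REMAP_MAX];
    ino_t  new_ino[REMAP_MAX];
    size_t count;
};

/* Record that the file logged as old_ino was re-created as new_ino.
 * Returns 0 on success, -1 if the table is full. */
int remap_add(struct ino_remap *m, ino_t old_ino, ino_t new_ino)
{
    assert(m != NULL);
    for (size_t i = 0; i < m->count; i++) {
        if (m->old_ino[i] == old_ino) {   /* overwrite an existing mapping */
            m->new_ino[i] = new_ino;
            return 0;
        }
    }
    if (m->count == REMAP_MAX)
        return -1;
    m->old_ino[m->count] = old_ino;
    m->new_ino[m->count] = new_ino;
    m->count++;
    return 0;
}

/* Translate an inode number read from the append log.
 * Inodes that were not re-created are returned unchanged. */
ino_t remap_lookup(const struct ino_remap *m, ino_t ino)
{
    assert(m != NULL);
    for (size_t i = 0; i < m->count; i++) {
        if (m->old_ino[i] == ino)
            return m->new_ino[i];
    }
    return ino;
}
```

In the example, op log recovery would call `remap_add(&m, 1, 3)` after re-creating file1, and append log recovery would pass each logged inode number through `remap_lookup` before relinking, so the logged 1 resolves to 3 while the staging inode 2 is left as is.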