mongodb-partners / mongo-rocks

MongoDB storage integration layer for the Rocks storage engine

Data loss -- Fsync parent directory on file creation and rename #35

Open aganesan4 opened 8 years ago

aganesan4 commented 8 years ago

I am running a three-node MongoDB cluster, using MongoDB 3.0.11 with RocksDB as the storage engine. When I insert a new item into the store, I set w=3, j=True. Running strace on mongod, these are the file-system operations that happen on the node:

creat("data_dir/db/000004.sst") append("data_dir/db/000004.sst") fdatasync("data_dir/db/000004.sst") creat("data_dir/db/MANIFEST-000005") append("data_dir/db/MANIFEST-000005") fdatasync("data_dir/db/MANIFEST-000005") creat("data_dir/db/000005.dbtmp") append("data_dir/db/000005.dbtmp") fdatasync("data_dir/db/000005.dbtmp") rename(source="data_dir/db/000005.dbtmp", dest="data_dir/db/CURRENT") unlink("data_dir/db/MANIFEST-000001") creat("data_dir/db/journal/000006.log") unlink("data_dir/db/journal/000003.log") fsync("data_dir/db") trunc("data_dir/mongod.lock") ----client insert request---- append("data_dir/db/journal/000006.log") ----client ack----

When a new file is created or a file is renamed, the parent directory needs to be explicitly fsynced for the new directory entry to be persisted. Please see https://www.quora.com/Linux/When-should-you-fsync-the-containing-directory-in-addition-to-the-file-itself and http://research.cs.wisc.edu/wind/Publications/alice-osdi14.pdf. If the node crashes before the new file is persisted, the log file and any further appends to it might be lost. If the crash happens on two or more nodes of a three-node cluster, one of those nodes could become the leader and global data loss is possible. We have reproduced this particular data loss issue using our testing framework.
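
For illustration, the pattern being described is roughly the following POSIX sequence (a minimal sketch, not MongoRocks or RocksDB code; the helper name and error handling are made up): fdatasync the file itself, then open and fsync the directory that contains it so the new directory entry is also durable.

// Minimal sketch of durable file creation on POSIX (illustrative only).
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <string>

bool CreateFileDurably(const std::string& dir, const std::string& name,
                       const char* data, size_t len) {
  const std::string path = dir + "/" + name;

  // Create the file and persist its contents.
  int fd = ::open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC, 0644);
  if (fd < 0) return false;
  bool ok = (::write(fd, data, len) == static_cast<ssize_t>(len)) &&
            (::fdatasync(fd) == 0);
  ::close(fd);
  if (!ok) return false;

  // Fsync the parent directory so the new directory entry survives a crash.
  int dfd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
  if (dfd < 0) return false;
  ok = (::fsync(dfd) == 0);
  ::close(dfd);
  return ok;
}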

Similarly, if the sst file or the MANIFEST file goes missing after a subsequent crash because the directory was not fsynced, the node fails to start again, which could leave the cluster unavailable for quorum writes. As a fix, it would be safe to fsync the parent directory on creat or rename of files.
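
The same applies to the rename in the trace above (000005.dbtmp to CURRENT): the rename is only crash-safe once the containing directory has been fsynced. Again a minimal sketch under the same assumptions (hypothetical helper, not project code):

#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <string>

bool RenameDurably(const std::string& dir, const std::string& from,
                   const std::string& to) {
  if (::rename((dir + "/" + from).c_str(), (dir + "/" + to).c_str()) != 0) {
    return false;
  }
  // Persist the updated directory entries so the rename survives a crash.
  int dfd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
  if (dfd < 0) return false;
  bool ok = (::fsync(dfd) == 0);
  ::close(dfd);
  return ok;
}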

mdcallag commented 8 years ago

We have discussed this before. I thought RocksDB was doing the right thing but I haven't looked at that code recently. I see places where it is likely done...

find . -type f -name \*\.cc -print | xargs grep -i sync | grep -i direct
./utilities/backupable/backupable_db.cc:      backup_private_directory->Fsync();
./utilities/backupable/backupable_db.cc:      private_directory_->Fsync();
./utilities/backupable/backupable_db.cc:      meta_directory_->Fsync();
./utilities/backupable/backupable_db.cc:      shared_directory_->Fsync();
./utilities/backupable/backupable_db.cc:      backup_directory_->Fsync();
./utilities/checkpoint/checkpoint.cc:      s = checkpoint_directory->Fsync();
./utilities/env_librados.cc:  // Fsync directory. Can be called concurrently from multiple threads.
./utilities/persistent_cache/persistent_cache_test.cc:  rocksdb::SyncPoint::GetInstance()->SetCallBack("NewRandomAccessFile:O_DIRECT",
./utilities/persistent_cache/persistent_cache_test.cc:  rocksdb::SyncPoint::GetInstance()->SetCallBack("NewWritableFile:O_DIRECT",
./utilities/persistent_cache/persistent_cache_test.cc:  rocksdb::SyncPoint::GetInstance()->SetCallBack("NewRandomAccessFile:O_DIRECT",
./db/db_impl.cc:      s = directories_.GetWalDir()->Fsync();
./db/db_impl.cc:    status = directories_.GetWalDir()->Fsync();
./db/db_impl.cc:          // We only sync WAL directory the first time WAL syncing is
./db/db_impl.cc:          status = directories_.GetWalDir()->Fsync();
./db/db_impl.cc:      s = impl->directories_.GetDbDir()->Fsync();
./db/filename.cc:                      Directory* directory_to_fsync) {
./db/filename.cc:    if (directory_to_fsync != nullptr) {
./db/filename.cc:      directory_to_fsync->Fsync();
./db/compaction_job.cc:  if (output_directory_ && !db_options_.disableDataSync) {
./db/compaction_job.cc:    output_directory_->Fsync();
./db/version_set.cc:                         db_options_->disableDataSync ? nullptr : db_directory);
./db/flush_job.cc:    if (!db_options_.disableDataSync && output_file_directory_ != nullptr) {
./db/flush_job.cc:      output_file_directory_->Fsync();
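
For context, those call sites go through RocksDB's Directory abstraction from include/rocksdb/env.h. A rough sketch of how it is used (the standalone helper below is illustrative, not code from the tree):

#include <memory>
#include <string>
#include "rocksdb/env.h"
#include "rocksdb/status.h"

rocksdb::Status FsyncDbDir(const std::string& db_path) {
  std::unique_ptr<rocksdb::Directory> dir;
  rocksdb::Status s = rocksdb::Env::Default()->NewDirectory(db_path, &dir);
  if (!s.ok()) return s;
  // On the POSIX Env this is effectively open(db_path) + fsync(fd).
  return dir->Fsync();
}
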
igorcanadi commented 8 years ago

Thanks for the bug report. We do fsync the parent directory on the first WAL write. However, we do it only if you pass in the sync flag with your write. MongoRocks 3.0 has a known issue where it doesn't pass in the fsync flag even when it was requested. The bug is fixed in MongoRocks 3.2; can you please try upgrading?

This is where the parent directory fsync happens in RocksDB: https://github.com/facebook/rocksdb/blob/master/db/db_impl.cc#L4771
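
For anyone checking this after upgrading, the sync flag in question maps to RocksDB's WriteOptions::sync; a minimal sketch of a synced write (hypothetical path, not MongoRocks code):

#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
  assert(s.ok());

  rocksdb::WriteOptions write_options;
  write_options.sync = true;  // fsync the WAL before acknowledging the write
  s = db->Put(write_options, "key", "value");
  assert(s.ok());

  delete db;
  return 0;
}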