GoogleCodeExporter opened this issue 9 years ago
I'm not sure this is a good one to fix. I could see the case for fsync'ing the
newly created .sst file, MANIFEST & CURRENT files to ensure the data is not
lost, but isn't any journaling filesystem going to be required to create a
separate transaction for the unlink() following a rename()? Forcing the fsync
on the dir in the middle there just seems like extra pain, particularly on a
filesystem with a write barrier. Am I missing something?
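For concreteness, the sequence being debated looks roughly like the sketch below (illustrative file names only, not LevelDB's actual code); the contested step is the directory fsync between the rename() and the unlink().

```cpp
#include <fcntl.h>
#include <unistd.h>

// Illustrative sketch only; paths and exact ordering are hypothetical.
void install_new_file() {
  // 1. Write and fsync the newly created file (uncontroversial).
  int fd = open("dbdir/CURRENT.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  // ... write new contents ...
  fsync(fd);
  close(fd);

  // 2. Atomically publish it under its final name.
  rename("dbdir/CURRENT.tmp", "dbdir/CURRENT");

  // 3. The contested step: fsync the directory so the rename reaches disk
  //    before the unlink below possibly does.
  int dfd = open("dbdir", O_RDONLY | O_DIRECTORY);
  fsync(dfd);
  close(dfd);

  // 4. Delete the now-obsolete file.
  unlink("dbdir/MANIFEST-000001");
}
```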
Original comment by cbsm...@gmail.com
on 30 Sep 2013 at 9:36
Quick question: Is performance the "pain" you talk about? Since you do a couple
of fsync calls while opening the database anyway, this would probably cost you
a maximum of 30 ms with a hard drive ...
My two cents: ext3/ext4 always persist the rename() before (or along with) the
unlink(), and I don't think that's going to change in the future. Being a
filesystem student myself though, I disagree with "any journaling filesystem
going to be required" - I can imagine (in my crazy head) delaying renames the
same way people did delayed allocations. It wouldn't serve any real purpose I
can think of, however. (Maybe somebody decides that immediately sending an
unlink to disk will help security? or maybe to get more space? I don't know).
I'm more worried about COW filesystems like Btrfs.
Original comment by madthanu@gmail.com
on 30 Sep 2013 at 10:49
I can see delaying when these are fully reflected in the filesystem, but don't
you have to commit them to the journal, ensuring that *eventually* the rename
will be reflected on disk? Doesn't that force perceived ordering at the VFS
level?
The performance pain I'm thinking of is the additional write barrier, which can
potentially be pretty nasty, particularly if the system has a lot of entries in
the directory.
Original comment by cbsm...@gmail.com
on 1 Oct 2013 at 7:37
Eventually on disk - Yes. Perceived ordering at the VFS level - No. The reason
is, by "delaying", I mean "delaying the rename *past* future unlinks". That is,
the unlink might be made part of a transaction that gets committed before the
rename, even though the user performed the unlink after the rename.
Explanation: For example, consider delayed allocation in ext4. When you append
to a file, new inode pointers (internal filesystem stuff) are created for the
new data; these pointers are filesystem metadata. ext3 used to happily commit
this metadata to the journal in order, and everything was fine and dandy,
except for performance. ext4 came in and delayed this pointer metadata *past*
other filesystem operations, for performance. Everything went crazy when it was
delayed past renames, because everybody was using "write(); rename();" for
atomicity. (There are other parts of the story, but that's for another day.)
However, delayed appends still mean the newly appended data will eventually end
up on-disk; just not before the renames.
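(For context, the "write(); rename();" idiom everybody was relying on is roughly the sketch below; without the fsync() before the rename, delayed allocation could leave the renamed file empty or truncated after a crash. Error handling is omitted.)

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Minimal sketch of the atomic-replace idiom; error handling omitted.
void atomic_replace(const char* tmp_path, const char* final_path,
                    const char* data) {
  int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  write(fd, data, strlen(data));
  fsync(fd);                     // force the data out before publishing the name
  close(fd);
  rename(tmp_path, final_path);  // atomically swap in the new contents
}
```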
Again, "delaying renames past unlinks" doesn't apply to the current ext3/ext4
filesystems - the perceived ordering at the VFS layer is perfectly preserved in
those filesystems. And, ext3/ext4 would need to do *additional* stuff to
implement delayed renames; that makes sense only if there is a performance (or
feature) advantage, as there was with delayed allocations.
That said, I think when the btrfs guys fixed the filesystem to make
"write(); rename();" atomic (without an fsync), they made unlinks delay past
renames as a side-effect. (That's the opposite of what we are worried about.)
So, as far as I can see, the filesystem guys don't find anything wrong with
reordering renames and unlinks. There could be some filesystem out there that
has a similar side-effect harming us, while doing some crazy performance
optimization.
Performance: There are two concerns here.
In ext3, if you do "sleep(5); write(1-byte); fsync();", that fsync could take
a really long time. But if you do "sleep(5); fsync(); write(1-byte);
fsync();", the *second* fsync will take at most around 30 ms. The first fsync
will take long because of the well-known "ext3 sends *everything* to disk when
you fsync only one file" problem. In the second fsync, most of the
"everything" has already been sent to disk the first time, so it is fast. ext3
is the worst-performing filesystem I know of in this respect, so all other
filesystems should perform better.
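(A rough way to reproduce this is sketched below; the file name is illustrative, and the numbers will vary wildly with the filesystem, mount options, and disk.)

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

// Time a single fsync() call in milliseconds.
static double timed_fsync(int fd) {
  auto t0 = std::chrono::steady_clock::now();
  fsync(fd);
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  sleep(5);  // let unrelated dirty data accumulate system-wide
  printf("first fsync:  %.1f ms\n", timed_fsync(fd));  // may flush "everything"
  write(fd, "x", 1);
  printf("second fsync: %.1f ms\n", timed_fsync(fd));  // should be much faster
  close(fd);
  return 0;
}
```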
About the actual directory being large - I guess the only thing that ends up
on disk after you flush the directory is any changes to the directory entries.
So, unless we have a thousand files created, deleted, and renamed, it wouldn't
be a big deal (unless the filesystem does something else additionally when
fsyncing a directory). LevelDB creates the directory, so it's kinda owned by
LevelDB, and we probably wouldn't have a thousand files.
Everything said, 30 ms still sucks ... I can see not adding an additional fsync
for this reason, but it would be nice to create a "leveldb's assumptions about
the underlying filesystem" topic somewhere in the documentation.
Original comment by madthanu@gmail.com
on 1 Oct 2013 at 9:19
PS: I'm only a grad student studying filesystems, so take my replies with a
pinch of salt. No Ted Ts'o here.
Original comment by madthanu@gmail.com
on 1 Oct 2013 at 9:33
I'll take your filesystem expertise over mine any day.
I can certainly understand how data can be missing from the files without some
kind of explicit call to flush the data to disk (which was my understanding of
the "O_PONIES" issue). The bad outcome from the write(); rename(); problem was
*not* a 0-length file. It was a file with a length reflecting the write, but
which did not actually contain the *data* from the write (on XFS it would be
filled with 0's; on some other filesystems you might see random data from
previous files). However, that is a case where you have a race between data &
metadata.
In the case of a rename() followed by an unlink() though... these are both
metadata operations. Now, if you don't have metadata journaling in your
filesystem, all bets are off, but assuming you do... I think the scenario you
are describing shouldn't be allowed to happen.
Granted, without an fsync() on the directory, you can't be sure that *either*
the rename() or the unlink() made it to disk, but because they are invoked
sequentially, I think you can guarantee that if the unlink() made it to the
journal, so did the preceding rename(). I say this partly because one of the
complaints I hear from parallel-filesystem people is about how POSIX imposes
some annoying serialization of metadata operations that they wish never
happened, but also partly because, specifically with rename() and unlink(),
the filesystem needs to order the operations in its own internal bookkeeping
just to be sure it unlinks the right file (how do you know which inode a path
refers to if you don't serialize operations?).
Now, I can understand why a filesystem might reorder when any number of
metadata operations are actually completed on disk, but it
doesn't make any sense (and opens a pretty big can of worms in terms of API
contracts) to delay recording metadata operations in the journal. The rename()
and unlink() might actually get written back from the journal in whatever
order you can imagine, but that of course is of no consequence in terms of metadata
loss, because once the metadata is committed to the journal, it should not get
lost.
Given what I'm saying, are you sure there is a real risk?
Regarding the performance stuff... The fix here would only be to add an
fsync() on the directory, so no data would need to be committed. LevelDB does
kind of "own" the directory, but by its nature it creates a lot of files in
the directory and periodically creates and destroys them. So there is
potentially a lot of metadata that hasn't been committed to disk. Not insane
levels, but still... not sure it is worth belt-and-suspendering this aspect of
things.
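(The fix under discussion amounts to a helper like the sketch below; error handling is simplified.)

```cpp
#include <fcntl.h>
#include <unistd.h>

// Sketch of the proposed fix: fsync() the directory itself so that pending
// directory-entry changes (renames, unlinks) are forced to disk.
int sync_dir(const char* dir_path) {
  int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
  if (dfd < 0) return -1;
  int rc = fsync(dfd);
  close(dfd);
  return rc;
}
```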
Original comment by cbsm...@gmail.com
on 5 Oct 2013 at 12:24
While cbsm's reasoning about why metadata operations will not be re-ordered
holds true for simple metadata-journaling file systems, I wouldn't assume the
same for most modern file systems. For example, btrfs clearly re-orders
metadata operations [1]. Also, when ext4 was introduced, the bad outcome from a
write(); rename(); *was a 0 length file* [2] (and not a garbage-filled file,
which was a problem when people used ext3-writeback). In general, while it is
not simple to do so, even "metadata journaling" file systems are trying more
and more to re-order metadata operations to improve performance. There is no
requirement by any standard (POSIX or otherwise) to maintain a sequential
ordering (even for directory operations).
For now, I do not think btrfs, xfs, or zfs re-order operations in the exact
way that affects LevelDB. I'm certain that ext4 and ext3 do not (for now).
Future iterations of these file systems probably would. So, I do not know how
much LevelDB has to look into this issue. However, with the new versions of
LevelDB, an fsync() is being done on the directory (while creating the MANIFEST
file) anyway; I guess there will only be an additional 30 ms delay when
opening the database, even if another fsync() is added to address this issue
(on an HDD-based desktop). If performance is a concern here, it might make sense
to remove all the directory fsync() calls that are being done now - afaik, all
*modern* file systems (not ext2) persist the directory entry when you do an
fdatasync() on the actual file.
Also, I found that this problem magically disappears if you do RepairDB after a
power failure, and I have always been doing that. I guess RepairDB reconstructs
the MANIFEST file and such. I'm not sure whether this is a recommendable
strategy, though.
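(For reference, the RepairDB call mentioned above is sketched below; the database path is illustrative.)

```cpp
#include <cassert>
#include "leveldb/db.h"

// Sketch: repair the database after a crash, then open it as usual.
// "/path/to/db" is an illustrative path.
int main() {
  leveldb::Options options;
  leveldb::Status s = leveldb::RepairDB("/path/to/db", options);
  assert(s.ok());

  leveldb::DB* db = nullptr;
  s = leveldb::DB::Open(options, "/path/to/db", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```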
[1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg31937.html
[2] http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
Original comment by madthanu@gmail.com
on 23 Jul 2014 at 8:40
Original issue reported on code.google.com by
madthanu@gmail.com
on 17 Jul 2013 at 3:34