The easy answer here is to say that if a client doesn't want this to happen, it
shouldn't request an async write followed by a sync write. Further, there could
be scenarios where the described behavior is desired: the first write is
considered "unimportant" and the client prefers to risk losing it rather than
waiting for the disk sync, but the second write is "important" and worth the
performance hit to ensure it is not lost.
Who knows how common such a scenario is, but I think it's worth pointing out.
There's also the question of whether offering clients such flexibility is worth
the confusion that you encountered.
Original comment by dgrogan@chromium.org
on 17 Jul 2013 at 6:58
I'm usually wary of the exact guarantees given by any storage library when it
comes to things like these, but here's the statement in the documentation that
led me astray: "A hybrid scheme is also possible where every Nth write is
synchronous, and in the event of a crash, the bulk load is restarted just after
the last synchronous write finished by the previous run. (The synchronous write
can update a marker that describes where to restart on a crash.)"
As I understand it, that "scheme" just won't work unless it is guaranteed that
the async writes are ordered before the synchronous writes.
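Just to make the scheme from the documentation concrete, this is roughly what I understand it to mean (only a sketch; the marker key, the `kSyncEvery` interval, and the record layout are things I made up for illustration, not anything from the documentation):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Bulk load with every kSyncEvery-th write synchronous. The synchronous
// write also updates a restart marker; after a crash, the loader re-reads
// the marker and resumes from there. This only works if everything written
// *before* a synchronous write is guaranteed to be on disk once that
// synchronous write returns.
void BulkLoad(leveldb::DB* db,
              const std::vector<std::pair<std::string, std::string>>& records) {
  const size_t kSyncEvery = 1000;  // hypothetical interval

  size_t start = 0;
  std::string marker;
  if (db->Get(leveldb::ReadOptions(), "bulkload-restart-marker", &marker).ok()) {
    start = std::stoul(marker);  // restart just after the last synced write
  }

  for (size_t i = start; i < records.size(); i++) {
    leveldb::WriteBatch batch;
    batch.Put(records[i].first, records[i].second);

    leveldb::WriteOptions options;  // options.sync defaults to false
    if ((i + 1) % kSyncEvery == 0) {
      options.sync = true;  // the Nth, synchronous write...
      batch.Put("bulkload-restart-marker", std::to_string(i + 1));  // ...carries the marker
    }
    leveldb::Status s = db->Write(options, &batch);
    assert(s.ok());
  }
}
```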
Also, I assumed the following (I have no idea where I got this assumption from; probably from some old emails on the mailing list):
(1) given a couple of asynchronous Put()s that all go to the same log file, and
(2) if a power loss happens and some of them hadn't actually reached the disk, then
(3) during recovery, LevelDB detects the Put()s that have not gone to the disk and *discards* all subsequent Put()s, to ensure that the ordering of the Put requests is maintained.
Is that not true?
I also have some opinions on what the default guarantee should be, particularly regarding performance, but I guess I should post that to the mailing list instead, right? (Or is this the right place to do it?)
Original comment by madthanu@gmail.com
on 17 Jul 2013 at 8:23
I had forgotten about that suggestion.
A nonsurgical fix for this issue would be to add a `logfile_->Sync()` just
before `delete logfile_` near the end of DBImpl::MakeRoomForWrite.
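Roughly something like this, in the log-switch branch of DBImpl::MakeRoomForWrite (a sketch against the code as I remember it, with the surrounding error handling and memtable switch elided; it may not match the current tree exactly):

```cpp
// db/db_impl.cc, in the branch of DBImpl::MakeRoomForWrite() that switches
// to a fresh log file (abbreviated; not an exact copy of the tree).
uint64_t new_log_number = versions_->NewFileNumber();
WritableFile* lfile = NULL;
s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);
if (s.ok()) {
  // Proposed fix: push everything appended to the old log file to disk
  // before abandoning it. Without this, records written with sync=false can
  // be lost even though a later sync=true write to the *new* log file made
  // it to disk. (A real patch would check the returned Status.)
  logfile_->Sync();

  delete log_;
  delete logfile_;
  logfile_ = lfile;
  logfile_number_ = new_log_number;
  log_ = new log::Writer(lfile);
  // ... switch memtables and call MaybeScheduleCompaction() as before ...
}
```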
As for your hypothetical steps, I don't think that's how it works. If those
Put()s hadn't made it to the log file on disk, how would LevelDB know about
them during recovery after a reboot?
Original comment by dgrogan@chromium.org
on 18 Jul 2013 at 12:18
As far as I understand the source code, that should probably be the easiest fix.
The hypothetical steps: Well, I was assuming, since there checksums for records
in the log file, they'd be checked when you recover. If a particular user
record hadn't gone to disk fully, the checksum wouldn't match, and you can
detect and discard that record. Just out of curiosity, is at least this much of
my assumption correct? Or are there chances that partly-formed records might be
recovered as well? I was also assuming that you'd discard all future records,
but I am obviously wrong on that (based on your views).
I should note here that if my hypothetical "discard every future record" steps
are wrong, and somehow LevelDB does recover every checksum-valid record, there
is a chance (albeit very tiny in the real world) that the "scheme" from the
documentation does not work. Not sure ... just mentioning. I can imagine the
Nth synchronous write fully reaching the disk (even before the msync completely returns), but the previous asynchronous record (in the same log file) not having persisted at the time of the power failure.
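This is the kind of check I had in mind when saying a torn record could be detected and dropped (illustrative only; I am going from memory of the log format, so the exact header layout and helpers may differ):

```cpp
#include <cstddef>
#include <cstdint>

#include "util/coding.h"   // leveldb::DecodeFixed32
#include "util/crc32c.h"   // leveldb::crc32c::Value, leveldb::crc32c::Unmask

// Each log record stores a masked crc32c of its contents. A record whose
// tail never reached the disk will (almost certainly) fail this comparison
// during recovery and can be discarded instead of being replayed.
bool RecordLooksIntact(const char* header, const char* contents, size_t length) {
  const uint32_t expected_crc = leveldb::crc32c::Unmask(leveldb::DecodeFixed32(header));
  const uint32_t actual_crc = leveldb::crc32c::Value(contents, length);
  return actual_crc == expected_crc;
}
```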
Original comment by madthanu@gmail.com
on 18 Jul 2013 at 12:40
In version 1.13, a SyncDirIfManifest() method was added and is called by Sync().
SyncDirIfManifest() fsyncs the current directory if the name of the target file starts with "MANIFEST".
As I understand it, the manifest file is changed at every compaction.
This means that, with this change, the old log file would not go missing after a power loss.
I think this change can at least solve the complete loss of the old log file.
Original comment by yongilj...@gmail.com
on 18 Jul 2014 at 12:30
If I understand it correctly, SyncDirIfManifest() does an fsync() on the
directory. This issue is not about such an fsync(); it is about an fsync() on
the log file itself. Particularly, when switching log files, the new log file
might be written to (and pushed to the disk by the OS) without the old log file
ever being pushed to disk (since no fsync() was done on it).
That said, the issue might be fixed in the current version; I am not sure.
When a Put() results in a log compaction, can it return before the old log file
gets compacted? Just looking at a simple strace, it does seem to.
Original comment by madthanu@gmail.com
on 23 Jul 2014 at 3:20
@madthanu
Yes, you are right.
My point was that we can still lose some of the data in a log file if we don't use the "sync: true" option, just not the entire contents of the log file.
My comment was about losing files whose entries have not been flushed to the inode of the current directory. During a compaction, the old log file is removed and a new sst file is created. But if we don't call fsync() on the directory, I think files can go missing from the directory, because this changed information is never flushed.
In this issue, you mentioned losing the contents of old log files. However, I think that is caused by the new sst file: the sst file is created but its directory entry is not flushed, so it can be lost on power-off.
(Why is the old log file also removed? Hmmm, I don't know...)
Original comment by yongilj...@gmail.com
on 10 Aug 2014 at 11:31
Original issue reported on code.google.com by
madthanu@gmail.com
on 17 Jul 2013 at 3:12