Closed GoogleCodeExporter closed 9 years ago
Here's what's happening. When the byte(s) at the begging of a .ldb file (or
.sst) file get corrupted, the call to Block::NumRestarts() can return a crazy
value, which makes the "mid" variable in Seek take on a crazy value, which is
passed to GetStartPoint, and ultimately gets passed to memcpy inside
DecodeFixed32 in coding.h. Kaboom.
One simple fix is to put a sanity check in the Block::NewIterator function.
Iterator* Block::NewIterator(const Comparator* cmp) {
if (size_ < sizeof(uint32_t)) {
return NewErrorIterator(Status::Corruption("bad block contents"));
}
const uint32_t num_restarts = NumRestarts();
if (num_restarts == 0) {
return NewEmptyIterator();
} else if (num_restarts > 4096){ //check for crazy values, use max block size as ceiling
return NewEmptyIterator();
} else {
return new Iter(cmp, data_, restart_offset_, num_restarts);
}
}
Original comment by jpflana...@gmail.com
on 15 Nov 2013 at 11:29
[Copied from an earlier email message on 2013/11/19]
The code already contains sanity checks for NumRestarts (the Block constructor
sets size_ to zero if it sees an obviously bad restart value and the iterator
constructor returns an error if size_ is zero).
The reason your changes cause a crash in corruption_test.cc is because the
corruption occurs between the block constructor and the iterator construction.
The inserted corruption affects the in-memory contents of the constructed Block
since on posix leveldb uses mmap and so the corrupted bytes show up inside the
Block.
We could add another check in the iterator constructor as you suggest, but that
won't really fully fix the problem. The corruption could have just as easily
happened after the iterator was constructed. To fully protect from such
corruptions, we would need to instrument every access into the block contents
in the fast path of iterator's operations. That's a pretty big undertaking and
will also slow down reads. And it might be very hard to make the code
bulletproof (think of for example the binary search going into an infinite loop
because a corruption violates sorted-order of keys).
Given that this is a testing induced failure, I am not sure it is worth
complications and slowdowns to deal with this case.
Original comment by san...@google.com
on 9 Apr 2014 at 4:43
Original issue reported on code.google.com by
jpflana...@gmail.com
on 15 Nov 2013 at 8:03Attachments: