Current BloomFilter sizing estimate is overly conservative and could result
in excessive memory usage. Consider the case where the MergeScanner returns
the following cells to be considered for insertion into a new CellStore:
InsertA
DeleteA
InsertB
InsertC
InsertD
DeleteD
InsertE
If this is a minor compaction it should contain 5 cells (Insert B,C,E and
Delete A,D) not 7.
If there is a major compaction only 3 cells need to be stored (Insert B,C,E)
Currently the number of items in the BloomFilter is estimated to be:
(n/s)*M, where n=number of unique keys inserted into the BloomFilter over a
sample of s scanned cells and M is the estimated number of entries in the
final CellStore.
Currently M = T = total # cells in CellCache (minor compaction/in mem table)
= total # cells in CellCache + CellStores (major compaction)
M does not account for deletes in the CellCache/Stores which reduce the
final number of entries in the new CellStore. A better estimate is:
M = f/s * T , where f is the number of cells that will be stored in the new
CellStore over a sample of s scanned cells.
The MergeScanner needs to maintain statistics about how many cells it has
scanned to produce the cells to be inserted into the new CellStore.
Original issue reported on code.google.com by sjha...@gmail.com on 7 Feb 2009 at 1:10
Original issue reported on code.google.com by
sjha...@gmail.com
on 7 Feb 2009 at 1:10