linearregression / hypertable

Automatically exported from code.google.com/p/hypertable
GNU General Public License v2.0
0 stars 0 forks source link

Improve BloomFilter sizing estimate #230

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Current BloomFilter sizing estimate is overly conservative and could result
in excessive memory usage. Consider the case where the MergeScanner returns
the following cells to be considered for insertion into a new CellStore:

InsertA
DeleteA
InsertB
InsertC
InsertD
DeleteD
InsertE

If this is a minor compaction it should contain 5 cells (Insert B,C,E and
Delete A,D) not 7. 
If there is a major compaction only 3 cells need to be stored (Insert B,C,E)

Currently the number of items in the BloomFilter is estimated to be:
(n/s)*M, where n=number of unique keys inserted into the BloomFilter over a
sample of s scanned cells and M is the estimated number of entries in the
final CellStore.

Currently M = T = total # cells in CellCache (minor compaction/in mem table)
                = total # cells in CellCache + CellStores (major compaction)

M does not account for deletes in the CellCache/Stores which reduce the
final number of entries in the new CellStore. A better estimate is:
M = f/s * T , where f is the number of cells that will be stored in the new
CellStore over a sample of s scanned cells. 

The MergeScanner needs to maintain statistics about how many cells it has
scanned to produce the cells to be inserted into the new CellStore.

Original issue reported on code.google.com by sjha...@gmail.com on 7 Feb 2009 at 1:10

GoogleCodeExporter commented 9 years ago

Original comment by nuggetwh...@gmail.com on 11 Apr 2010 at 9:27

GoogleCodeExporter commented 9 years ago

Original comment by nuggetwh...@gmail.com on 14 Jan 2012 at 8:32