tokee / lucene-solr

High cardinality faceting (SOLR-5894)
http://tokee.github.io/lucene-solr/

Adaptive sampling for heuristic faceting #45

Open tokee opened 9 years ago

tokee commented 9 years ago

Currently, sampling is handled by calculating a single chunk size over the full document count (maxDoc) and only updating counts for the documents that are set in the result bitmap. This means that the number of sampled documents can vary a great deal due to chance.

An alternative would be an adaptive strategy: instead of basing the chunk size on maxDoc, it can be calculated as hitCount / indexMaxDoc * segmentMaxDoc * samplingFactor / numChunks, i.e. the expected number of hits in the segment, scaled by the sampling factor and divided among the chunks. The first chunk is finished when the number of matching documents reaches the chunk size. The skip to the next chunk is then calculated from the current position in the result bitmap.
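A rough sketch of how this could look, not the actual patch. It uses java.util.BitSet as a stand-in for the result bitmap, and the class and method names are hypothetical. The skip rule is an assumption: chunk starts are taken to be evenly spaced at segmentMaxDoc / numChunks, and after a chunk is filled the position jumps to the next such boundary at or after the current position. Rounding the chunk size to at least 1 is also an assumption.

```java
import java.util.BitSet;

// Hypothetical illustration only; names and the skip rule are assumptions.
final class AdaptiveSampling {

    /** Hits to collect per chunk: hitCount/indexMaxDoc * segmentMaxDoc * samplingFactor / numChunks. */
    static int chunkSize(long hitCount, int indexMaxDoc, int segmentMaxDoc,
                         double samplingFactor, int numChunks) {
        double expectedSegmentHits = (double) hitCount / indexMaxDoc * segmentMaxDoc;
        // Assumption: round to nearest and never go below 1 hit per chunk.
        return (int) Math.max(1, Math.round(expectedSegmentHits * samplingFactor / numChunks));
    }

    /** Returns the documents that would have their counts updated. */
    static BitSet sample(BitSet hits, int segmentMaxDoc, int chunkSize, int numChunks) {
        BitSet sampled = new BitSet(segmentMaxDoc);
        int perChunk = Math.max(1, chunkSize);
        // Assumption: chunk starts are evenly spaced over the segment.
        int stride = Math.max(1, segmentMaxDoc / numChunks);
        int pos = 0;
        while (pos < segmentMaxDoc) {
            int taken = 0;
            int doc = hits.nextSetBit(pos);
            // A chunk is finished once it has seen perChunk matching documents.
            while (doc >= 0 && doc < segmentMaxDoc && taken < perChunk) {
                sampled.set(doc);
                taken++;
                pos = doc + 1;
                doc = hits.nextSetBit(pos);
            }
            if (doc < 0 || doc >= segmentMaxDoc) {
                break; // no matching documents left in this segment
            }
            // Skip from the current bitmap position to the next chunk boundary.
            pos = ((pos + stride - 1) / stride) * stride;
        }
        return sampled;
    }
}
```

With every document in a 100-document segment matching, a sampling factor of 0.2 and 4 chunks, this sketch collects 5 documents at the start of each quarter of the segment. With unevenly distributed hits, the chunks end at different offsets relative to their start.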

tokee commented 9 years ago

Preliminary testing suggests that this adaptive sampling is prone to bad guesses. The working theory is that clusters get over-represented.