tokee / lucene-solr

High cardinality faceting (SOLR-5894)
http://tokee.github.io/lucene-solr/

Guessing of cut-off point seems to be bad #14

Closed tokee closed 9 years ago

tokee commented 9 years ago

For a high facet-value/docs ratio, the guessed cut-off point for DocValues seems to be off. When the sparse faceting guess errs on the low side, it is counted as exceededCutoffs in the stats and results in increased processing time (worse than standard Solr).

This seems to be caused by an incorrect calculation of the total number of references to the terms in the facet field, and is related to issue #9. The same code can be used to calculate the number of references correctly, which should improve the guessing. The increase in startup time should be unnoticeable.

tokee commented 9 years ago

The groundwork for solving this with a general and correct guesser has been laid with 5b7a502. It still needs some duplicate code elimination and the right probability formula instead of the naïve "if each document has 5 terms on average, then 3 documents have 3*5 unique terms" estimation currently used.
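
To illustrate the difference, here is a minimal sketch of one possible replacement for the naïve estimate. It is not the code from 5b7a502. It assumes references are spread uniformly and independently over the unique terms (the classic occupancy model), which real facet fields rarely satisfy exactly. The class name `UniqueTermEstimator` and all method and parameter names are hypothetical.

```java
/**
 * Hypothetical helper, not part of Solr or of the sparse faceting patch.
 * Estimates how many unique terms a result set touches in a facet field.
 */
public final class UniqueTermEstimator {
    private UniqueTermEstimator() {}

    /** Naive estimate from the issue: every reference is assumed to hit a new term. */
    public static double naiveEstimate(long hitCount, double avgRefsPerDoc) {
        return hitCount * avgRefsPerDoc;
    }

    /**
     * Occupancy estimate, ASSUMING uniformly distributed references:
     * expected unique terms = U * (1 - (1 - 1/U)^draws),
     * where draws = hitCount * totalReferences / maxDoc.
     *
     * @param hitCount         number of documents in the result set
     * @param totalReferences  total document-to-term references in the field
     * @param maxDoc           number of documents in the index
     * @param totalUniqueTerms number of unique terms (U) in the field
     */
    public static double occupancyEstimate(long hitCount, long totalReferences,
                                           long maxDoc, long totalUniqueTerms) {
        if (hitCount <= 0 || totalReferences <= 0 || maxDoc <= 0 || totalUniqueTerms <= 0) {
            return 0.0;
        }
        double draws = (double) hitCount * totalReferences / maxDoc;
        double u = (double) totalUniqueTerms;
        // log1p/expm1 keep precision when U is large and 1/U is tiny.
        return -u * Math.expm1(draws * Math.log1p(-1.0 / u));
    }
}
```

With 3 hits, 5 references per document on average and only 10 unique terms, the naïve estimate gives 15 (more than the field contains), while the occupancy estimate gives about 7.9. With 1000 unique terms the two nearly agree, so the correction matters mainly when the result set is large relative to the number of unique terms.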