tokee / lucene-solr

High cardinality faceting (SOLR-5894)
http://tokee.github.io/lucene-solr/

Scale sample factor with result hit count for heuristic faceting #44

Closed tokee closed 9 years ago

tokee commented 9 years ago

Currently, heuristic faceting is controlled by a hitCount-based limit, facet.sparse.heuristic.fraction, and a maxDocuments-based sample size, facet.sparse.heuristic.sample.size.

Because the sample size is constant, the chance of errors is high with low hitCounts and the performance overhead is unnecessarily high with large hitCounts. Deriving the sample size from a function of hitCount, instead of from maxDocuments alone, would remedy this.

If h is hitCount, d is maxDocuments and a and b are empirically determined constants, the formula a*(h/d)+b gives a sample size that is linear in h/d, the hitCount relative to maxDocuments. As sanity checks, turning off heuristics when the sample size is > 50% and forcing a sample size >= 1‰ seems prudent.

Conservative: d = 100M, a = -1.0 and b = 1.0.

Aggressive: d = 100M, a = -1.0 and b = 0.5.

Empirically based example: d = 250M, a = -19 and b = 0.78.
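
A minimal sketch of the formula with the two sanity checks, for illustration only. It assumes the result of a*(h/d)+b is the sample size as a fraction of the index (so 0.25 means 25%), that "turning off heuristics" can be modelled as returning an empty value, and that the 1‰ floor is applied by clamping. The class and method names are hypothetical and are not taken from the SOLR-5894 code.

```java
import java.util.OptionalDouble;

/** Hypothetical helper, not part of the actual patch. */
final class HeuristicSampleScaling {
    // Above this fraction, sampling is assumed not to be worth it (heuristics off).
    static final double MAX_FRACTION = 0.5;
    // Assumed lower bound of 1 per mille for the sample fraction.
    static final double MIN_FRACTION = 0.001;

    /**
     * Computes a * (h / d) + b, where h is hitCount and d is maxDocuments.
     * Returns empty when the result exceeds 50% (heuristics disabled),
     * otherwise the fraction clamped to at least 1 per mille.
     */
    static OptionalDouble sampleFraction(long hitCount, long maxDocuments,
                                         double a, double b) {
        double fraction = a * ((double) hitCount / maxDocuments) + b;
        if (fraction > MAX_FRACTION) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(Math.max(fraction, MIN_FRACTION));
    }
}
```

With the conservative constants (d = 100M, a = -1.0, b = 1.0), 75M hits give -1.0 * 0.75 + 1.0 = 0.25, i.e. a 25% sample, while 10M hits give 0.9, which is above the 50% cut-off, so heuristics would be skipped under these assumptions.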

tokee commented 9 years ago

The parameters for heuristics now make it possible to scale sampling with both hitCount and maxSize, with three different strategies for sampling.