tokee / lucene-solr

High cardinality faceting (SOLR-5894)
http://tokee.github.io/lucene-solr/

Heuristically accurate counts for large result sets #38

Open tokee opened 9 years ago

tokee commented 9 years ago

High-cardinality (#references) facet calls are heavy, and the correctness of the end result can only be guaranteed by a full count. Sampling (see issue #37) can speed this up, at the cost of accuracy.

Normally, sampling will yield the correct terms in the top-X result, but with inaccurate counts. Using sampling to find the top-X terms and then fine-counting those terms with the vanilla Solr field:term search would guarantee accurate counts for the returned terms.
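To make the two phases concrete, here is a minimal in-memory sketch in Python. It is not Solr code: the toy inverted index, the every-Nth-hit sampling and all function names (`build_postings`, `sampled_top_terms`, `fine_count`, `heuristic_facet`) are assumptions made for illustration. The set intersection in `fine_count` stands in for a field:term search restricted to the result set.

```python
from collections import Counter, defaultdict


def build_postings(docs):
    """Toy inverted index: map each term to the set of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            postings[term].add(doc_id)
    return postings


def sampled_top_terms(docs, hits, top_x, stride):
    """Phase 1: count terms in every `stride`-th hit, return the top_x candidates."""
    sample_counts = Counter()
    for doc_id in hits[::stride]:
        sample_counts.update(docs[doc_id])
    return [term for term, _ in sample_counts.most_common(top_x)]


def fine_count(postings, hits, candidates):
    """Phase 2: exact count per candidate term within the full result set."""
    hit_set = set(hits)
    return {term: len(postings[term] & hit_set) for term in candidates}


def heuristic_facet(docs, hits, top_x=10, stride=10):
    """Sampled candidate terms, each paired with its exact count, highest first."""
    postings = build_postings(docs)
    candidates = sampled_top_terms(docs, hits, top_x, stride)
    exact = fine_count(postings, hits, candidates)
    return sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))
```

The counts returned are exact for every returned term. What sampling can still get wrong is which terms make it into the candidate list.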

Normally the field:term searches are relatively slow, but with large result sets they become comparatively fast. As long as the sample size is "large enough", chances are very high that processing will be both fast and correct. It is worth noting that vanilla Solr distributed faceting does not guarantee correctness either, just "probably correct terms, guaranteed correct counts".

tokee commented 9 years ago

@tboeghk (Torsten Bøgh Köster) suggested https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/ConcurrentStreamSummary.java which seems like a very good fit for speeding up faceting on large result sets.
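For context, the stream-summary classes in stream-lib are based on the Space-Saving top-k algorithm. The following is a rough Python sketch of that general idea, not the stream-lib implementation or its API. The class name `SpaceSaving` and its methods are hypothetical, and the linear scan for the smallest counter is a simplification.

```python
class SpaceSaving:
    """Approximate top-k counter that tracks at most `capacity` terms."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # term -> estimated count (never an underestimate)
        self.errors = {}  # term -> maximum possible overestimate

    def offer(self, term):
        if term in self.counts:
            self.counts[term] += 1
        elif len(self.counts) < self.capacity:
            self.counts[term] = 1
            self.errors[term] = 0
        else:
            # Evict the term with the smallest count; the newcomer inherits
            # that count as its possible error.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[term] = floor + 1
            self.errors[term] = floor

    def top(self, k):
        """Return the k terms with the highest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:k]
```

Memory use is bounded by `capacity` rather than by the number of unique terms, which is what makes this kind of structure interesting for high-cardinality fields. The estimated counts could then be fine-counted as described in the issue text.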