tokee / lucene-solr

High cardinality faceting (SOLR-5894)
http://tokee.github.io/lucene-solr/

SparseFacetDistribTest sometimes fails #9

Closed tokee closed 10 years ago

tokee commented 10 years ago

Running the unit test SparseFacetDistribTest sometimes results in a difference between standard Solr and sparse results (look for "Solr fc and sparse results should be equal except for QTime").

NOTE: reproduce with: ant test -Dtestcase=SparseFacetDistribTest -Dtests.method=testDistribSearch -Dtests.seed=92804C2B4358CF2C -Dtests.locale=vi_VN -Dtests.timezone=Brazil/West -Dtests.file.encoding=UTF-8

tokee commented 10 years ago

The unit test SparseFacetDocValuesTest.java has been introduced. It is single-shard, fast to execute and fails with an invalid count for a high-frequency term when executed with

ant test -Dtestcase=SparseFacetDocValuesTest -Dtests.seed=9F6370EDD2D1DB5A -Dtests.locale=ar_SY -Dtests.timezone=Asia/Vientiane -Dtests.file.encoding=UTF-8

tokee commented 10 years ago

The bug seems to stem from the calculation of the maximum term count for DocValues fields. As a temporary measure until it is fixed, setting facet.sparse.packed=false seems to work. This disables the memory-saving part of sparse faceting but does not affect performance.
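For reference, the workaround parameter is just added to an ordinary facet request; the collection and field names below are placeholders:

```
http://localhost:8983/solr/collection1/select?q=*:*
    &facet=true
    &facet.field=author
    &facet.sparse.packed=false
```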

Longer description:

Term counters take up a lot of memory. Standard Solr fc uses an integer for each unique value per concurrent call. For 40M values that is 160MB. Sparse faceting calculates the maximum count for any of the unique values in the facet and uses a PackedInts structure to hold the full structure. If a search for *:* has 'H.C.Andersen' as the most popular author with a count of 3000, that means that at most 12 bits per counter entry is needed (2^12 = 4096). If there are 40M values, that means 60MB per concurrent call. On top of that comes the pointer structure needed to keep track of the sparse counts, which is an integer for 8% of #unique values. With the 40M values, that is 13MB, for a total of 73MB.

As the calculation of the max count is currently unreliable, it is best to disable the PackedInts implementation. facet.sparse.packed=false switches to an implementation that uses integers for counters, just as with standard Solr. With the 40M unique terms example, there is still the overhead of 13MB for the sparse pointer, for a total of 173MB per concurrent search.
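The memory arithmetic in the two paragraphs above can be checked with a small sketch. All names here are illustrative helpers, not actual Solr/Lucene API:

```java
// Sketch of the sparse-faceting memory math described above.
// Helper names are hypothetical; Solr's real sizing code differs.
public class SparseFacetSizing {

    // Bits needed to hold counts up to maxCount, e.g. 3000 -> 12 (2^12 = 4096).
    static int bitsRequired(long maxCount) {
        return 64 - Long.numberOfLeadingZeros(Math.max(1, maxCount));
    }

    // Bytes for a PackedInts-style counter array over uniqueValues entries.
    static long packedCounterBytes(long uniqueValues, long maxCount) {
        return uniqueValues * bitsRequired(maxCount) / 8;
    }

    // Bytes for plain int[] counters (standard Solr fc, or packed=false).
    static long intCounterBytes(long uniqueValues) {
        return uniqueValues * Integer.BYTES;
    }

    // Sparse pointer overhead: one int for 8% of the unique values.
    static long sparsePointerBytes(long uniqueValues) {
        return (long) (uniqueValues * 0.08) * Integer.BYTES;
    }

    public static void main(String[] args) {
        long unique = 40_000_000L;
        System.out.println(bitsRequired(3000));               // 12
        System.out.println(packedCounterBytes(unique, 3000)); // 60000000  (~60MB)
        System.out.println(sparsePointerBytes(unique));       // 12800000  (~13MB)
        System.out.println(intCounterBytes(unique));          // 160000000 (~160MB)
    }
}
```

Packed counters plus pointer give the ~73MB figure; int[] counters plus the same pointer give the ~173MB fallback figure.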

Performance should be about the same: PackedInts requires more processing than int[], but as it takes up less RAM, chances are higher that parts of the array are in the Level 2 CPU cache when accessed. For relatively small amounts of counters, int[] is normally fastest; for relatively large amounts, PackedInts tends to win. It all depends on the concrete corpus.

DmitryKey commented 10 years ago

With facet.sparse.packed=false the counts are where they should be (not a systematic check just yet). The speed is awesome!

tokee commented 10 years ago

The fix was not trivial, neither in implementation nor in the resources needed. It essentially performs a faceting over all live docIDs (akin to the query *:*) the first time the field is used for sparse faceting, and again every time the index is updated. When counting, there is a temporary overhead of int[maxDoc], but that is released for GC before the real counters are allocated, so it should not require a larger heap.
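The calibration pass described above can be sketched roughly as follows. This is an illustrative toy, not the actual patch: it counts term ordinals over all documents to learn the true maximum count, from which the packed counter width can then be derived (the real code works against DocValues and uses an int[maxDoc]-sized scratch structure):

```java
// Illustrative sketch of the maximum-count calibration: facet once over
// all live documents, find the highest per-term count, and size the
// packed counters from it. Names and shapes are hypothetical.
public class MaxCountCalibration {

    // Count every term ordinal across all docs and return the highest count.
    // The temporary counts[] is eligible for GC as soon as this returns,
    // before the real (packed) counters are allocated.
    static int maxTermCount(int[][] ordinalsPerDoc, int numTerms) {
        int[] counts = new int[numTerms];
        for (int[] ords : ordinalsPerDoc) {
            for (int ord : ords) {
                counts[ord]++;
            }
        }
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        // Toy index: 4 docs, 3 unique terms (ordinals 0..2).
        int[][] docs = { {0, 1}, {1}, {1, 2}, {1} };
        int max = maxTermCount(docs, 3);
        System.out.println(max);                           // 4 (ordinal 1 in every doc)
        int bits = 32 - Integer.numberOfLeadingZeros(max);
        System.out.println(bits);                          // 3 bits per counter suffice
    }
}
```

The point of the pass is that the derived bit width is now grounded in the actual index contents rather than an estimate, which is what made the previous maximum-count calculation unreliable.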