tokee / lucene-solr

High cardinality faceting (SOLR-5894)
http://tokee.github.io/lucene-solr/

Ideas for sanity checking in Solr #54

Open tokee opened 8 years ago

tokee commented 8 years ago

The meta-idea is to add a debug-option called "sanity". When enabled, sanity checking of options, index and general setup is performed and potential problems are reported. This Issue collects ideas for checks.

If each check has a telling identifier, it should be possible to disable specific checks, as some can be very heavy to run (such as checking whether a multi-valued field contains only single values).
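A minimal sketch of what identifier-based checks could look like, in Python rather than Solr's Java for brevity. Everything here is hypothetical: the `SanityCheck` type, the `multivalued-single-values` identifier, and the `setup` dictionary (which stands in for statistics that a real check would have to collect by scanning the index) are illustration only, not existing Solr APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional


@dataclass
class SanityCheck:
    check_id: str                          # telling identifier, used to disable the check
    heavy: bool                            # True if the check is expensive to run
    run: Callable[[dict], Optional[str]]   # returns a warning message, or None if all is fine


def check_multivalued_single(setup: dict) -> Optional[str]:
    """Flag fields declared multiValued that never hold more than one value per document."""
    offenders = sorted(
        name
        for name, info in setup["fields"].items()
        if info.get("multiValued") and info.get("max_values_per_doc", 0) <= 1
    )
    if offenders:
        return "multiValued fields holding only single values: " + ", ".join(offenders)
    return None


# Registry of known checks (hypothetical; only one check is sketched here).
CHECKS: List[SanityCheck] = [
    SanityCheck("multivalued-single-values", heavy=True, run=check_multivalued_single),
]


def run_sanity(setup: dict, disabled: FrozenSet[str] = frozenset()) -> Dict[str, str]:
    """Run every check whose identifier is not disabled; map identifier to warning."""
    report: Dict[str, str] = {}
    for check in CHECKS:
        if check.check_id in disabled:
            continue
        message = check.run(setup)
        if message is not None:
            report[check.check_id] = message
    return report


# Hypothetical setup description: 'tags' is multiValued but never has more than one value.
setup = {
    "fields": {
        "tags": {"multiValued": True, "max_values_per_doc": 1},
        "title": {"multiValued": False, "max_values_per_doc": 1},
    }
}
print(run_sanity(setup))
print(run_sanity(setup, disabled=frozenset({"multivalued-single-values"})))
```

The `heavy` flag is not used by `run_sanity` in this sketch; it is there so that a caller could skip all expensive checks at once instead of listing identifiers one by one.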

ufukyilmaz-esen commented 1 year ago

faceting on a field with indexed="false" docValues="true"

What kind of penalty does this one have?

tokee commented 1 year ago

@ufukyilmaz-esen at least at the time of writing, it had severe worst-case penalties with multiple shards, because both filtering (i.e. "clicking on a facet") and the lookups for fine counting in a multi-shard setup required scanning through the DocValues.

Checking...

The documentation for Solr 8.8 states indexed="false" docValues="true" as a recommendation when the field is used for faceting only. That seems somewhat ambiguous: Does "faceting only" mean that one only inspects the result, but never uses it for filtering?

Anyway, I'm on an extended hiatus from Solr hacking so this is as far as I'm investigating now. But thanks for the interest and the critical question. Hopefully someone else can give you a better answer.

ufukyilmaz-esen commented 1 year ago

In our schema many fields are configured with indexed="false" docValues="true". I remember doing research for a long time before deciding on that, but I can't remember where I found the relevant information.

Besides filtering (for which we'd use a companion copyField), the field is queried in the refinement phase. But afaik refinement is optional unless you pass refine: true in the facet query.

Anyway, critical configs like this should have clear explanations in the documentation. Someone may index terabytes of data before realizing they had to set indexed=true for their use case.
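For what it's worth, spotting such fields before indexing is cheap. Below is a rough Python sketch that lists the fields in a schema declared with indexed="false" docValues="true". The function name, the schema snippet and its field names are made up for illustration, and the sketch only looks at attributes set explicitly on `<field>` elements; it assumes nothing is inherited from the fieldType, which a real check would have to resolve.

```python
import xml.etree.ElementTree as ET
from typing import List


def unindexed_docvalues_fields(schema_xml: str) -> List[str]:
    """Return names of <field> elements with indexed="false" and docValues="true".

    Only explicit attributes on the <field> element are considered;
    defaults inherited from the fieldType are ignored in this sketch.
    """
    root = ET.fromstring(schema_xml)
    hits = []
    for field in root.iter("field"):
        if field.get("indexed") == "false" and field.get("docValues") == "true":
            hits.append(field.get("name"))
    return hits


# Hypothetical schema fragment; field names are examples only.
SCHEMA = """
<schema name="example" version="1.6">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="category" type="string" indexed="false" stored="false" docValues="true"/>
  <field name="category_filter" type="string" indexed="true" stored="false" docValues="false"/>
</schema>
"""

for name in unindexed_docvalues_fields(SCHEMA):
    print("sanity: field '%s' is docValues-only; filtering on it may be slow" % name)
```

Whether a hit is actually a problem depends on whether the field is ever filtered on or refined against, which is exactly the part the documentation leaves unclear.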

tokee commented 1 year ago

The internals of classic Solr faceting are really messy, and it does not help that both Streaming and JSON Faceting exist as alternatives. I can understand why the documentation is lacking, and had I the energy, I would try to build a non-trivial multi-shard index to test indexed="false" docValues="true".

As for refine=true, I'm fairly sure it's a JSON Faceting parameter: https://issues.apache.org/jira/browse/SOLR-7452. Classic faceting always refines (or maybe only by default in newer Solr versions?). Even with indexed="true" docValues="true" this comes with quite a penalty for some searches: https://sbdevel.wordpress.com/2014/09/11/even-sparse-faceting-is-limited/
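To make the refine discussion concrete, here is a small Python sketch that builds a JSON Facet request body with refinement switched on or off for a terms facet. The helper name, the `category` field and the `category_counts` facet label are assumptions for illustration; the sketch only constructs the body and does not send anything to Solr.

```python
import json


def terms_facet_request(field: str, query: str = "*:*", limit: int = 10, refine: bool = True) -> dict:
    """Build a JSON request body with one terms facet on `field`.

    `refine` controls whether the facet asks for the extra refinement phase
    in a multi-shard setup.
    """
    return {
        "query": query,
        "limit": 0,  # no documents wanted, only facet counts
        "facet": {
            field + "_counts": {
                "type": "terms",
                "field": field,
                "limit": limit,
                "refine": refine,
            }
        },
    }


# 'category' is a hypothetical field name.
body = terms_facet_request("category")
print(json.dumps(body, indent=2))
```

Comparing response times for `refine=True` and `refine=False` against the same multi-shard collection would be one way to measure how much the refinement phase costs on a docValues-only field.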