mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0

Add more faceting benchmark coverage #137

Open gsmiller opened 3 years ago

gsmiller commented 3 years ago

It looks like the current faceting benchmark tasks focus on "browsing" (i.e., facet counting over a match-all-docs query). It would be nice to extend the faceting benchmark coverage to sparse-hit cases (e.g., queries matching a relatively small number of docs compared to the corpus). We might also want to consider tasks where matching documents have a large number of ordinals (I think that in the current tasks each hit participates in only a small number of faceting ordinals, possibly just one).

rmuir commented 3 years ago

+1. In addition to MatchAllDocsQuery, we could maybe reuse the existing searches (e.g. HighTerm/LowTerm) with faceting.

mikemccand commented 3 years ago

+1, I think the current task language allows faceting on any task? So hopefully the first item (faceting on "normal" queries) is just a matter of adding a few such tasks to the default and nightly tasks files.

For the second idea (increasing cardinality of facet fields) I'm less sure :) Does Wikipedia have some sort of taxonomy label?

mikemccand commented 3 years ago

Maybe we could randomly select a few terms from each document and add them as facet labels? It's not entirely natural, but it should give a roughly Zipfian distribution, with a long tail of rarer labels.
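
A minimal sketch of that idea, in Python and independent of the actual luceneutil indexing code. The function name `pick_facet_labels`, the regex tokenizer, and the choice to sample token occurrences (so frequent terms are picked more often) are all assumptions made for illustration:

```python
import random
import re


def pick_facet_labels(text, count, rng):
    """Pick up to `count` distinct terms from `text` to use as facet labels.

    Hypothetical helper: sampling is done over token occurrences, not over
    distinct terms, so terms that occur often in the text are more likely
    to be chosen, and rare terms form the long tail.
    """
    # Assumed tokenizer: lowercase runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return []
    sampled = rng.sample(tokens, k=min(count, len(tokens)))
    # Deduplicate: the same term may be sampled from two occurrences.
    return sorted(set(sampled))
```

Passing a seeded `random.Random` instance keeps label assignment reproducible across indexing runs, which matters when comparing benchmark results.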

rmuir commented 3 years ago

Rather than adding fake terms, could we consider something like a slightly truncated date? Pick something like the last-modified date and zero out the seconds, minutes, hours, or whatever it takes to get the desired cardinality. Wouldn't that be realistic to facet on?
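
A rough sketch of the truncation idea, in Python and not tied to the luceneutil indexer. The granularity names, the label formats, and the helpers `truncated_date_label` and `cardinality` are hypothetical:

```python
from datetime import datetime

# Assumed granularities: a coarser one yields fewer distinct facet labels.
_FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%dT%H",
    "minute": "%Y-%m-%dT%H:%M",
}


def truncated_date_label(ts, granularity):
    """Drop every date component finer than `granularity` and return a label."""
    return ts.strftime(_FORMATS[granularity])


def cardinality(timestamps, granularity):
    """Count the distinct facet labels produced at the given granularity."""
    return len({truncated_date_label(ts, granularity) for ts in timestamps})
```

Running `cardinality` over a sample of last-modified dates at each granularity would show which truncation level gives the desired number of distinct facet values.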