mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0

Added basic test to verify that faceting works with high cardinality … #156

mdmarshmallow closed 2 years ago

mdmarshmallow commented 2 years ago

…fields, files added in this commit are not accurate benchmarks

Wrote a script that reads the NAD (National Address Database; can be downloaded here: https://www.transportation.gov/gis/national-address-database/national-address-database-nad-disclaimer), then indexes it and runs some basic faceting tests. This is not an accurate benchmark, but it can probably serve as the basis for a real high-cardinality faceting benchmark in the future. It also serves as a test that faceting still works with high-cardinality facets, even if the benchmark timing is not accurate.

This also needs to be used in conjunction with the SSDV hierarchical field changes: https://github.com/apache/lucene/pull/509

mdmarshmallow commented 2 years ago

Yeah, we would also need to wait until LUCENE-10250 is merged to use this :). As for the dataset, you are correct that it is always 4 levels deep. There were ~64M entries from, I think, 10 or 11 different states. ~52M of those entries were unique, so this is very high cardinality. A more detailed description of the dataset can be found here: https://www.transportation.gov/gis/nad/nad-schema
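
To make the shape of the data concrete, here is a minimal Python sketch of how a fixed-depth hierarchy turns into facet paths, and how the number of distinct paths grows with depth. The level names (`state`, `county`, `city`, `address`) and the sample rows are assumptions for illustration only. They are not the NAD schema's column names, and this is not the script from this PR.

```python
from collections import Counter


def facet_path_counts(records, levels):
    """Count records under every prefix of each record's facet path."""
    counts = Counter()
    for record in records:
        path = tuple(record[level] for level in levels)
        # A hierarchical facet counts a record once at each ancestor prefix.
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
    return counts


def cardinality_by_depth(counts):
    """Return the number of distinct facet paths at each depth."""
    per_depth = Counter()
    for path in counts:
        per_depth[len(path)] += 1
    return dict(per_depth)


# Hypothetical 4-level hierarchy; these names are illustrative assumptions.
LEVELS = ("state", "county", "city", "address")

# Made-up sample rows, including one exact duplicate address.
SAMPLE = [
    {"state": "TX", "county": "Travis", "city": "Austin", "address": "1 Main St"},
    {"state": "TX", "county": "Travis", "city": "Austin", "address": "2 Main St"},
    {"state": "TX", "county": "Travis", "city": "Austin", "address": "1 Main St"},
    {"state": "UT", "county": "Salt Lake", "city": "Murray", "address": "9 Elm Ave"},
]

counts = facet_path_counts(SAMPLE, LEVELS)
depths = cardinality_by_depth(counts)

print(counts[("TX",)])  # records under the top-level "TX" value
print(depths)           # distinct paths per depth
```

In this toy sample the top levels have only a couple of distinct values while the deepest level has almost one per record. That is the same pattern the numbers above describe: the leaf level is where nearly all of the cardinality sits.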