Closed mdmarshmallow closed 2 years ago
Yeah, we would also need to wait until LUCENE-10250 is merged to use this :). As for the dataset, you are correct that it is always 4 deep. There were ~64M entries from, I think, 10 or 11 different states. ~52M of those entries were unique, so this is very high cardinality. A deeper description of the dataset can be found here: https://www.transportation.gov/gis/nad/nad-schema
…fields, files added in this commit are not accurate benchmarks
Wrote a script that reads the NAD database (can be downloaded here: https://www.transportation.gov/gis/national-address-database/national-address-database-nad-disclaimer), then indexes it and runs some basic faceting tests. This is not an accurate benchmark, but it can probably be used as the basis for a real high-cardinality faceting benchmark in the future. It also serves as a test to make sure that faceting still works with high-cardinality facets, even if the benchmark timing is not accurate.
Also needs to be used in conjunction with the SSDV hierarchical field changes: https://github.com/apache/lucene/pull/509
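For anyone reading along, here is a rough sketch of what counting over a fixed 4-deep hierarchy amounts to. It is plain Python, not the script in this PR and not Lucene's faceting API. The function names and the sample paths (state, city, ZIP, street address) are made up for illustration, and it assumes each document carries exactly one path.

```python
from collections import Counter


def count_hierarchical_facets(paths):
    """Count documents under every prefix of each hierarchical path.

    Assumes one path per document, so each prefix is counted once per document.
    """
    counts = Counter()
    for path in paths:
        for depth in range(1, len(path) + 1):
            counts[tuple(path[:depth])] += 1
    return counts


def top_children(counts, parent, n):
    """Return the n most frequent direct children of `parent` as (label, count)."""
    parent = tuple(parent)
    children = [
        (path[-1], count)
        for path, count in counts.items()
        if len(path) == len(parent) + 1 and path[:len(parent)] == parent
    ]
    # Sort by descending count, then by label so ties are deterministic.
    return sorted(children, key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical 4-level paths; the real dataset's field choice is not shown here.
SAMPLE = [
    ("NY", "Albany", "12207", "1 Main St"),
    ("NY", "Albany", "12207", "2 Main St"),
    ("NY", "Buffalo", "14201", "5 Elm St"),
    ("TX", "Austin", "78701", "9 Oak Ave"),
]

if __name__ == "__main__":
    counts = count_hierarchical_facets(SAMPLE)
    print(top_children(counts, (), 2))
    print(top_children(counts, ("NY",), 5))
```

With mostly unique leaf values, the number of distinct paths grows almost linearly with the number of documents, which is the case this test is meant to exercise.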