BKD-based shape indexing benchmarks [LUCENE-8452]

mikemccand commented 5 years ago

Initial benchmarking of the new BKD-based shape indexing suggest that searches can be somewhat under-performing. I open this ticket to share the findings and to open a discussion how to speed up the solution.

The first benchmark is done by using the current benchmark in luceneutils for indexing points and search by bounding box. We would expect LatLonShape to be slower that LatLonPoint but still having a good performance. The results of running such benchmark in my computer looks like:

LatLonPoint:

89.717239531 sec to index

INDEX SIZE: 0.5087761553004384 GB

READER MB: 0.6098232269287109

maxDoc=60844404

totHits=221118844

BEST M hits/sec: 72.91056132596746

BEST QPS: 74.19031323419311

LatLonShape:

89.388678805 sec to index

INDEX SIZE: 1.3028179928660393 GB

READER MB: 0.8827085494995117

maxDoc=60844404

totHits=221118844

BEST M hits/sec: 1.0053836784184809

BEST QPS: 1.0230305276205143

A second benchmark has been performed indexing around 10 million 4-side polygons and around 3 million points. Searches are performed using bounding boxes. The results are compared with spatial trees alternatives. Spatial trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:

s2 (Geo3d):

1191.732124301 sec to index part 0

INDEX SIZE: 3.2086284114047885 GB

READER MB: 19.453557014465332

maxDoc=12949519

totHits=705758537

BEST M hits/sec: 13.311369588840462

BEST QPS: 4.243743434150063

quad (JTS):

3252.62925159 sec to index part 0

INDEX SIZE: 4.5238002222031355 GB

READER MB: 41.15725612640381

maxDoc=12949519

totHits=705758357

BEST M hits/sec: 35.54591930673003

BEST QPS: 11.332252412866938

LatLonShape:

30.32712009 sec to index part 0

INDEX SIZE: 0.5627057952806354 GB

READER MB: 0.29498958587646484

maxDoc=12949519

totHits=705758228

BEST M hits/sec: 3.4130465326433357

BEST QPS: 1.0880999177593018

Legacy Jira details

LUCENE-8452 by Ignacio Vera (@iverase) on Aug 09 2018, updated Aug 14 2018 Attachments: BKDperf.pdf, Lake.png, Park.png, River.png

mikemccand commented 5 years ago

I think one of the keys to understand the performance of BKD-based shape indexing is to analyse how 6-d bounding boxes are transformed into 2-d bounding boxes. Our input is a 6d point representing the minimum value and 6d point representing the maximum value of a bounding box and we need to transform it into a 2d bounding box to check the relationship with the query shape. The logic to get the 2d bounding box goes as follow:

In a 6-d points we have three dimensions corresponding to latitudes (0, 2, 4) and the other three for longitude (1,3,5). To build the 2d minimum point of the bounding box from the 6d minimum point, we compute the lat and lon using the minimum value of the dimensions for the corresponding lat/lon entries. To build the 2d maximum point of the bounding box from the 6d maximum point, we compute the lat and lon using the maximum value of the dimensions for the corresponding lat/lon entries.

For example: We have the following bounding box in 6d and the corresponding bounding box in 2d:

Min: (0,0,0,0,0,0) -> (0,0)

Max: (90,90,90,90,90,90) -> (90,90)

Split first dimension in half:

Left tree:

Min: (0,0,0,0,0,0) -> (0,0)

Max: (45,90,90,90,90,90) -> (90,90)

Right tree:

Min: (45,0,0,0,0,0) -> (0,0)

Max: (90,90,90,90,90,90) -> (90,90)

Using left tree, split third dimension in half :

Left-Left tree:

Min: (0,0,0,0,0,0) -> (0,0)

Max: (45,90,45,90,90,90) -> (90,90)

Left-Right tree:

Min: (45,0,45,0,0,0) -> (0,0)

Max: (90,90,90,90,90,90) -> (90,90)

Using left tree, split fifth dimension in half :

Left-Left-Left tree:

Min: (0,0,0,0,0,0) -> (0,0)

Max: (45,90,45,90,45,90) -> (45,90)

Left-Left-Right tree:

Min: (45,0,45,0,45,0) -> (45,0)

Max: (90,90,90,90,90,90) -> (90,90)

We have to split all three dimension for latitude to actually have an effect on the 2-d bounding box. Which means that we need to visit many nodes of the tree before we actually filter out some of the branches.

@nknize, does it make sense to you?

[Legacy Jira: Ignacio Vera (@iverase) on Aug 09 2018]

mikemccand commented 5 years ago

Thanks for putting this together @ivera. Its a great start to benchmarking BKD with data higher than two dimensions. In fact, I don't think we've ever fully benchmarked past two dimensions, and the current POINTS implementation supports up to eight. For this particular geo usecase it would be nice to put together some shape benchmarks with more representative real world data, along with different relation queries, and add to the nightly geo benchmarks. A good representative set I've been looking at for some time is the osm data which contain a fair number of linestrings, closed linestrings, and polygons with number of vertices much higher than 4.

This first benchmarking step certainly highlights the deficiency of the BKD split and merge algorithms for high dimension data. Its a side effect of their naivety; which was chosen simply for ease of implementation. And indexing LatLonShape with only point and bounding box data is going to highlight those deficiencies quite spectacularly - if only due to the dimensional expansion issue (especially with point shapes). There are a lot of optimizations in the literature that could (and should) be applied for well past two dimensions. For example x-tree has some nice heuristics for avoiding the sliver problem that occurs much more frequently in high dimension data. And we could also investigate different encoding methods for LatLonShape 's tessellated triangles. For now, it would be good to capture performance across various datasets and carefully note in the javadocs what tradeoffs should be considered for certain datasets (i.e., don't use LatLonShape if you're only indexing/searching points). In the meantime, we can investigate best approaches for strengthening the BKD split/merge algorithms for the general high dimension use-case (similar to what was done for the 1d usecase) and possibly revisit the idea of a new codec derived from BKD that is intended for the higher dimension usecase.

[Legacy Jira: Nick Knize (@nknize) on Aug 09 2018]

mikemccand commented 5 years ago

What it actually concerns me is that the dimensions are not independent. That has a big impact for the case of points only as the splitting only has effect when all the dependent dimensions have been splitted.

Out of curiosity, we have benchmarks for the 3d case for geo3d. I did run an experiment to see how the BKD tree behaves for different dimensions by running the geo benchmark for different number of points and dimensions. Attached are the resulting plots. The experiment was not run in a control environment so it needs to be interpreted with care.

The most interesting plot is the size of the reader which shows the different implementation in case of 1d and in the case of Nd (N>1).

[Legacy Jira: Ignacio Vera (@iverase) on Aug 09 2018]

mikemccand commented 5 years ago

the dimensions are not independent.

That's what I mean by "...due to the dimensional expansion issue (especially with point shapes)." For point shapes, LatLonShape does the opposite of Principal Component Analysis. Instead of reducing dimensionality it expands it from two to six and subsequently perfectly correlates dimensions 1, 3, 5 and 2, 4, 6. It is not at all optimal and not at all expected to perform anywhere close to a structure optimized for true two dimension point space. In fact it goes against every concept of a optimal structure.

to see how the BKD tree behaves for different dimensions

As I mentioned, the split/merge algorithms for BKD are naive for ease of implementation. Given the lack of heuristics along dimensions I'd expect QPS/HPS to drop as more dimensions are added. Add a high level of correlation between dimensions and I'd expect it to get even worse. Certainly, for high dimension indexing, work will need to be done to optimize the BKD structure. Alternatively, we can work on a separate codec (possibly a R*/X/PR-tree hybrid) derived from the current BKD implementation that is designed for higher dimension space - since the kd structure itself was not designed for such applications. And subsequently restrict BKD to two dimensions only.

[Legacy Jira: Nick Knize (@nknize) on Aug 09 2018]

mikemccand commented 5 years ago

Maybe we can fold these new shape benchmarks into luceneutil so it runs nightly?

[Legacy Jira: Michael McCandless (@mikemccand) on Aug 10 2018]

mikemccand commented 5 years ago

Before moving it into luceneutil I would like to agree in the format for the benchmark datasets. I guess we have two options:

1 ) WKT: org.apache.lucene.geo.Polygon only supports GeoJSON, therefore we would have to add a SimpleWKTPolygonParser to org.apache.lucene.geo package similar to SimpleGeoJSONPolygonParser. WKT seems a more compact structure.

2) GeoJSON: Maybe the price of the verbosity of JSON is not too bad.

[Legacy Jira: Ignacio Vera (@iverase) on Aug 10 2018]

mikemccand commented 5 years ago

For WKT, there's already parsing in Spatial4J, a dependency of spatial-extras.... so the jar's should be there for the lucene-util benchmark to pick up.

This isn't to say Lucene itself should or shouldn't parse WKT but that's a new JIRA for sure and needn't delay the benchmarks.

[Legacy Jira: David Smiley (@dsmiley) on Aug 10 2018]

mikemccand commented 5 years ago

Thinking about this, we can make the format configurable with a flag.

[Legacy Jira: Ignacio Vera (@iverase) on Aug 10 2018]

mikemccand commented 5 years ago

What it actually concerns me is that the dimensions are not independent.

bq. we could also investigate different encoding methods for LatLonShape 's tessellated triangles

Maybe we could make them independent, eg. by indexing (x1, y1, x2 - x1, y2 - y1, x3 - x1, y3 - y1) instead of (x1, y1, x2, y2, x3, y3)? In addition to making dimensions independent, it would also have the nice property that an index that only has points should perform almost as well as LatLonPoint since it would partition the space exactly the same way.

LUCENE-7862 is also a simple change that I would expect to help a lot with such high numbers of dimensions, especially when there is correlation.

[Legacy Jira: Adrien Grand (@jpountz) on Aug 13 2018]

mikemccand commented 5 years ago

+1 @jpountz I'm toying around with that approach a bit and can post some benchmark numbers when I have them.

As a side note (that may be of interest) I went ahead and extracted all linestrings, multilinestrings, and multipolygons from the latest planet OSM snapshot to run some local scale benchmarks and general tests with real world shape data. I converted the data from .pbf to WKT for easy ingest in luceneutil (and already have a WKT parser for LatLonShape - lines and polygons - that I can commit to luceneutil separately if interested). The data is quite large, and very good (real world w/ varying spatial extents, vertex counts, etc). If there is any interest I can extract a smaller set (e.g., 60M shapes to complement the 60M points in geobench) and make available for geo nightly benchmarks.

Here are the numbers for the entire corpus of data:

Type	Count	File Size
`LINESTRING`	157,075,680	88GB
`MULTILINESTRING`	532,043	7.1GB
`MULTIPOLYGON`	351,975,024	164GB

Here are three simple examples of the type of shape data contained in the planet OSM corpus (river, lake, and park polygons):

[Legacy Jira: Nick Knize (@nknize) on Aug 13 2018]

mikemccand commented 5 years ago

This looks like a nice dataset for benchmarking indeed!

[Legacy Jira: Adrien Grand (@jpountz) on Aug 14 2018]

mikemccand commented 5 years ago

we could make them independent, eg. by indexing (x1, y1, x2 - x1, y2 - y1, x3 - x1, y3 - y1) instead of (x1, y1, x2, y2, x3, y3)

+1, it looks promising...

already have a WKT parser for LatLonShape - lines and polygons - that I can commit to luceneutil separately if interested

Is there any reason not to add this utility to Lucene? It looks to me it would be very useful. (Note: We currently have the class SimpleGeoJSONPolygonParser which it seems to me it was added to support the GeoBenchmark for points).

I can extract a smaller set (e.g., 60M shapes to complement the 60M points in geobench)

Awesome, but note that with 60M shapes we might not be able to compare the performance with spatial trees because it probably takes too long to index the data.

[Legacy Jira: Ignacio Vera (@iverase) on Aug 14 2018]

mikemccand / stargazers-migration-test

BKD-based shape indexing benchmarks [LUCENE-8452] #451

Legacy Jira details