Closed: mcwhittemore closed this issue 7 years ago
Expanding on multicell in dynamo
The reason for making our bbox queries simpler and slower in https://github.com/mapbox/cardboard/pull/155 was to allow for paging based on feature id, for all the features in the bbox.
We can make this bbox query much better by using a multi-cell index. Previously we walked away from this because there were failure cases that could result in an incomplete index (and we wouldn't have a way of knowing). Now that dynamo streams is around, we have more flexibility in generating an at-least-eventually-consistent multi-cell index that will allow for paged bbox queries.
This involves a lightweight multi-cell index, and possibly many batchGets for features. This will require higher throughput from the main cardboard table (which I think we can mitigate with better caching in front of dynamo). We probably will want to switch the index on the main table to be based on dataset id + feature id, to better utilize dynamo partitions.
We can do cache invalidations by following the dynamo stream.
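A minimal sketch of what following the stream could look like, assuming a Lambda function subscribed to the table's DynamoDB stream and a hypothetical `cache.del` helper (none of these names are settled):

```js
// Hypothetical Lambda handler wired to the cardboard table's DynamoDB stream.
// Every change drops the cached copy of the affected feature, so the cache is
// invalidated by the stream rather than by the writer.
var cache = require('./cache'); // hypothetical cache client (memcached/redis wrapper)

module.exports.handler = function(event, context, callback) {
  var invalidations = event.Records.map(function(record) {
    var keys = record.dynamodb.Keys;
    // invalidate on INSERT, MODIFY and REMOVE alike
    return cache.del(keys.dataset.S + '/' + keys.id.S);
  });

  Promise.all(invalidations)
    .then(function() { callback(); })
    .catch(callback);
};
```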
> We probably will want to switch the index on the main table to be based on dataset id + feature id, to better utilize dynamo partitions.
Is there a way to do this and still be able to simply list all the features in a dataset?
Just dropping a link to this example of geospatial querying for Elasticsearch: http://www.elasticsearchtutorial.com/spatial-search-tutorial.html
I'm going to start working out a POC for bounding box queries via Elasticsearch. I'm starting with Elasticsearch rather than Lambda and a second dynamo table, since we'll need Elasticsearch for other kinds of indexing anyway.
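For reference, the bbox query I'm picturing would look roughly like this. This is only a sketch using the elasticsearch JS client; it assumes feature geometries are mapped as a geo_shape field called `geometry`, and the index name and coordinates are placeholders:

```js
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// All features in one dataset whose geometry intersects a bounding box.
client.search({
  index: 'features',
  body: {
    query: {
      bool: {
        must: { term: { dataset: 'my-dataset' } },
        filter: {
          geo_shape: {
            geometry: {
              shape: {
                type: 'envelope', // [top-left, bottom-right] in [lon, lat] order
                coordinates: [[-74.1, 40.8], [-73.9, 40.7]]
              },
              relation: 'intersects'
            }
          }
        }
      }
    }
  }
}, function(err, response) {
  if (err) throw err;
  console.log(response.hits.hits);
});
```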
I went through the process of setting up a local Elasticsearch and testing @mick's concept above. It works very well; the only drawback is that it would mean this project would require a second datastore and a pipeline to move that data from dynamo to Elasticsearch. I'm sure we could do this, but it would require adding a bunch more AWS services and I'm not sure we want to do that.
This morning I've been thinking a bit about what doing this with dynamo would look like. In the comment above it's suggested that we create another table, and in past stabs at bbox queries we've used a second GSI, but I think we might be able to do this with one table and no GSI. This would still leave us needing a pipeline, but I think we can handle that with an `autoindex` options param that is off by default.
The cardboard table has a HASH key called `dataset` and a RANGE key called `id`. `dataset` is only ever the dataset id, while `id` is either a feature id (`id!{featureid}`) or the dataset's metadata id (`metadata!{datasetid}`). We could also add a query id (`query!{type}!{name}!{value}!{featureid}`).
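To make the record shapes concrete, a single feature might end up with items roughly like this (the quadkey, ids, and dataset name here are made up for illustration):

```js
// One feature can map to several items that all share the dataset HASH key;
// the RANGE key prefix says what kind of record each one is.
var items = [
  // the feature itself (other attributes elided)
  { dataset: 'my-dataset', id: 'id!feature-1' },
  // the dataset's metadata record
  { dataset: 'my-dataset', id: 'metadata!my-dataset' },
  // one query record per indexed value, pointing back at the feature
  { dataset: 'my-dataset', id: 'query!bbox!quadkey!021!feature-1' }
];
```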
This would let us use an expression like `#dataset = :dataset and begins_with(#id, :query)`, where `:query` looks something like `bbox!quadkey!021!` to find all features that are indexed at `021`, or `bbox!quadkey!021` to find all features that are indexed at or below `021`.
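A sketch of what that query looks like through the DocumentClient (table name and values are placeholders):

```js
var AWS = require('aws-sdk');
var dyno = new AWS.DynamoDB.DocumentClient();

// All features indexed at or below quadkey 021 in one dataset, one page at a time.
dyno.query({
  TableName: 'cardboard',
  KeyConditionExpression: '#dataset = :dataset and begins_with(#id, :query)',
  ExpressionAttributeNames: { '#dataset': 'dataset', '#id': 'id' },
  ExpressionAttributeValues: {
    ':dataset': 'my-dataset',
    // prepend 'query!' here if the range key carries that prefix as described above
    ':query': 'bbox!quadkey!021'
  }
}, function(err, data) {
  if (err) throw err;
  // data.LastEvaluatedKey is the paging cursor for the next request
  console.log(data.Items, data.LastEvaluatedKey);
});
```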
Part of the multi-cell approach requires being able to exclude a quadkey range. I don't currently see a good way to do this other than post-query filters. The problem here is that there is no way (please tell me I'm wrong) to do something like `not begins_with(...)` in any type of expression.
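The post-query filter would be something small like this (illustrative only, assuming the quadkey record layout described above):

```js
// Drop index records that fall inside an excluded quadkey subtree, since the
// key condition expression can't say `not begins_with(...)`.
function excludeQuadkey(items, excluded) {
  return items.filter(function(item) {
    var parts = item.id.split('!');
    var quadkey = parts[parts.indexOf('quadkey') + 1];
    return quadkey.indexOf(excluded) !== 0;
  });
}
```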
Another problem with this idea is that it creates a lot of records per feature: one per thing you want to index...
I'd love thoughts on how to go forward from here. Is adding another service the direction we want to move cardboard in? Maybe `cardboard-search` should be its own project?
If `cardboard-search` were its own project, I think it would come with a `feature-to-search` tool that took the new and old versions of a feature and pushed them to the database, and a `query` tool that retrieved documents out of the database.
Please read https://aphyr.com/posts/323-jepsen-elasticsearch-1-5-0 and his previous blog on it: https://aphyr.com/posts/317-call-me-maybe-elasticsearch
I've personally experienced the majority of these failure modes; they're no fun to recover from, as you typically just have to decide to STONITH and lose all the data on the node.
Also, the ES quorum protocol is flaky under any kind of subpar network conditions.
Cardboard already has serious partitioning issues with our aggregation of an entire dataset in a single HASH key.
Average usage is nowhere near provisioned capacity, but we're still getting throttled requests.
Adding more records per feature with the same HASH key (datasetid) is going to exacerbate this problem.
Also, keep in mind that a BBOX query via...
```
#dataset = :dataset and begins_with(#id, 'bbox!quadkey!021')
```
... only gets you features that are indexed at `021` or smaller. To find features that are indexed by a larger cell, you either need to follow that up with more searches:
```
#dataset = :dataset and begins_with(#id, 'bbox!quadkey!02!')
#dataset = :dataset and begins_with(#id, 'bbox!quadkey!0!')
```
... or you need to index large features at all the sub-cells down to some max zoom. The former approach (more queries) leads to some convoluted pagination, and the latter can lead to an exponential increase in DynamoDB write needs.
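For the more-queries approach, the follow-up prefixes are just the ancestors of the cell you queried at; a rough illustrative helper:

```js
// Every ancestor of the queried cell gets its own begins_with prefix, so large
// features indexed higher up the quadkey tree are found too.
function ancestorPrefixes(quadkey) {
  var prefixes = [];
  for (var i = quadkey.length - 1; i > 0; i--) {
    prefixes.push('bbox!quadkey!' + quadkey.slice(0, i) + '!');
  }
  return prefixes;
}

ancestorPrefixes('021');
// => [ 'bbox!quadkey!02!', 'bbox!quadkey!0!' ]
```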
I'm very gung-ho about trying out the DynamoDB → Elasticsearch pipeline. I'm aware that it could be a dead end, but I'd like to see that dead end before dropping the concept.
At the end of the day, DynamoDB loses usefulness the more important aggregation is to your application, and it gets worse when you want to be able to locate the extremes in any aggregation. There have been a number of discussions in cardboard and other systems where I've felt that DynamoDB can make a lot of sense as scalable write storage, but the details of application querying demand some other store for aggregation and indexing on more axes than DynamoDB can realistically ($$$) support.
I'm going to build this search functionality as an extension to cardboard. Pushing my POC work to https://github.com/mapbox/cardboard-search.
This is going to be done via the https://github.com/mapbox/cardboard-geospatial-queries repo.
@mick recently wrote up some options for how we can fix bounding box queries. I'm posting it here to help keep the convo going.