Closed: mcwhittemore closed this issue 7 years ago
Expanding on multicell in dynamo
The reason for making our bbox queries simpler and slower in https://github.com/mapbox/cardboard/pull/155 was to allow for paging based on feature id, for all the features in the bbox.
We can make this bbox query much better by using a multi-cell index. Previously we walked away from this because there were failure cases that could result in an incomplete index (and we wouldn't have a way of knowing). Now that dynamo streams is around, we have more flexibility in generating an at-least-eventually-consistent multi-cell index that will allow for paged bbox queries.
This involves a lightweight multi-cell index, and possibly many batchGets for features. This will require higher throughput from the main cardboard table (which I think we can mitigate with better caching in front of dynamo). We probably will want to switch the index on the main table to be based on dataset id + feature id, to better utilize dynamo partitions.
We can do cache invalidations by following the dynamo stream.
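A minimal sketch of what following the stream could look like, assuming a Lambda function subscribed to the table's DynamoDB stream and a hypothetical `cache.del` helper (none of these names are settled):

```js
// Hypothetical Lambda handler wired to the cardboard table's DynamoDB stream.
// Every change drops the cached copy of the affected feature, so the cache is
// invalidated by the stream rather than by the writer.
var cache = require('./cache'); // hypothetical cache client (memcached/redis wrapper)

module.exports.handler = function(event, context, callback) {
  var invalidations = event.Records.map(function(record) {
    var keys = record.dynamodb.Keys;
    // invalidate on INSERT, MODIFY and REMOVE alike
    return cache.del(keys.dataset.S + '/' + keys.id.S);
  });

  Promise.all(invalidations)
    .then(function() { callback(); })
    .catch(callback);
};
```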
> We probably will want to switch the index on the main table to be based on dataset id + feature id, to better utilize dynamo partitions.
Is there a way to do this and still be able to simply list all the features in a dataset?
Just dropping a link to this example of geospatial querying for Elasticsearch: http://www.elasticsearchtutorial.com/spatial-search-tutorial.html
I'm going to start working out a POC for bounding box queries via Elasticsearch. I'm starting with Elasticsearch rather than Lambda and a second dynamo table, since we'll need Elasticsearch for other kinds of indexing anyway.
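For reference, the bbox query I'm picturing would look roughly like this. This is only a sketch using the elasticsearch JS client; it assumes feature geometries are mapped as a geo_shape field called `geometry`, and the index name and coordinates are placeholders:

```js
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// All features in one dataset whose geometry intersects a bounding box.
client.search({
  index: 'features',
  body: {
    query: {
      bool: {
        must: { term: { dataset: 'my-dataset' } },
        filter: {
          geo_shape: {
            geometry: {
              shape: {
                type: 'envelope', // [top-left, bottom-right] in [lon, lat] order
                coordinates: [[-74.1, 40.8], [-73.9, 40.7]]
              },
              relation: 'intersects'
            }
          }
        }
      }
    }
  }
}, function(err, response) {
  if (err) throw err;
  console.log(response.hits.hits);
});
```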
I went through the process of setting up a local Elasticsearch and testing @mick's concept above. It works very well; the only drawback is that it would mean this project would require a second datastore and a pipeline to move that data from dynamo to Elasticsearch. I'm sure we could do this, but it would require adding a bunch more AWS services and I'm not sure we want to do that.
This morning I've been thinking a bit about what doing this with dynamo would look like. In the comment above it's suggested that we create another table, and in past stabs at bbox queries we've used a second GSI, but I think we might be able to do this with one table and no GSI. This would still leave us needing a pipeline, but I think we can handle that with an `autoindex` options param that is off by default.
The cardboard table has a HASH key called `dataset` and a RANGE key called `id`. `dataset` is only ever the dataset id, while `id` is either a feature id (`id!{featureid}`) or the dataset's metadata id (`metadata!{datasetid}`). We could also add a query id (`query!{type}!{name}!{value}!{featureid}`).
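To make the record shapes concrete, a single feature might end up with items roughly like this (the quadkey, ids, and dataset name here are made up for illustration):

```js
// One feature can map to several items that all share the dataset HASH key;
// the RANGE key prefix says what kind of record each one is.
var items = [
  // the feature itself (other attributes elided)
  { dataset: 'my-dataset', id: 'id!feature-1' },
  // the dataset's metadata record
  { dataset: 'my-dataset', id: 'metadata!my-dataset' },
  // one query record per indexed value, pointing back at the feature
  { dataset: 'my-dataset', id: 'query!bbox!quadkey!021!feature-1' }
];
```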
This would let us use an expression like `#dataset = :dataset and begins_with(#id, :query)`, where `:query` looks something like `bbox!quadkey!021!` to find all features that are indexed at `021`, or `bbox!quadkey!021` to find all features that are indexed at or below `021`.
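A sketch of what that query looks like through the DocumentClient (table name and values are placeholders):

```js
var AWS = require('aws-sdk');
var dyno = new AWS.DynamoDB.DocumentClient();

// All features indexed at or below quadkey 021 in one dataset, one page at a time.
dyno.query({
  TableName: 'cardboard',
  KeyConditionExpression: '#dataset = :dataset and begins_with(#id, :query)',
  ExpressionAttributeNames: { '#dataset': 'dataset', '#id': 'id' },
  ExpressionAttributeValues: {
    ':dataset': 'my-dataset',
    // prepend 'query!' here if the range key carries that prefix as described above
    ':query': 'bbox!quadkey!021'
  }
}, function(err, data) {
  if (err) throw err;
  // data.LastEvaluatedKey is the paging cursor for the next request
  console.log(data.Items, data.LastEvaluatedKey);
});
```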
Part of the multi-cell approach requires being able to exclude a quadkey range. I don't currently see a good way to do this other than post-query filters. The problem here is that there is no way (please tell me I'm wrong) to do something like `not begins_with(...)` in any type of expression.
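The post-query filter would be something small like this (illustrative only, assuming the quadkey record layout described above):

```js
// Drop index records that fall inside an excluded quadkey subtree, since the
// key condition expression can't say `not begins_with(...)`.
function excludeQuadkey(items, excluded) {
  return items.filter(function(item) {
    var parts = item.id.split('!');
    var quadkey = parts[parts.indexOf('quadkey') + 1];
    return quadkey.indexOf(excluded) !== 0;
  });
}
```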
Another problem with this idea is that it creates a lot of records per feature: one per thing you want to index...
I'd love thoughts on how to go forward from here. Is adding another service the direction we want to move cardboard in? Maybe `cardboard-search` should be its own project?
If `cardboard-search` were its own project, I think it would come with a `feature-to-search` tool that took the new and old versions of a feature and pushed them to the database, and a `query` tool that retrieved documents out of the database.
Please read https://aphyr.com/posts/323-jepsen-elasticsearch-1-5-0 and his previous blog on it: https://aphyr.com/posts/317-call-me-maybe-elasticsearch
I've personally experienced the majority of these failure modes; they're no fun to recover from, as you typically just have to decide to STONITH and lose all the data on the node.
Also, the ES quorum protocol is flaky under any kind of subpar network conditions.
Cardboard already has serious partitioning issues with our aggregation of an entire dataset in a single HASH key.
Average usage is nowhere near provisioned capacity, but we're still getting throttled requests.
Adding more records per feature with the same HASH key (datasetid) is going to exacerbate this problem.
Also, keep in mind that a BBOX query via...
```
#dataset = :dataset and begins_with(#id, 'bbox!quadkey!021')
```
... only gets you features that are indexed at `021` or smaller. To find features that are indexed by a larger cell, you either need to follow that up with more searches:
```
#dataset = :dataset and begins_with(#id, 'bbox!quadkey!02!')
#dataset = :dataset and begins_with(#id, 'bbox!quadkey!0!')
```
... or you need to index large features at all the sub-cells down to some max zoom. The former approach (more queries) leads to some convoluted pagination, and the latter can lead to an exponential increase in DynamoDB write needs.
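For the more-queries approach, the follow-up prefixes are just the ancestors of the cell you queried at; a rough illustrative helper:

```js
// Every ancestor of the queried cell gets its own begins_with prefix, so large
// features indexed higher up the quadkey tree are found too.
function ancestorPrefixes(quadkey) {
  var prefixes = [];
  for (var i = quadkey.length - 1; i > 0; i--) {
    prefixes.push('bbox!quadkey!' + quadkey.slice(0, i) + '!');
  }
  return prefixes;
}

ancestorPrefixes('021');
// => [ 'bbox!quadkey!02!', 'bbox!quadkey!0!' ]
```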
I'm very gung-ho about trying out the DynamoDB → Elasticsearch pipeline. I'm aware that it could be a dead end, but I'd like to see that dead end before dropping the concept.
At the end of the day, DynamoDB loses usefulness the more important aggregation is to your application, and it gets worse when you want to be able to locate the extremes in any aggregation. There have been a number of discussions in cardboard and other systems where I've felt that DynamoDB can make a lot of sense as scalable write storage, but the details of application querying demand some other store for aggregation and indexing on more axes than DynamoDB can realistically ($$$) support.
I'm going to build this search functionality as an extension to cardboard. Pushing my POC work to https://github.com/mapbox/cardboard-search.
This is going to be done via the https://github.com/mapbox/cardboard-geospatial-queries repo.
@mick recently wrote up some options for how we can fix bounding box queries. I'm posting it here to help keep the convo going.