OK. This is getting pretty close. I think `metadata` and adding `bulk` back are all that's left before doing some benching with cardboard-hammer.
Might want to add tests to make sure that both tables are getting cleaned out when a feature is deleted.
meh... a good `batch` is going to be hard to write. The more I think about this, the more I think we should either drop `batch` (bad idea) or always do batch operations. Having two main files that conditionally write to two tables does not sound very maintainable. For now, I'm going to stub `batch` to call the single-feature methods many times. Once we've confirmed that this is fast enough and scales well, I'll start converting to a 100% batch api.
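For reference, a minimal sketch of what that stub could look like, assuming a hypothetical single-feature `putFeature(feature, dataset, callback)` helper and using d3-queue to bound concurrency (neither is cardboard's actual internals):

```js
// Hypothetical sketch: stub the batch put by fanning out to the single-feature method.
// putFeature(feature, dataset, callback) is an assumed helper, not cardboard's real code.
var queue = require('d3-queue').queue;

function batchPut(collection, dataset, putFeature, callback) {
  var q = queue(10); // cap concurrency so we don't hammer DynamoDB

  collection.features.forEach(function(feature) {
    q.defer(putFeature, feature, dataset);
  });

  // fires once every deferred write has finished, or on the first error
  q.awaitAll(callback);
}
```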
/cc @rclark
Yesterday, @rclark suggested that we break the writes to the `main-index` away from the writes to the `list-index` (called `features` and `search` in this PR).
@rclark is suggesting this because we cannot guarantee that two writes to two different dynamo tables both succeed. If we first write to the `main-index` and then write to the `list-index`, and one of them fails while the other succeeds, we can't guarantee that the delete needed to roll back the successful write will work. If that delete fails, we are stuck in a situation where our indexes don't agree with each other. This state would always break the list action and would sometimes break the get action.
@rclark's suggestion is to move the `list-index` write to a DynamoDB stream handler. DynamoDB streams guarantee order and do not expire an event until it is handled successfully (or 24 hours pass), so the write to the `list-index` will be retried over and over again if something is wrong with that table.
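To make the idea concrete, here is a rough sketch of what a stream-driven `list-index` writer could look like. The table name and item shapes are placeholder assumptions, not this PR's schema, and a real handler would derive the `list-index` item from the `main-index` image rather than copying it verbatim:

```js
// Rough sketch of a handler fed by the main-index table's DynamoDB stream.
// 'list-index' and the item shapes are placeholders, not the schema in this PR.
// Assumes the stream is configured to include new images (NEW_IMAGE view type).
var AWS = require('aws-sdk');
var dynamo = new AWS.DynamoDB();

exports.handler = function(event, context, callback) {
  var records = event.Records.slice();

  (function next(err) {
    if (err) return callback(err); // failing the invocation makes the stream redeliver this batch
    var record = records.shift();
    if (!record) return callback();

    if (record.eventName === 'REMOVE') {
      dynamo.deleteItem({ TableName: 'list-index', Key: record.dynamodb.Keys }, next);
    } else {
      // INSERT and MODIFY both turn into writes against the list-index
      dynamo.putItem({ TableName: 'list-index', Item: record.dynamodb.NewImage }, next);
    }
  })();
};
```

Because any failed write errors out the whole invocation and the stream redelivers the batch, this gives the retry-until-success behaviour described above without any bookkeeping on our side.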
The main problem I see with this approach is that it will make deleting all of a dataset's data complicated. At best you'd need a stream handler on the `list-index` that makes possibly no-op requests to the `main-index`.
Another option is to suggest that users of cardboard run an out-of-sync delete process. This would be some external task that looks at `metadata` records in cardboard and compares them to a master database that is not managed by cardboard (cardboard doesn't currently manage a master list of datasets). By finding datasets that have `metadata` records but are not in the user's master database, they would know which datasets need to be cleaned up.

I think cardboard's modules would all need to offer a clean-up function. This doesn't get around the problem of needing `cardboard.list` to delete the features in the `main-index`, but it does start to solve that problem.
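As a very rough illustration of that external task (every name here is hypothetical; neither `listMetadataDatasets` nor `getMasterDatasets` is a real cardboard API):

```js
// Hypothetical out-of-band cleanup pass. listMetadataDatasets stands in for "ask
// cardboard for the datasets it has metadata records for" and getMasterDatasets for
// "ask the user's own master database which datasets it knows about".
function findOrphanedDatasets(listMetadataDatasets, getMasterDatasets, callback) {
  listMetadataDatasets(function(err, cardboardDatasets) {
    if (err) return callback(err);
    getMasterDatasets(function(err, masterDatasets) {
      if (err) return callback(err);

      // datasets cardboard has metadata for, but the master database has never
      // heard of, are the ones that need to be cleaned up
      var orphaned = cardboardDatasets.filter(function(id) {
        return masterDatasets.indexOf(id) === -1;
      });

      callback(null, orphaned);
    });
  });
}
```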
Anyway, this needs more thought, but I do agree that streams solve the guarantee problem, so we should keep working through these extra problems. My biggest fear is that moving to streams forces a bunch of overhead on us.
In talking with @rclark, he pointed out that we can avoid two streams by always deleting from the `main-index` first. Let's call not seeing that option 👶-brain.
Replaced by #186
The goal of this PR is to resolve hot partition problems as outlined in https://github.com/mapbox/cardboard/issues/184.
TODO
Things changed
- `geobuf@3.0.0`, which requires node@4.5.0
- `createTable` to `createTables`, and removed name overrides
- `list` with no callback to `listStream`. If we want to keep the old functionality, we can have `list` call `listStream`; I just wanted to clean up the code paths in that function.
- `batch.remove` to `batch.del` to match the non-batch api
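A hypothetical before/after of the renamed pieces (signatures and config keys are assumed from the names above, not taken from the code):

```js
var Cardboard = require('cardboard');
// config keys here are illustrative only
var cardboard = Cardboard({ region: 'us-east-1' });

// old: list with no callback returned a stream; new: that lives on listStream
cardboard.listStream('my-dataset')
  .on('data', function(feature) { /* handle one feature at a time */ })
  .on('end', function() { /* all features seen */ });

// list with a callback still buffers the whole collection
cardboard.list('my-dataset', function(err, collection) { /* ... */ });

// old: cardboard.batch.remove(ids, dataset, cb); new: batch.del to match the non-batch api
cardboard.batch.del(['feature-1', 'feature-2'], 'my-dataset', function(err) { /* ... */ });
```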