kite-sdk / kite

Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0
394 stars 262 forks source link

CDK-409: support for immutable partitions #333

Closed rbrush closed 9 years ago

rbrush commented 9 years ago

Pull request for CDK-409: https://issues.cloudera.org/browse/CDK-409

rbrush commented 9 years ago

Some build failures are happening on Travis that didn't occur locally. I'm looking into it.

Edit: the failure was an issue in the test, which is now corrected.

rbrush commented 9 years ago

In case anyone is looking at this, I think I will tweak the Dataset.merge logic further before we want to actually pull this into master. Right now each partition will appear atomically during a merge, but I'd like to further strengthen this so if an update contains a parent with multiple children partitions, and the target doesn't have the parent partition, then they all appear atomically.

This would help for partitions like ///, where all partitions under a // would appear atomically. (We have use cases where this would be useful.)

rdblue commented 9 years ago

I'll take a look at this as soon as I can. Thanks, Ryan!

rdblue commented 9 years ago

Right now each partition will appear atomically during a merge, but I'd like to further strengthen this so if an update contains a parent with multiple children partitions, and the target doesn't have the parent partition, then they all appear atomically

I think this is critical to the feature. In fact, I don't think it is safe to add immutable partitions unless all show up with one atomic move. Otherwise you could end up in a wedged state where some partitions created by an import job successfully showed up, but one doesn't. Then the job can't be re-run.

tomwhite commented 9 years ago

One way of merging multiple partitions is to check that all of the renames would succeed (by checking that none of the targets already exist) before doing any of the actual renames. This "optimistic merge" is what Hive does for its immutable partition handling, so Kite would be consistent with that.

rbrush commented 9 years ago

This pull request does check if all renames would work as it stands, but of course since there isn't a way to rename multiple directories atomically it's still possible we'd have a failure after some were renamed, but others weren't. The only way around this ugly partial success scenario would be to restrict writers to only write to a single partition in a job (and hence a single, atomic rename to "commit" the changes). I don't think we want to be that restrictive, though, so we might have to use a best effort approach: do all of the exists checks before any renames, and do the merge in the minimal number of rename operations. While partial success would still be possible, we limit it to failures between rename operations.

I added a comment to the JIRA to this effect in case it's best to have the discussion over there:

https://issues.cloudera.org/browse/CDK-409

rdblue commented 9 years ago

Thanks, good to know what Hive is currently doing. I like the plan to make sure the entire operation is expected to succeed before starting the merge. That should take care of most cases.

Then how should we handle the case where some data already exists after an attempt fails? It would be nice to have an option to make the operation succeed and copy just the directories that didn't make it the first time. With that turned on or another option, maybe we should move the conflicting directories to hidden paths that could be cleaned up later? Something like .day=10.rejected

tomwhite commented 9 years ago

@rbrush I'm not sure how I missed that, but I'm glad to see that you've implemented it like that.

rbrush commented 9 years ago

I had rebased to create a clean request, but the previous version is preserved here: https://github.com/rbrush/kite/tree/CDK-409-old

This iteration is likely to change as well based on discussion on the JIRA (https://issues.cloudera.org/browse/CDK-409), but I'll leave it up for the time being for reference.

rbrush commented 9 years ago

Closing this pull request per discussion on CDK-409. We'll be taking a different approach here.