treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

Import: Remove the need for an import branch #5875

Closed N-o-Z closed 1 year ago

N-o-Z commented 1 year ago

Investigate the option to commit directly into the target branch instead of creating an intermediary branch.

N-o-Z commented 1 year ago

Suggestion 1: Create a dangling commit, then use cherry-pick to get it onto the target branch.

Advantages:

  1. Removes the need for an intermediary branch
  2. Allows importing to a non-empty branch
  3. Doesn't require a merge commit

Disadvantages:

  1. Creating a dangling commit is currently blocked for a commit with no parents
  2. Will not work on dirty branches
  3. No tracking of changes from the source, i.e. every import is treated as standalone
  4. Will fail if we try to import paths that exist on the target branch and have changed

Mitigation:

  Issue 1: Currently the graveler AddCommit has no usages, so we can remove the check for parents.
  Issue 4: We could align the cherry-pick so that it can overwrite existing data by passing a merge strategy, which would align with the requirements in https://github.com/treeverse/lakeFS/issues/5780
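To make issue 4 and its mitigation concrete, here is a minimal toy sketch in Python. It is not graveler code: the `Tree` dict, `three_way_merge`, `MergeConflict`, and the strategy names `"source-wins"` / `"dest-wins"` are all hypothetical, chosen only for illustration. A parentless dangling commit is modeled as having an empty base, so any imported path that already exists on the target with different content is a conflict unless a strategy is passed.

```python
from typing import Dict, Optional

# Toy stand-in for a commit's contents: path -> object identity (e.g. a checksum).
Tree = Dict[str, str]

STRATEGIES = (None, "source-wins", "dest-wins")  # hypothetical names


class MergeConflict(Exception):
    """Raised when both sides changed the same path and no strategy was given."""


def three_way_merge(base: Tree, source: Tree, dest: Tree,
                    strategy: Optional[str] = None) -> Tree:
    """Apply the changes source made relative to base onto dest."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    result = dict(dest)
    for path in set(base) | set(source) | set(dest):
        b, s, d = base.get(path), source.get(path), dest.get(path)
        if s == b or s == d:
            continue  # source did not touch the path, or both sides already agree
        if d != b:
            # Both sides changed the path, and in different ways.
            if strategy is None:
                raise MergeConflict(path)
            if strategy == "dest-wins":
                continue  # keep whatever dest has
            # "source-wins" falls through and overwrites dest
        if s is None:
            result.pop(path, None)  # source deleted the path
        else:
            result[path] = s
    return result


imported = {"data/a": "v2", "data/b": "v1"}   # contents of the import
target = {"data/a": "v1", "other/c": "v1"}    # non-empty target branch
merged = three_way_merge({}, imported, target, strategy="source-wins")
```

Without a strategy the same call raises `MergeConflict("data/a")`, which is the failure described in disadvantage 4; with the source-side strategy the imported version overwrites the existing one.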

@ozkatz, @nopcoder, @itaiad200, @arielshaqed - will be happy to hear your input!

nopcoder commented 1 year ago

@N-o-Z I like this better than the current solution. I'd like to suggest an improvement, as I'd prefer to get the same result without the dangling commit step. The new import produces a local database with all the data to import. I suggest doing a single-sided merge of this data into the final import commit. This does not keep any dangling commit and should perform faster. So it depends not on the cherry-pick feature, but on the merge capability.

N-o-Z commented 1 year ago

Great idea, but this means we'll need to expose some merge function that accepts an iterator as the source instead of a metarange ID.
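As a rough illustration of what a merge that takes an iterator as its source could look like, here is a small Python sketch. It is a toy model, not the graveler API: `Entry` and `overlay_sorted` are hypothetical names, and the sketch assumes both streams are sorted by path and that the source entry wins when a path appears on both sides.

```python
from typing import Iterable, Iterator, Tuple

Entry = Tuple[str, str]  # (path, object identity); hypothetical shape


def overlay_sorted(source: Iterable[Entry], dest: Iterable[Entry]) -> Iterator[Entry]:
    """Stream-merge two path-sorted inputs; on equal paths the source entry wins."""
    src, dst = iter(source), iter(dest)
    s, d = next(src, None), next(dst, None)
    while s is not None and d is not None:
        if s[0] < d[0]:
            yield s
            s = next(src, None)
        elif s[0] > d[0]:
            yield d
            d = next(dst, None)
        else:
            yield s  # same path on both sides: take the source entry
            s, d = next(src, None), next(dst, None)
    # Drain whichever side still has entries.
    while s is not None:
        yield s
        s = next(src, None)
    while d is not None:
        yield d
        d = next(dst, None)


imported = iter([("data/a", "v2"), ("data/c", "v1")])  # e.g. rows read from the local import database
branch = [("data/a", "v1"), ("data/b", "v1")]
merged = list(overlay_sorted(imported, branch))
```

The point of the iterator form is that the imported data never has to be materialized as a commit or metarange first; the merge consumes it one entry at a time.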

N-o-Z commented 1 year ago

The iterator-based implementation turned out to be quite difficult because of tangled package dependencies. We decided to go with an intermediate solution:

  1. Create ranges and metarange
  2. Perform a merge with the metarange as the source (instead of a dangling commit), using the src-wins merge strategy
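The two steps above can be sketched with a toy model in Python. This is only an illustration under stated assumptions, not lakeFS code: `write_ranges`, `merge_metarange_src_wins`, the in-memory `store` dict, and the fixed `range_size` are all hypothetical, and the sketch ignores deletions and any common base commit.

```python
import hashlib
import json
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (path, object identity); hypothetical shape


def _content_id(obj) -> str:
    """Content-address a JSON-serializable value."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def write_ranges(entries: List[Entry], range_size: int, store: Dict[str, list]) -> str:
    """Step 1: split sorted entries into ranges, then write a metarange listing them.

    Returns the metarange ID.
    """
    ordered = sorted(entries)
    range_ids = []
    for i in range(0, len(ordered), range_size):
        chunk = ordered[i:i + range_size]
        range_id = _content_id(chunk)
        store[range_id] = chunk
        range_ids.append(range_id)
    metarange_id = _content_id(range_ids)
    store[metarange_id] = range_ids
    return metarange_id


def merge_metarange_src_wins(metarange_id: str, dest: Dict[str, str],
                             store: Dict[str, list]) -> Dict[str, str]:
    """Step 2: apply every entry reachable from the metarange onto dest.

    The source entry overwrites dest on any shared path.
    """
    result = dict(dest)
    for range_id in store[metarange_id]:
        for path, identity in store[range_id]:
            result[path] = identity
    return result


store: Dict[str, list] = {}
metarange_id = write_ranges([("data/b", "v1"), ("data/a", "v2"), ("data/c", "v1")], 2, store)
branch_after = merge_metarange_src_wins(metarange_id, {"data/a": "v1", "other/z": "v1"}, store)
```

In this model the merge source is identified by a metarange ID alone, so no commit (dangling or otherwise) has to exist for the imported data.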