treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

Enhancement request - move or copy datasets inside a repo #6015

Open iddoavn opened 1 year ago

iddoavn commented 1 year ago

Scenario: One team member changes a few tables on their own branch. That member then wants to expose one table, and only that table (which may not be committed yet), to a different team member working on a different branch.

It would be good if we could use a lakectl CP to copy a dataset from one branch to another. This, of course, should be a zero-clone copy. Maybe even to a different repo?

Another use case is a lakectl MV that basically renames a dataset. This is also an added capability on top of an object store, where achieving something like this requires a long and potentially expensive exercise of downloading and uploading data.

itaiad200 commented 1 year ago

Here are a few options that might fulfill the requirement:

  1. Commit the table in the source branch and then cherry-pick that commit into the destination branch.
  2. Use aws s3 cp s3://repo/branch-source/table s3://repo/branch-dest/table. This isn't a zero-clone copy, but it doesn't download and upload the data either. It performs an object-store-side copy, i.e. the copied objects never go through the client or lakeFS itself. You can use aws s3 mv ... to get the MV functionality, which starts with a similar copy followed by a deletion.
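
Option 2 could be scripted roughly as follows. This is only a sketch: the helper name `build_s3_copy_command`, the endpoint URL `https://lakefs.example.com`, and the use of `--recursive` (on the assumption that the table is a prefix holding many objects) are illustrative and not taken from this thread. It only builds and prints the command; it does not run it.

```python
import shlex


def build_s3_copy_command(repo, src_branch, dst_branch, table, endpoint_url, move=False):
    """Build an `aws s3 cp` (or `aws s3 mv`) command for a table prefix.

    Assumptions: the lakeFS S3 gateway is reachable at `endpoint_url`, and the
    table is a prefix containing many objects (hence `--recursive`).
    """
    action = "mv" if move else "cp"
    src = f"s3://{repo}/{src_branch}/{table}/"
    dst = f"s3://{repo}/{dst_branch}/{table}/"
    return ["aws", "s3", action, src, dst, "--recursive", "--endpoint-url", endpoint_url]


# Hypothetical names, mirroring the URIs used in the comment above.
cmd = build_s3_copy_command(
    "repo", "branch-source", "branch-dest", "table", "https://lakefs.example.com"
)
print(shlex.join(cmd))
```

Passing `move=True` produces the `aws s3 mv` form, which, as noted above, is a copy followed by a deletion. To actually execute the command, the list could be handed to `subprocess.run(cmd, check=True)`.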

The option to zero-copy uncommitted objects through lakeFS (i.e. without using a merge, commit, cherry-pick, etc.) was removed not long ago. The reasoning was to let garbage collection (GC) clean up safely without risking data loss.

iddoavn commented 1 year ago

I think that makes a lot of sense for uncommitted data. But for committed data, it would be good to have a copy, because a commit may include more changes than the ones you want to copy over.

Nevertheless, I agree that cherry-pick is helpful in many cases, especially if you have good commit hygiene.

idanovo commented 1 year ago

@ozkatz please prioritize

github-actions[bot] commented 10 months ago

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

github-actions[bot] commented 10 months ago

Closing this issue because it has been stale for 7 days with no activity.