treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0
4.48k stars, 359 forks

Enhancement request - move or copy datasets inside a repo #6015

Open iddoavn opened 1 year ago

iddoavn commented 1 year ago

Scenario: One member of the team changes a few tables on their own branch. That member then wants to expose one of those tables, and only that one (it may not be committed yet), to a teammate working on a different branch.

It would be good to have a lakectl CP that copies a dataset from one branch to another. This should, of course, be a zero-clone copy. Maybe it could even copy to a different repo?

Another use case is a lakectl MV that simply renames a dataset. This would also be an added capability on top of a plain object store, where achieving the same thing means a long and potentially expensive exercise of downloading and re-uploading the data.

itaiad200 commented 1 year ago

Here are a few options that might fulfill the requirement:

  1. Commit the table in the source branch and then cherry-pick that commit to the destination branch.
  2. Use aws s3 cp s3://repo/branch-source/table s3://repo/branch-dest/table. This isn't a zero-clone copy, but it doesn't download and upload the data either. It performs an object-store-side copy, i.e. the copied objects never go through the client or lakeFS itself. You can use aws s3 mv ... to get the MV functionality, which performs a similar copy followed by a deletion.
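As a rough illustration of the two options above, here is a minimal Python sketch that only builds the command lines and prints them in dry-run mode. It assumes the AWS CLI is already configured with lakeFS credentials, and that lakectl is installed and configured. The endpoint URL, the helper names, and the commit message are hypothetical placeholders, not anything defined by lakeFS. Note that lakectl commit takes no path filter, so option 1 only isolates the table if it is the sole uncommitted change on the source branch.

```python
import shlex
import subprocess

# Hypothetical endpoint: replace with the URL of your own lakeFS S3 gateway.
LAKEFS_ENDPOINT = "https://lakefs.example.com"


def cherry_pick_commands(repo, src_branch, dst_branch, message):
    """Option 1: commit on the source branch, then cherry-pick to the destination.

    Assumption: the table is the only uncommitted change on the source
    branch, because `lakectl commit` commits everything that is pending.
    """
    src = f"lakefs://{repo}/{src_branch}"
    dst = f"lakefs://{repo}/{dst_branch}"
    return [
        ["lakectl", "commit", src, "-m", message],
        # Assumed: the source branch ref resolves to the commit just created.
        ["lakectl", "cherry-pick", src, dst],
    ]


def s3_copy_command(repo, src_branch, dst_branch, table, move=False):
    """Option 2: copy (or move) a table prefix through the S3 gateway."""
    prefix = table.strip("/")
    return [
        "aws", "s3", "mv" if move else "cp",
        f"s3://{repo}/{src_branch}/{prefix}/",
        f"s3://{repo}/{dst_branch}/{prefix}/",
        "--recursive",                      # a table is a prefix holding many objects
        "--endpoint-url", LAKEFS_ENDPOINT,  # point the AWS CLI at lakeFS, not AWS
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(cherry_pick_commands("repo", "branch-source", "branch-dest", "Add table"))
    run([s3_copy_command("repo", "branch-source", "branch-dest", "table")])
```

Passing move=True to s3_copy_command switches the command to aws s3 mv, which gives the rename behaviour as a copy followed by a delete. Keeping dry_run=True makes it easy to review the exact commands before anything touches the repository.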

The option to zero-copy uncommitted objects through lakeFS (i.e. without using a merge, commit, cherry-pick, etc.) was removed not long ago. The reasoning was to let garbage collection (GC) clean up safely without risking data loss.

iddoavn commented 1 year ago

I think that makes a lot of sense for uncommitted data. But for committed data, it would be good to have a copy, because a commit may include more changes than the ones you want to copy over.

Nevertheless, I agree that cherry-pick is helpful in many cases, especially if you have good commit hygiene.

idanovo commented 1 year ago

@ozkatz please prioritize

github-actions[bot] commented 1 year ago

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

github-actions[bot] commented 12 months ago

Closing this issue because it has been stale for 7 days with no activity.