treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

HL Python SDK: Support copy object between repositories #7447

Open N-o-Z opened 7 months ago

N-o-Z commented 7 months ago

Prerequisite: https://github.com/treeverse/lakeFS/issues/7446

Support providing a destination repo (optional, for backwards compatibility).
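One way to picture the backwards-compatible shape of this request is the sketch below. It does not use the lakeFS SDK: `ObjectRef`, `resolve_copy_destination`, and the `destination_repository` parameter are hypothetical names chosen for illustration, and the actual SDK signature may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectRef:
    """Hypothetical stand-in for an object address: repository, ref, and path."""
    repository: str
    ref: str
    path: str


def resolve_copy_destination(
    source: ObjectRef,
    destination_branch: str,
    destination_path: str,
    destination_repository: Optional[str] = None,  # hypothetical new parameter
) -> ObjectRef:
    """Return where a copy would land.

    When destination_repository is omitted, fall back to the source
    repository, so existing same-repository callers keep working unchanged.
    """
    if destination_repository is None:
        destination_repository = source.repository
    return ObjectRef(destination_repository, destination_branch, destination_path)


src = ObjectRef("repo-a", "main", "data/x.parquet")
same_repo = resolve_copy_destination(src, "dev", "data/y.parquet")
cross_repo = resolve_copy_destination(
    src, "dev", "data/y.parquet", destination_repository="repo-b"
)
```

Keeping the new argument optional and last means existing calls that pass only a branch and a path would behave exactly as before.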

arielshaqed commented 7 months ago

This seems doable, but for "common" usage some interactions with garbage collection might be strange.

For externally managed objects this is obviously easy enough to do. For an internally managed object -- anything where lakeFS chooses a path inside the repository storage namespace -- ownership makes things muddier. The source repository "owns" the object, meaning that it can garbage collect it. So if the source object is never committed, it may vanish within a few days; if it is committed, it can vanish once it falls off the head of all branches, according to a complex set of rules.

The API does not support zero-copy links even within a single repository because of the associated surprises. We can support cross-repository zero-copy links even without the API, sure, but they would be even more surprising.
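The ownership concern can be illustrated with a toy model. This is not lakeFS code and does not reproduce its garbage collection rules; the dictionaries, the `collect_garbage` helper, and the addresses are all assumptions made up for the example. It only shows how a link from one repository into another repository's storage namespace can end up dangling.

```python
from typing import Dict, Set

# Toy physical object store: physical address -> content.
storage: Dict[str, bytes] = {"s3://bucket/repo-a/data/abc123": b"payload"}

# Each toy repository maps logical paths to physical addresses.
repo_a: Dict[str, str] = {"data/x": "s3://bucket/repo-a/data/abc123"}
# Hypothetical cross-repository zero-copy link: repo-b points into repo-a's namespace.
repo_b: Dict[str, str] = {"data/x-copy": "s3://bucket/repo-a/data/abc123"}


def collect_garbage(namespace: str, live: Set[str]) -> None:
    """Delete every object under `namespace` that its owning repository no longer references.

    Only the owner's references count as live; links from other repositories are invisible here.
    """
    for address in list(storage):
        if address.startswith(namespace) and address not in live:
            del storage[address]


# repo-a stops referencing the object, then runs its own garbage collection.
del repo_a["data/x"]
collect_garbage("s3://bucket/repo-a/", set(repo_a.values()))

# repo-b still has the logical entry, but the underlying object is gone.
dangling = [path for path, address in repo_b.items() if address not in storage]
```

In this simplified model, a real copy into the destination repository's own namespace would avoid the dangling entry, at the cost of duplicating the data.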