
Move or copy datasets inside a repo #7418

Open · talSofer opened 8 months ago

talSofer commented 8 months ago

Created at: 2023-11-10T17:46:02.000Z
Priority: priority:unknown
PRD: https://github.com/treeverse/lakeFS/issues/6015
Feature Definition:
T-shirt size:
Related feature requests:

oliverdain commented 6 months ago

I just had to re-organize some data. Specifically, moving all the assets from one directory to a different directory. On the local file system it was a simple `mv dir_a sub/dir/` and ran essentially instantly. But to do that in lakeFS I had to (1) `lakectl local clone` part of my repo, (2) do the move, and (3) `lakectl local commit`. In all, that took well over an hour, as it was a few hundred GB of data. I believe it should be possible to do this in O(1) via `lakectl fs mv lakefs://.../dir_a lakefs://.../sub/dir/` without needing to clone or re-upload.
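
For concreteness, here is roughly what that workflow looks like today versus the requested command; the repository and branch names are placeholders, and `lakectl fs mv` is the proposed command, not an existing one:

```sh
# Today: materialize part of the repo locally, move, and re-upload.
# The data is transferred twice, hence over an hour for a few hundred GB.
lakectl local clone lakefs://my-repo/main/ ./work
mv ./work/dir_a ./work/sub/dir/
lakectl local commit ./work -m "Move dir_a under sub/dir"

# Requested: an O(1), metadata-only move that never touches the blobs.
lakectl fs mv lakefs://my-repo/main/dir_a/ lakefs://my-repo/main/sub/dir/dir_a/
```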

oliverdain commented 4 months ago

Note: I tried the `aws s3 mv` route as described in the original feature request, and it works but is incredibly slow. I have about 4TB of data I'd like to move (just changing the directory path) and it appears that's going to take about 5 hours.
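
For reference, that route goes through lakeFS's S3 gateway, presumably along these lines (endpoint and names are placeholders):

```sh
# `aws s3 mv` has no real rename: it is a copy followed by a delete for
# every object, so the cost scales with the data, not with the metadata.
aws s3 mv --recursive \
  --endpoint-url https://lakefs.example.com \
  s3://my-repo/main/some/directory/path/ \
  s3://my-repo/main/other/location/entirely/
```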

Update: it ran for about 12 hours and then started randomly failing on some files. Attempts to retry failed on the same files.

I think doing a rename like this server-side is probably a single SQL UPDATE query, since the metadata is all stored in Postgres and the actual data blobs don't need to move at all.
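
To illustrate that mental model (hypothetical table and column names; the next comment explains why lakeFS no longer works this way):

```sh
# Hypothetical: if staging metadata lived in one Postgres table, moving
# dir_a into sub/dir/ would be a single transactional UPDATE.
psql "$LAKEFS_DB" -c "
  UPDATE entries
     SET path = 'sub/dir/' || path
   WHERE path LIKE 'dir_a/%';  -- the blobs themselves never move
"
```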

arielshaqed commented 4 months ago

Technical response on why things are the way they are

Details of internals, feel free to skip!

This comment:

> I think doing a rename like this server-side is probably a single SQL UPDATE query, since the metadata is all stored in Postgres and the actual data blobs don't need to move at all.

is really important for understanding why this is non-trivial, because it is almost entirely true! Initial versions of lakeFS supported moves.

The thing is, an RDBMS such as Postgres won't scale to the desired performance levels. ACID transactions would make this so much simpler! But reads become about as slow as writes: the database must ensure that there is no hazard or similar from a concurrent write.

Instead we use a key-value store for lakeFS staging. That database model only allows for single-key concurrency, but here we would need consistency across two keys. Where we actually lose is that lakeFS can then no longer safely garbage-collect uncommitted objects.
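
A sketch of the hazard, using a made-up `kv` command purely for illustration (this is not a real lakeFS interface):

```sh
# Each operation is atomic on its own key only, but a move changes two keys.
kv put    "staging/main/sub/dir/obj" "$BLOB_ADDRESS"  # step 1: new key points at the blob
kv delete "staging/main/dir_a/obj"                    # step 2: old key disappears
# No transaction spans steps 1 and 2. A concurrent reader can see the
# object under both names, a crash between the steps leaves a stray entry,
# and the uncommitted-object garbage collector can no longer tell whether
# the blob behind the old key is still referenced.
```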

arielshaqed commented 4 months ago

Actual question about requirements

While we cannot safely accommodate every scenario, perhaps there are some scenarios that we could?

As an example, committed branch HEADs are immune to garbage collection, so it may be possible to support a rename operation from a committed object on branch head to a new name on that same branch. And other scenarios may also be possible.
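
Sketched concretely (hypothetical command, placeholder URIs), that restricted operation might look like:

```sh
# Possibly safe: source is committed and at the branch head, destination
# is a new name on the same branch; the underlying blob never moves.
lakectl fs mv lakefs://my-repo/main/dir_a/asset.bin \
              lakefs://my-repo/main/sub/dir/asset.bin

# Would be rejected: source is an uncommitted object that is already a
# garbage-collection candidate, where a move cannot be made safe.
```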

With that in mind, users could help us narrow down the scope! Right now I am asking where you can limit:

- Can the source always be a committed object? (For a move that obviously implies that that object is on the branch head.)
- If uncommitted, can the source always be new-ish, and thus not eligible for garbage collection?
- Is it acceptable for the operation to try to move safely, but fail if a safe move is impossible?

I am looking for a way to limit what sources are allowed, in some way which will allow us to add a safe API. The worry of course is that such an API will be unusably complex.

Of course this is not a design, and we might not be able to accommodate your important scenarios. I would like to understand actual requirements in detail.

oliverdain commented 4 months ago

> Details of internals, feel free to skip!

Thanks! I appreciate having a better understanding of why this is hard.

> Can the source always be a committed object? (For a move that obviously implies that that object is on the branch head.)

Yes, I think that's fine.

> If uncommitted, can the source always be new-ish, and thus not eligible for garbage collection?

I think committed-only is fine, but I can see a case for recent uncommitted objects off a branch head (e.g. "oops - that was a typo" for a newly created object).

> Is it acceptable for the operation to try to move safely, but fail if a safe move is impossible?

Yes, especially if the error message is clear. Like "Unable to safely move, but we can always safely move committed objects. Try committing to a branch and then moving".
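
The recovery that message suggests would then be a two-step flow (again, `lakectl fs mv` is the hypothetical command; `lakectl commit` exists today):

```sh
lakectl commit lakefs://my-repo/main -m "Checkpoint before reorganizing"
lakectl fs mv lakefs://my-repo/main/dir_a/ lakefs://my-repo/main/sub/dir/dir_a/
```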

> I would like to understand actual requirements in detail.

In my particular case it's a simple matter of data organization. We put a bunch of files in /some/directory/path and then realized a better organization would be /some/directory/with-intermediate/path or /other/location/entirely. On a local file system my use case has (so far) pretty much always boiled down to a single `mv` of a directory, and that directory was always something that was committed at the HEAD of a branch.

Thanks for considering this!

oliverdain commented 5 days ago

Has there been any more progress on this? I once again have a few TB of data I need to move. Again, it's just moving one directory into another, and it's on HEAD. As far as I know there still isn't a decent solution for this.