treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

Lakectl Diff Support For Checking If Only Metadata Changed #8113

Open farhanhubble opened 1 month ago

farhanhubble commented 1 month ago

Currently, the diff command in lakectl reports files as "modified" even if only their metadata has changed. While this could be useful for some applications, it is inefficient for data pipelines. The diff command should show, per file, whether the changes are to the contents, the metadata, or both. The lakeFS server does show this information with "identical size". The status command should have a similar feature, and local commit should allow committing files filtered by change type.
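
A minimal sketch of the per-file classification requested above, written in Python for illustration. `FileState` and `classify_change` are hypothetical names and are not part of lakectl. The sketch assumes that a content checksum and a metadata mapping are available for both sides of the diff.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FileState:
    """Hypothetical snapshot of one file on one side of a diff."""
    checksum: str  # digest of the file contents
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. mtime


def classify_change(old: FileState, new: FileState) -> str:
    """Report whether contents, metadata, both, or neither changed."""
    content_changed = old.checksum != new.checksum
    metadata_changed = old.metadata != new.metadata
    if content_changed and metadata_changed:
        return "both"
    if content_changed:
        return "content"
    if metadata_changed:
        return "metadata"
    return "unchanged"
```

With a label like this per file, a status listing or a filtered commit could select only the entries whose label is "content" or "both".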

kujenga commented 1 month ago

This would be a welcome change! My expectation was that lakectl local would work similarly to how git works, in terms of tracking the status of changes and what it considers to be a diff. In that paradigm, changing the modification time on a file tracked by lakectl should not be relevant for detecting file changes.

I'd asked about this in Slack and was directed here: https://lakefs.slack.com/archives/C016726JLJW/p1725669972042319

jameshod5 commented 1 month ago

Just want to comment and say that this is something I have been trying to figure out as well! I agree with OP: if we could filter a commit so that it includes only certain kinds of changes, that would improve our personal workflow a lot.

arielshaqed commented 1 month ago

Context

It seems that the relevant section of docs is Limitations / Warnings. It was added here.

Clarifications

There are multiple issues here:

So I would like the third decision to be selectable (by lakectl local config and possibly also an overriding command-line flag), rather than us making a decision in advance. Otherwise, whichever we pick will break some use cases.

arielshaqed commented 1 month ago

@farhanhubble about this:

it is inefficient for data pipelines

I completely understand why checking mtime is inefficient for data pipelines - lakectl local will re-upload the file when you don't need it to. Now if we ignore mtimes, how would we determine that the object changed? One way to do so would be to fingerprint or similarly digest the object, say into CityHash or even some SHA. Since we're now talking about efficiency, we should talk numbers before we go off and digest an entire subdirectory of unchanged files. So... How many objects are involved? What sizes? What is the total size of all objects?
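
To make the trade-off concrete, here is a rough Python sketch of digest-based change detection that ignores mtime. It is not how lakectl local works today. The function names are hypothetical, and SHA-256 is chosen only because it ships with the standard library. A size comparison runs first, so a file is read in full only when its size matches the known one. `tree_cost` reports the object count and total bytes, which is the worst-case amount of data to digest.

```python
import hashlib
import os
from typing import Tuple


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def content_changed(path: str, known_size: int, known_digest: str) -> bool:
    """Decide whether the contents changed, without looking at mtime."""
    if os.path.getsize(path) != known_size:
        return True  # different size: changed, no need to hash
    return file_digest(path) != known_digest


def tree_cost(root: str) -> Tuple[int, int]:
    """Count files and total bytes under root (worst-case hashing workload)."""
    count = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            count += 1
            total += os.path.getsize(os.path.join(dirpath, name))
    return count, total
```

Running `tree_cost` on a synced directory gives the object count and total size asked about above. Dividing the total by a measured hashing throughput gives an estimate of the time a full digest pass would take.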

arielshaqed commented 3 weeks ago

For user metadata, #8251 might make this more important. :-)

(But no direct user metadata support in local.)