Open farhanhubble opened 1 month ago
This would be a welcome change! My expectation was that lakectl local would work similarly to how git works, in terms of tracking the status of changes and what it considers to be a diff. In that paradigm, changing the modification time of a file tracked by lakectl should not be relevant for detecting file changes.
I'd asked about this in Slack and was directed here: https://lakefs.slack.com/archives/C016726JLJW/p1725669972042319
Just want to comment and say that this is something I have been trying to figure out as well! I agree with OP: if commits could be filtered to include only certain kinds of changes, that would improve our workflow a lot.
It seems that the relevant section of docs is Limitations / Warnings. It was added here.
There are multiple issues here:
lakectl local to work. I believe that this is the correct decision.
lakectl local does: some UNIX commands will break if local ignores mtime. For instance, all versions of Make are driven by mtime. So if lakectl local ignores mtime, it will break some Makefiles. I believe that there is no one correct decision here.
So I would like the third decision to be selectable (by lakectl local config and possibly also an override command-line flag), rather than us making a decision in advance. Otherwise, whichever we pick will break some use cases.
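To illustrate the Make concern, here is a minimal sketch of the out-of-date rule that Make applies: a target is rebuilt when it is missing or when any prerequisite has a newer mtime. The function name needs_rebuild is made up for this example, and this is not how lakectl or Make is implemented, only the comparison that makes mtime matter.

```python
import os


def needs_rebuild(target, prerequisites):
    """Mimic Make's rule: rebuild if the target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.stat(target).st_mtime_ns
    # Only modification times are compared; file contents are never read.
    return any(os.stat(p).st_mtime_ns > target_mtime for p in prerequisites)
```

If a sync tool rewrites or discards mtimes, this comparison can flip even though no contents changed, which is why an mtime policy that suits data pipelines can still break a Makefile.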
@farhanhubble about this:
it is inefficient for data pipelines
I completely understand why checking mtime is inefficient for data pipelines: lakectl local will re-upload the file even when it does not need to. Now if we ignore mtimes, how would we determine that the object changed? One way to do so would be to fingerprint or similarly digest the object, say with CityHash or even some SHA. Since we're now talking about efficiency, we should talk numbers before we go off and digest an entire subdirectory of unchanged files. So... How many objects are involved? What sizes? What is the total size of all objects?
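As a rough sketch of the digest approach, the following compares a file against a previously recorded size and SHA-256 digest without consulting mtime. The names sha256_of and content_changed, and the idea of a recorded size and digest kept from the last sync, are assumptions for illustration; lakectl local does not necessarily store or compare anything this way.

```python
import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large objects are not loaded whole


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_changed(path, recorded_size, recorded_digest):
    """Decide whether contents differ from a recorded state, ignoring mtime.

    recorded_size and recorded_digest are hypothetical values a tool
    would have saved at the last sync.
    """
    # Cheap check first: a different size proves the contents changed,
    # so the file does not need to be read at all.
    if os.path.getsize(path) != recorded_size:
        return True
    # Same size: only now pay the cost of reading and hashing the file.
    return sha256_of(path) != recorded_digest
```

The cost question raised above still applies: every file whose size is unchanged gets read in full, so the total size of the tracked objects determines whether this is affordable.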
For user metadata, #8251 might make this more important. :-)
(But no direct user metadata support in local.)
Currently, the diff command in lakectl reports files as "modified" even if only their metadata has changed. While this could be useful for some applications, it is inefficient for data pipelines. The diff command should show, per file, whether the changes are to the contents, metadata, or both. The lakeFS server does show this information with "identical size". The status command should have a similar feature, and local commit should allow committing files filtered by change type.
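The requested behavior could look roughly like the sketch below, which labels each file as changed in contents, metadata, or both, and then selects files for commit by label. FileState, classify_change, and select_for_commit are hypothetical names, and "metadata" is narrowed to mtime here for simplicity; none of this reflects existing lakectl output or flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileState:
    size: int
    mtime: float
    digest: str  # content fingerprint, e.g. a hex SHA-256


def classify_change(old, new):
    """Label the difference between two states of the same file."""
    contents = old.size != new.size or old.digest != new.digest
    metadata = old.mtime != new.mtime
    if contents and metadata:
        return "both"
    if contents:
        return "contents"
    if metadata:
        return "metadata"
    return "unchanged"


def select_for_commit(changes, wanted):
    """Return the sorted paths whose change type is in the wanted set.

    changes maps a path to an (old, new) pair of FileState values.
    """
    return sorted(
        path
        for path, (old, new) in changes.items()
        if classify_change(old, new) in wanted
    )
```

A pipeline that only cares about data would pass {"contents", "both"} as the wanted set, leaving files that were merely touched out of the commit.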