martinvonz / jj

A Git-compatible VCS that is both simple and powerful
https://martinvonz.github.io/jj/
Apache License 2.0
8.32k stars 284 forks source link

lib: do not snapshot dirs containing `CACHEDIR.TAG` #4409

Closed thoughtpolice closed 3 weeks ago

thoughtpolice commented 3 weeks ago

The "Cache Directory Tagging Specification"[0] is a (fairly old) specification for a specific file, named CACHEDIR.TAG, to be created within a directory, with some exact contents, in the event that:

This specification is used by tools like cargo for its target/ directory and buck2 for its buck-out/ directory, among many others.

This solves a very practical problem I kept running into, where:

This exact problem happens with Cargo too if you don't have a properly specified .gitignore file. It's pretty bad to snapshot GiBs of small files only to fall over at the end anyway due to file size guardrails.

Alternative solutions to all this aren't that good:

While not every tool uses CACHEDIR.TAG, the ones that do (buck2, cargo) almost certainly do so exactly for these reasons. So we should just exclude these files.

[0] https://bford.info/cachedir/

Checklist

If applicable:

PhilipMetzger commented 3 weeks ago

I have no particular like for special-casing something like this, as it doesn't cover all the build systems available (I did not find any code for it in Bazel) and presents inconsistent behavior in the long run. If this was a generalize-able setting which wasn't covered by .ignore (#3525), I'd maybe accept it. It also is covered by #4338.

thoughtpolice commented 3 weeks ago

I don't buy either of these complaints.

I don't think "It doesn't work identically for everything" is a very good answer at all, it does something very useful for many things right now. Bazel could write this file tomorrow, Buck did it recently. There is plenty of software that acknowledges or writes these files, it is a bit niche but not completely and totally out there. We automatically snapshot (potentially large) directory hierarchies automatically. This whole idea is very much intended for software like ours, it is not a "special case" for us. It's literally a case designed for us! It's by definition not special at all, to me.

I also don't think it's "inconsistent" in the slightest. The software wrote this file because it thinks the contents should be ignored, because it's re-creatable junk. So, we ignore it. What's inconsistent about that? "Why did this directory get ignored with Buck, but not Bazel?" "Because Buck wrote a file we automatically recognize, and Bazel didn't." That's not inconsistency at all. The point is that the contents aren't valuable. It's not like we're acknowledging it in some cases and not others based on the day of the week. If other software wants files to be ignored, it can write the file out too! I would like more specific complaints than this.

I also don't buy #4338 as a solution. I don't want to manually add files, I like auto tracking! It's very good. In fact, this patch only makes it behave even more usefully for me by ignoring junk it shouldn't have added anyway. It is exactly because I like auto tracking, and do not want to turn it off, that I want this feature. Shoehorning me into a totally different core UX behavior for this one thing is absolutely not acceptable IMO and tantamount to an "X is Y" thing.

And I suspect that people who use manual-add probably still won't add these files ever. What's the point of making life harder on people who do want auto-add?

Now I agree that people should add things to ignore files. But the reality is that it's very easy to run a tool for example where it creates things you didn't expect, or run into a case like mine where a .gitignore entry exists in one branch but not another and you switch, etc. The point is to avoid these sharp edges assuming the tools cooperate.

The more important criteria for a patch like this is, what cost there is to actually be implement this behavior, and what downsides for users can it have, not "can it solve everything." In my opinion, the answer to those questions is "very little" and "nearly none" respectively. Actually, there is a problem with this patch based on the tests: that you can't specifically track a file underneath the CACHEDIR.TAG file, if you really wanted to. OK, that can probably be fixed, though I think it's so totally marginal and unlikely that we could just do it only when needed. (Similarly, maybe it could be a toggled flag. But really, I'm not keen on turning a simple 5 line patch that can be easily moved or expanded into a 20 line one over a hypothetical, frankly, because it's just another pointless knob.)

thoughtpolice commented 3 weeks ago

I would also like to posit this simple observation: this whole problem could be solved without our involvement if Buck itself simply wrote out a file called buck-out/.gitignore that just contained the contents * — gitignore files are not special scriptures that can only be written by humans, they can be also written by programs too, just like CACHEDIR.TAG files. We wouldn't even try to track this file. How would you know about it? My point is that inconsistency arises from the application of these files, not whether humans or programs generated or respect them.

OK, why doesn't buck do that? Well, it could do that. Maybe it should do that, even (it would certainly make things less error prone! Hell, why not write out hgignore too?) But one reason it doesn't is simple: because Git isn't the only program in the world, and other software scans those directories, and probably wants to know it's useless junk! So writing this helps those files. It could write out more files, too, I guess.

So, I just don't think this is something that can be really perfect or wholly subsumed and "fixed" by one feature. It's just one of those things where every participant has to actually... participate... with the idea to it useful. It's approximate and always will be.

ilyagr commented 3 weeks ago

Perhaps we can eventually treat files in such dirs similarly to files >1MB, e.g. inform the user that there are un-ignored files that probably shouldn't be snapshotted?

This presumes that we solved the current sharp edges of how we treat these kinds of files. (I'm not sure how close we are to that) As is, it may or may not be worth it to implement it now, I'm not sure.

Assuming Git does not auto-ignore such dirs, it feels like silently not snapshotting them would eventually lead to weird corner cases and user confusion.

Add a .gitignore file under buck-out/? Really?

This is what came to my mind, as the best option for now.

OK, why doesn't buck do that? Well, it could do that. Maybe it should do that, even (it would certainly make things less error prone! Hell, why not write out hgignore too?) But one reason it doesn't is simple: because Git isn't the only program in the world, and other software scans those directories, and probably wants to know it's useless junk! So writing this helps those files. It could write out more files, too, I guess.

I think it should do that. Yes, it might eventually have to write several files like this, but that's not the end of the world. I don't see a downside. (But also, I don't think Buck is wrong for not doing it, it's just an nice feature it could have).

thoughtpolice commented 3 weeks ago

Yes, I realized now that this won't work so well with colocation, because git status will still show the files. I guess for now I can just upstream my other patch (/buck-out in gitignore), but I'm going to think about this a bit more because it frankly still sucks, IMO.

jgilchrist commented 3 weeks ago

Just to make sure as it's not listed under alternative options - there is also ~/.config/git/ignore, which is similar to .git/info/exclude but works globally. Same upsides and downsides as .git/info/exclude, but does at least mean it only needs to be configured once.

thoughtpolice commented 3 weeks ago

Closing this because I don't see a way forward with the colocation bug.