Open martinvonz opened 2 years ago
Filters are supported in libgit2, but not in git2.rs. It's also apparently not very hard to implement the logic ourselves. See https://github.com/rust-lang/git2-rs/issues/442.
Also, the list of files that use lfs is stored in .gitattributes, so this is related to https://github.com/martinvonz/jj/issues/53.
I'm worried about clean/smudge in general because it seems expensive to make it behave consistently. https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes says that smudging happens just before checkout and cleaning happens just before staging. If that's correct, then that seems to mean that git diff on a commit that changes an LFS file would just show the changed metadata, which is not very user-friendly. On the other hand, if it's not correct, and smudging happens whenever you need to present the file to the user, then there are ugly corner cases to deal with, like where .gitattributes changes but subtrees remain the same (then you'd technically need to diff the whole tree recursively to tell if it actually changed). I think I asked someone on the Git team at Google about this and they said Git just ignores the corner case(s).
there are ugly corner cases to deal with, like where .gitattributes changes but subtrees remain the same
(Update: On second thought, this paragraph might not really be addressing your point) Yes, I remember setting up the LFS repo being a pain. I don't remember how git reacted to changing .gitattributes, but it took a while to get right; my intention is to never change the setup (which directories LFS is used for). To make this possible, I have a repository just for LFS, separate from my main dotfiles repository.
For reference, the setup looks like this:
$ cat .gitattributes
.local/bin/* filter=lfs diff=lfs merge=lfs -text
(I use GitHub's LFS support to sync a few binaries across my machines. I use stow to symlink to the git repo's .local/bin from my real ~/.local/bin)
My sense is that only the filter=lfs -text part is crucial. The mapping from filter=lfs to actual git-lfs commands to run for cleaning/smudging happens inside the git config.
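To make the clean/smudge round trip concrete, here is a minimal sketch of what the two filter directions do for LFS. It assumes the pointer layout visible in the diff further down this thread (version, oid sha256, size). The `store` dictionary is a hypothetical stand-in for wherever the real content is kept; this is not how git-lfs itself is implemented.

```python
import hashlib

LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def clean(content):
    """Turn real file content into the small pointer text that gets committed."""
    oid = hashlib.sha256(content).hexdigest()
    return "version {}\noid sha256:{}\nsize {}\n".format(
        LFS_SPEC_URL, oid, len(content)
    )


def smudge(pointer, store):
    """Turn pointer text back into real content.

    `store` is a hypothetical mapping from hex oid to bytes; a real
    filter would read a local object cache or download the object.
    """
    for line in pointer.splitlines():
        key, _, value = line.partition(" ")
        if key == "oid":
            return store[value.split(":", 1)[1]]
    raise ValueError("not an LFS pointer")
```

Cleaning runs on the way into the repository and smudging on the way out, so a tool that only ever sees committed data sees the pointer text.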
If that's correct, then that seems to mean that git diff on a commit that changes an LFS file would just show the changed metadata, which is not very user-friendly.
I think that's what the diff and merge gitattributes are for, but they don't seem to have much of an effect now:
$ git diff -r HEAD^
diff --git a/.local/bin/hwatch b/.local/bin/hwatch
index acc9c52..5e4ed26 100755
--- a/.local/bin/hwatch
+++ b/.local/bin/hwatch
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
-oid sha256:c0e1ce1ee1a4841f2df5b7721bbeefe7d91f12bfe77b0922c9ac708f93379f2b
-size 7989912
+oid sha256:4dbaaf94bfd0812a38e5eb9bc01ec0f22f0953efc13be768aa310deb6b5982ce
+size 7912656
This works pretty well for LFS specifically. I'm not sure what else, if anything, clean/smudge filters are used for.
This does make LFS a little different from the way jj treats conflicted files.
This might all be quite awkward with jj's auto-rebasing. OTOH, the level of awkwardness would be similar to what happens if one tracks a binary file in a jj repository normally (without LFS), which is something we should probably eventually improve if we can (I'm not sure how).
Note that LFS is widely disliked due to half-baked support from GitHub (small quotas, made worse by being consumed by both third-party forks and CI activity) and even less support elsewhere. I hope jj can offer a first-class solution to large binary file handling eventually, and that any LFS compat is forwards-compatible with that.
git diff on a commit that changes an LFS file would just show the changed metadata, which is not very user-friendly
What else might they display? A diff of a large binary file will rarely be intelligible, even with a suitable diff algorithm (e.g. based on rolling hashes). I suppose it would be cool for a Sufficiently Smart diff viewer to be able to e.g. display two versions of a .png, but in general large binary files are opaque.
Fair enough :) But the point still stands for clean/smudge in general.
As I mentioned, there seemed to be some attempt to support intelligent diffing with LFS. I would guess that the original idea was for the user to configure custom diff tools (e.g. for pngs) that would be called by git-lfs diff or something like that.
Note that LFS is widely disliked
Regardless of LFS's merits, I imagine there's plenty of other users like me that would love to use jj in an existing repo that uses git lfs, but currently can't. Whether I use jj should be opaque to others in the repo, so this wouldn't be a compelling reason to migrate the codebase away from lfs (to whatever solution may be better) in a many-user repo.
I agree, support for LFS would be mostly (maybe only) to make it easier to use jj with existing git repos.
Is there a way to ignore LFS files so that LFS-enabled repos can still use jj, even if it means we can't interact with LFS itself while using jj? Just wondering if there's a way to use jj without needing full support for LFS.
If you don't need the LFS files, then it probably already works - you'd just see the placeholder files (pointers to the real content) in the working copy, I think. But I suspect that you need the actual files, and for that I can't think of a good solution.
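As an illustration of what "seeing the placeholder files" means, here is a hedged sketch of how a tool could tell a pointer file apart from real content. It assumes the three-line pointer layout shown in the diff earlier in this thread and ignores any optional extension lines. The 1024-byte cutoff is a choice made for this sketch, and `parse_pointer` is a hypothetical helper, not part of jj or git-lfs.

```python
import re

POINTER_RE = re.compile(
    r"\Aversion https://git-lfs\.github\.com/spec/v1\n"
    r"oid sha256:(?P<oid>[0-9a-f]{64})\n"
    r"size (?P<size>\d+)\n\Z"
)


def parse_pointer(data):
    """Return (oid, size) if `data` looks like an LFS pointer, else None."""
    if len(data) > 1024:  # pointers are tiny; skip large files cheaply
        return None
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # binary content cannot be a pointer
    match = POINTER_RE.match(text)
    if match is None:
        return None
    return match.group("oid"), int(match.group("size"))
```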
Oh, using sparse checkouts in a colocated repo might work. However, sparse checkouts don't currently support negative patterns, so it could be very annoying to maintain the sparse patterns depending on your repo. Hmm, it also looks like we don't have any documentation about sparse checkouts, other than jj help sparse. Run jj sparse set --clear --add <path prefix> --add <path prefix> ... . If you realize it's unmaintainable, run jj sparse set --reset to include all paths in the working copy again.
So sparse seems like an inverse .gitignore (in that it's a list of path inclusions only)? Unfortunately like you said, this would probably only work for a sizable repo if it supported negative patterns, if I have it correct that --remove only removes an existing inclusion, rather than actually adds a negative matching pattern.
This solution would have worked great especially if it supported negative globs so that I could just add my existing LFS globs in my .gitattributes (ex. **/snapshots/**/*.png) into jj sparse.
We do want to add support for arbitrary patterns in jj sparse. If you have time to spare, I think a good start would be to add a new GlobMatcher in https://github.com/martinvonz/jj/blob/main/lib/src/matchers.rs.
Then we'd also need to figure out the UX for adding and removing patterns. Git seems to use the same format as for .gitignores (https://git-scm.com/docs/git-sparse-checkout). It seems that you can add paths to the list with git sparse-checkout add <pattern>, but I didn't find a command to modify the list. Maybe you need to manually edit .git/info/sparse-checkout for that. Also, Git has something called "cone mode". I hope we can avoid exposing something like that to the user.
Instead of patterns being "prefixes" as they are now, can't we allow arbitrary globs in jj sparse set (--add|--remove) in terms of UI? Then you could add, remove and list patterns (rather than paths) from the CLI.
Reading comments in this issue, it seems like jj sparse could make it easier to work with Git LFS and replace git update-index --skip-worktree (which I've been using a lot recently), so I could also chip in and try to help bring this.
Edit: adding/removing globs in the command line is (as the git docs mention) error-prone. I personally think this is fine as long as we add warnings about it. An alternative would be a jj sparse edit command which brings up $EDITOR on a temporary file with one pattern per line. After saving the file, jj parses the saved file as patterns and saves them wherever/however it wants.
Edit 2: the git docs mention that having "non-cone" globs can slow down commands, but (naively) this seems like it could be solved by compiling patterns similarly to what globset does.
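The editor-based flow proposed above (one pattern per line in a temporary file, parsed after the editor exits) could look roughly like the sketch below. Both function names are hypothetical and nothing here reflects how jj stores sparse patterns. Note that stripping whitespace would mangle paths that really begin or end with spaces, which a real implementation would need to handle.

```python
import os
import shlex
import subprocess
import tempfile


def parse_patterns(text):
    """Parse one pattern per line, dropping blank lines and duplicates."""
    patterns = []
    for line in text.splitlines():
        pattern = line.strip()
        if pattern and pattern not in patterns:
            patterns.append(pattern)
    return patterns


def edit_patterns(current, editor=None):
    """Open the current patterns in an editor and return the edited list."""
    editor = editor or os.environ.get("EDITOR", "vi")
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(current) + "\n")
        path = f.name
    try:
        # Blocks until the editor process exits, then re-reads the file.
        subprocess.run(shlex.split(editor) + [path], check=True)
        with open(path) as f:
            return parse_patterns(f.read())
    finally:
        os.unlink(path)
```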
Instead of patterns being "prefixes" as they are now, can't we allow arbitrary globs in jj sparse set (--add|--remove) in terms of UI? Then you could add, remove and list patterns (rather than paths) from the CLI.
If we accept both things like docs/ and **/Cargo.toml, then the issue becomes how to tell which is which. Is docs a file called that or should it match recursively? We could "solve" that by saying that globs are also recursive, so the glob docs also matches all files anywhere under that directory, but I think that will make e.g. **/Cargo.toml confusing, because you would probably not expect that to match lib/Cargo.toml/foo. That's probably not much of an issue in practice for this sample path (no one creates a Cargo.toml directory), but there are probably other examples that do happen in practice. Mercurial solves the problem by allowing a prefix to specify what kind of pattern it is. We can just copy that solution.
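A rough sketch of the kind-prefix idea, so that docs/ and **/Cargo.toml cannot be confused. The `path:` and `glob:` prefixes are borrowed from Mercurial's pattern syntax; treating a bare pattern as `path:` and the exact glob rules used here are assumptions made for illustration, not a proposal for jj's actual syntax.

```python
import re


def glob_to_regex(glob):
    """Translate a glob to a regex in which '*' and '?' never cross '/'."""
    out = []
    i = 0
    while i < len(glob):
        if glob.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole directories
            i += 3
        elif glob.startswith("**", i):
            out.append(".*")
            i += 2
        elif glob[i] == "*":
            out.append("[^/]*")
            i += 1
        elif glob[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(glob[i]))
            i += 1
    return "".join(out)


def compile_pattern(spec):
    """Return a predicate for 'path:<prefix>' or 'glob:<glob>' patterns."""
    kind, sep, body = spec.partition(":")
    if not sep:
        kind, body = "path", spec  # bare patterns stay recursive prefixes
    if kind == "path":
        prefix = body.rstrip("/")
        if not prefix:
            return lambda p: True  # empty prefix means the whole repo
        return lambda p: p == prefix or p.startswith(prefix + "/")
    if kind == "glob":
        regex = re.compile(glob_to_regex(body))
        return lambda p: regex.fullmatch(p) is not None
    raise ValueError("unknown pattern kind: {!r}".format(kind))
```

With this split, a `glob:` pattern must match the whole path, so it does not silently pick up files underneath a directory that happens to match, while `path:` keeps today's recursive prefix behaviour.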
Reading comments in this issue, it seems like jj sparse could make it easier to work with Git LFS and replace git update-index --skip-worktree (which I've been using a lot recently), so I could also chip in and try to help bring this.
That would be appreciated, thanks! Just to be clear, jj sparse makes jj completely ignore the non-sparse paths, so you will have to rely on git to populate and update those paths.
FYI, the typical use case for sparse checkouts is when you're working on only a small part of a large repo, like working only on a particular file system in the Linux repo (I have never worked in the Linux repo, so I have no idea if that's a realistic example - maybe you need most of the repo in order to do a build anyway, for example).
Edit: adding/removing globs in the command line is (as the git docs mention) error-prone. I personally think this is fine as long as we add warnings about it. An alternative would be a jj sparse edit command which brings up $EDITOR on a temporary file with one pattern per line. After saving the file, jj parses the saved file as patterns and saves them wherever/however it wants.
Yes, I think that would be useful. I was looking for git sparse-checkout edit when I was typing my previous message :)
Edit 2: the git docs mention that having "non-cone" globs can slow down commands, but (naively) this seems like it could be solved by compiling patterns similarly to what globset does.
There are still cases that can't be made fast, like **/Cargo.toml, for example. We need to visit every file in the repo to see if it matches. We can add a warning if the user adds a pattern like that. There are less clear cases like some/dir/**/Cargo.toml, which may be cheap if some/dir/ is small. So maybe what we want to do is to apply the pattern, then check how many paths we visit and how many paths match, and warn if < 1% match or something. But that might be going too far :) So maybe we warn exactly when a pattern is not a pure prefix pattern (i.e. exactly what git's cone mode allows).
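The two warning policies floated above can be sketched as follows. Everything here is hypothetical: `paths` is a flat list standing in for whatever the matcher actually has to walk, so "visited" is just its length, whereas a real implementation would count only the tree entries it could not prune. The 1% threshold is the figure from the comment above.

```python
GLOB_CHARS = set("*?[")


def is_pure_prefix(pattern):
    """True if the pattern has no glob metacharacters (cone-mode style)."""
    return not any(c in GLOB_CHARS for c in pattern)


def match_stats(matches, paths):
    """Return (visited, matched) for a predicate applied to every path."""
    visited = matched = 0
    for path in paths:
        visited += 1
        if matches(path):
            matched += 1
    return visited, matched


def should_warn(pattern, matches, paths, min_ratio=0.01):
    """Warn when a non-prefix pattern matches under `min_ratio` of visited paths."""
    if is_pure_prefix(pattern):
        return False  # prefix patterns can prune whole subtrees
    visited, matched = match_stats(matches, paths)
    return visited > 0 and matched / visited < min_ratio
```

The simpler policy from the end of the comment is just `not is_pure_prefix(pattern)`, which needs no traversal at all.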
For posterity, there are other legit uses of Git's clean/smudge filters beyond LFS: https://github.com/elasticdog/transcrypt
Oh, I didn't realize that jj erases all files not in jj sparse list; I thought it would simply ignore them (the way Git ignores files with --skip-worktree). Then I'm not sure jj sparse is the way to go (for my particular use case).
With that said, I actually implemented jj sparse set --edit before I realized this, so I can submit a PR for it (and even if it's not submitted, it can serve as future reference).
Oh, I didn't realize that jj erases all files not in jj sparse list; I thought it would simply ignore them
This confused me. My impression was the opposite: jj is only allowed to touch (or erase) files in jj sparse list, and should ignore files not in jj sparse list. Did I miss something?
I think @71 meant when you go from having some part of the workspace populated to having that part not populated, then the jj sparse set command will remove those paths. For example, if you do jj git clone <the jj repo itself> and then jj sparse set --clear --add src, then all of lib, docs etc. will be removed.
@71, if you make git populate those paths after setting the sparse patterns with jj sparse, does that work for you?
In my case, the problem was that I had files that I did not want in the Git repo.
1. jj git clone ... && cd ...
2. echo abc > abc
3. jj sparse set --clear --add src
At this point abc does not exist in the working copy anymore, but jj st shows it. I didn't realize that the whole directory had been removed, did jj untrack abc, and lost the file. I can also use jj sparse set --add abc after step 3 to recover abc, but am not sure how to recover abc through Git (without adding it back into the repo).
but am not sure how to recover abc through Git (without adding it back into the repo).
Do you mean that you want it as an untracked file? You can do jj cat abc > abc (jj cat abc reads the content from a commit, and defaults to reading it from the working-copy commit).
Regardless of LFS's merits, I imagine there's plenty of other users like me that would love to use jj in an existing repo that uses git lfs, but currently can't.
I would like to echo this sentiment: Beloved or not, LFS is fairly widely used to track binary files in repos for various reasons, and missing support for it excludes a large amount of repositories from use with jj. It's a very promising sign that "mundane" user compatibility concerns like this are taken seriously :+1: thanks for that, and thanks for jujutsu!
To add some colour to this, I think it's not necessary that jj support the full set of Git LFS features (like smudge, clean, etc), only that jj be able to interop gracefully in a colocated Git LFS repo. IMO it's okay to say this is out of scope for jj, then rely on git lfs checkout to smudge the files, and git commit to commit cleaned files.
If so, then perhaps all that's needed is for jj to just ignore files that are named in .gitattributes, i.e.:
git lfs checkout
IIUC. There will be some unhappy cases when the .gitattributes file changes, and some files that used to be ignored aren't ignored any more (and vice-versa), but it should work most of the time.
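For the "ignore files named in .gitattributes" idea, a minimal sketch of pulling out the LFS-tracked patterns might look like this. It is deliberately simplified: it only reads a single file, does not handle quoted patterns, and does not model later lines overriding earlier ones (for example with -filter), all of which real gitattributes handling has to deal with. `lfs_patterns` is a hypothetical helper name.

```python
def lfs_patterns(gitattributes_text):
    """Return the patterns whose attribute list includes filter=lfs."""
    patterns = []
    for raw in gitattributes_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns
```

Applied to the .gitattributes shown earlier in this thread, this yields the single pattern .local/bin/*.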
Apologies if this was suggested earlier, I tried to digest the discussion so far as best as I could.
Git LFS seems to be used frequently enough that it may be worth adding support for it. I don't think it'll be a priority for me very soon, but I guess that depends on how many people want it.
The specification says that it uses clean/smudge filters. We don't have anything like that yet. So the first step is probably to add support for that. We could make the filters only available internally (i.e. in Rust code) to start with to keep it simple. On the other hand, it might not be hard to make them user-configurable.
Another option is to add a separate file type in the data model for LFS entries. For reference, we currently have files, symlinks, trees, conflicts, and gitmodules. I haven't thought through the consequences yet. There should be no difference to the user and no difference in the representation when using the Git backend. However, clean/smudge filters are probably useful to have anyway. Oh, one possible advantage of representing LFS entries in the model is that we can decide to always leave merged LFS files as conflicts, without downloading the files until the user checks them out or looks at the diff etc.
I don't yet know what other aspects of Git LFS we need to consider.
Originally requested in #77.