martinvonz / jj

A Git-compatible VCS that is both simple and powerful
https://martinvonz.github.io/jj/
Apache License 2.0

Implement GC #12

Open martinvonz opened 3 years ago

BatmanAoD commented 10 months ago

Until this is implemented, is there a way to completely eliminate data from the .jj repository, such as an inadvertently-snapshotted key file or target directory? (For colocated repos, I can of course delete .jj itself, but I'd rather not.)

martinvonz commented 10 months ago

I'm afraid not, at least not without a lot of manual steps. You'd need to manually roll back to an earlier operation and manually remove any refs/jj/keep/* keeping the unwanted commits alive. It's honestly not that complicated to implement GC. Let us know if you're interested and we can provide some guidance. I actually thought I had already described the basic idea here, but I clearly haven't, so let me at least do that. Here is what I had in mind:

  1. Allow pruning old operations from the operation log. This involves teaching algorithms that walk the operation log to behave reasonably when an older operation is missing. For example, jj op log needs to not crash. I don't know if we should record pruned operation IDs so we can distinguish that case from a broken operation store.

  2. Implement GC by walking the operation log and collecting all commits pointed to by refs and anonymous heads. Then pass that set into some new `Backend::gc(&self, commits: &[CommitId])` function. The `GitBackend` will then remove `refs/jj/keep/*` refs not pointing into that set. It will then call Git's normal GC (which might mean subprocessing to `git` if neither libgit2 nor gitoxide supports GC).
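A minimal sketch of step 2, with plain strings standing in for jj's `CommitId` and an in-memory parent map standing in for the backend (all names here are hypothetical, not jj's actual API):

```rust
use std::collections::{HashMap, HashSet};

// Collect all commits reachable from the heads of the views in the op log.
fn live_commits(
    parents: &HashMap<String, Vec<String>>,
    view_heads: &[String],
) -> HashSet<String> {
    let mut live = HashSet::new();
    let mut stack: Vec<String> = view_heads.to_vec();
    while let Some(id) = stack.pop() {
        if live.insert(id.clone()) {
            if let Some(ps) = parents.get(&id) {
                stack.extend(ps.iter().cloned());
            }
        }
    }
    live
}

// Report which refs/jj/keep/* refs no longer point into the live set,
// so the backend can delete them before running Git's normal GC.
fn keep_refs_to_delete(
    keep_refs: &HashMap<String, String>, // ref name -> commit id
    live: &HashSet<String>,
) -> Vec<String> {
    keep_refs
        .iter()
        .filter(|(_, target)| !live.contains(*target))
        .map(|(name, _)| name.clone())
        .collect()
}

fn main() {
    let mut parents = HashMap::new();
    parents.insert("b".to_string(), vec!["a".to_string()]);
    parents.insert("c".to_string(), vec!["a".to_string()]);
    // Only "b" is a view head, so "c" is not live.
    let live = live_commits(&parents, &["b".to_string()]);
    assert!(live.contains("a") && live.contains("b") && !live.contains("c"));

    let mut keep_refs = HashMap::new();
    keep_refs.insert("refs/jj/keep/1".to_string(), "b".to_string());
    keep_refs.insert("refs/jj/keep/2".to_string(), "c".to_string());
    let dead = keep_refs_to_delete(&keep_refs, &live);
    assert_eq!(dead, vec!["refs/jj/keep/2".to_string()]);
}
```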

arxanas commented 10 months ago

If someone implements GC, it would also be a good time to "compact" the keep refs (assuming that jj doesn't already do that). Some Git operations scale with the number of refs, so having thousands of jj/keep/* refs can slow them down. As part of GC, we could calculate the set of heads of the live commits and create/preserve only those refs, while deleting all other jj/keep/* refs.
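A sketch of the head-calculation part of that idea, assuming the live set and the parent relation are already known (types and names made up for illustration):

```rust
use std::collections::{HashMap, HashSet};

// The heads are the live commits that no other live commit lists as a parent.
// Only these need a keep ref; every other refs/jj/keep/* ref is redundant.
fn live_heads(
    parents: &HashMap<String, Vec<String>>,
    live: &HashSet<String>,
) -> HashSet<String> {
    let mut heads: HashSet<String> = live.clone();
    for (child, ps) in parents {
        if live.contains(child) {
            for p in ps {
                heads.remove(p); // a live commit's parent is not a head
            }
        }
    }
    heads
}

fn main() {
    // Linear history a <- b <- c: only "c" is a head.
    let mut parents = HashMap::new();
    parents.insert("b".to_string(), vec!["a".to_string()]);
    parents.insert("c".to_string(), vec!["b".to_string()]);
    let live: HashSet<String> = ["a", "b", "c"].iter().map(|s| s.to_string()).collect();
    let heads = live_heads(&parents, &live);
    let expected: HashSet<String> = ["c".to_string()].into_iter().collect();
    assert_eq!(heads, expected);
}
```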

ilyagr commented 10 months ago

The current workaround for the refs building up is running git pack-refs --all occasionally. This is mentioned offhandedly in https://martinvonz.github.io/jj/v0.11.0/git-compatibility/#co-located-jujutsugit-repos. I'm guessing (but haven't checked) that after packing, the extraneous jj/keep refs take up disk space, but might not actually slow things down much.

It'd obviously be nicer if jj could do this itself, combined with GC.
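For reference, the workaround can be tried in a throwaway repository with plain git (no jj involved):

```shell
# Demonstrate the pack-refs workaround in a temporary repo.
cd "$(mktemp -d)"
git init -q
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m init
# jj creates refs like this to protect commits from Git's GC:
git update-ref refs/jj/keep/example HEAD
# Pack all loose refs into .git/packed-refs, removing per-ref file overhead:
git pack-refs --all
# The ref still exists; it is just packed rather than a loose file:
git for-each-ref refs/jj/keep
```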

arxanas commented 10 months ago

@ilyagr I mention it because, at my work, a certain Phabricator script invokes git for-each-ref --contains X, which hangs because I have 100k (packed) git-branchless refs 😄.

martinvonz commented 10 months ago

> I'm guessing (but haven't checked) that after packing, the extraneous jj/keep refs take up disk space, but might not actually slow things down much.

I think they are the cause of #293.

We can even create a single commit with all visible heads as parents and attach a ref to that commit. That way we only need one ref to preserve all commits.
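The idea can be demonstrated with plain git plumbing; jj would presumably do the equivalent through its backend rather than shelling out like this:

```shell
# Hypothetical demo of the single-keep-ref idea.
cd "$(mktemp -d)"
git init -q
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m head-a
A=$(git rev-parse HEAD)
git checkout -q --orphan other   # create a second, unrelated head
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m head-b
B=$(git rev-parse HEAD)
# One synthetic commit whose parents are all the heads to preserve.
# Its tree contents are irrelevant, so use the empty tree:
TREE=$(git mktree </dev/null)
KEEP=$(git -c user.name=t -c user.email=t@t commit-tree "$TREE" -p "$A" -p "$B" -m keep)
# A single ref now keeps both heads reachable:
git update-ref refs/jj/keep/all "$KEEP"
git rev-list --count "$KEEP"   # the keep commit plus both heads
```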

yuja commented 10 months ago

For the record, jj/keep/* refs are now excluded from git::import_refs(). git pack-refs --all will still help when importing a large number of tags and branches, though.

> I think they are the cause of https://github.com/martinvonz/jj/issues/293.

I hope it will be mitigated by gitoxide. Last time I checked, libgit2 unpacked a commit object for each ref, which was the major CPU overhead of jj git fetch. OTOH, git fetch is reasonably fast.

martinvonz commented 10 months ago

I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2), but I haven't done anything to confirm that.

ilyagr commented 10 months ago

> I think they are the cause of https://github.com/martinvonz/jj/issues/293.

Good point, good to know. ~~I had a feeling that git pack-refs helps with this, but I'm not entirely sure.~~ (Yuya just posted something much more informed about this point above.)

> We can even create a single commit with all visible heads as parents and attach a ref to that commit. That way we only need one ref to preserve all commits.

On one hand, I think it's clever.

On the other hand, it reminded me of the following piece of code we recently added, which is quadratic in the number of parents a commit has. It made me wonder how many other git tools might have code like that. In other words, I'm worried some tools won't be optimized to deal smoothly with commits that have hundreds or thousands of parents.

https://github.com/martinvonz/jj/blob/b37293fa683c87d70742c654d2c8ee56fbf752c7/lib/src/rewrite.rs#L388-L399

It also might be fine, I'm unsure.

yuja commented 10 months ago

> I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2),

iirc, libgit2 unpacks commits to sort refs by committer timestamp (and to exclude non-commit refs), then picks up the first ~500 refs or so. I don't know which phase this function belongs to. I think the git CLI implements more modern features, so git fetch vs. (libgit2-based) jj git fetch isn't a 1:1 comparison of the same logic.

martinvonz commented 10 months ago

> On the other hand, it reminded me of the following piece of code we recently added, which is quadratic in the number of parents a commit has.

When I reviewed that, I was also wondering to myself how long it'll take before someone runs into that :)

> In other words, I'm worried some tools won't be optimized to deal smoothly with commits that have hundreds or thousands of parents.

I suspect it's rare that tools inspect unusual refs like refs/jj/keep/* to start with, so it's probably not going to happen often. And if it does happen, it should be fixed in the tool. If we're worried about it, we could create such commits as a tree of merge commits with at most 100 parents each, or something.

jonathantanmy commented 10 months ago

> I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2),

> iirc, libgit2 unpacks commits to sort refs by committer timestamp (and exclude non-commit refs)

This is also done by Git, not for sorting, but for finding a cutoff date (calculated from the remote refs) to determine what commit dates would be considered "recent". See how the cutoff variable in mark_complete_and_common_ref() in fetch-pack.c is used.

Also note that I don't think libgit2 has protocol v2, so there is no reduced list of refs.

ilyagr commented 10 months ago

(🤔 out loud)

Following the discussion in https://github.com/martinvonz/jj/pull/2659, I thought of an approach to GC that seems a little naive and simplistic, but also seems conceptually nice (assuming it's correct). Unfortunately, it only makes the git log --all problem discussed in that PR worse.

For each operation in the op log, we can have a Git merge commit. Its first parent will be the merge commit for the previous operation. The rest are the heads of the view for this operation (i.e., the commits needed to preserve this view from garbage collection). The merge commit for the latest operation gets a jj/keep ref.

Now, garbage collection can be thought of as a rebase: pick one of these merge commits and rebase it so that the merge commit for the previous operation is no longer a parent. Now, "normal" git gc will do more or less the right thing.

Update: The content of the commits is not important here, so this is more of a "verbatim rebase" AKA "reparenting" from #1027.

Similarly, this could look like a rebase in the operation log (create one operation for all the changes up to operation X-1 and rebase operation X onto that). I think operation log ids are randomly generated, so we can consider them as change ids. (We could also have commit ids for them).
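The chain-of-merge-commits idea and the "GC as reparenting" step can be sketched abstractly (all names here are hypothetical; real commit and operation ids are elided):

```rust
// Each operation gets a synthetic merge commit: first parent is the previous
// operation's commit, remaining parents are that operation's view heads.
#[derive(Clone, Debug)]
struct OpCommit {
    prev: Option<usize>,     // index of the previous operation's commit
    view_heads: Vec<String>, // heads of the view at this operation
}

// Everything kept alive by the jj/keep ref on the latest operation's commit.
fn reachable_heads(chain: &[OpCommit], latest: usize) -> Vec<String> {
    let mut heads = Vec::new();
    let mut cur = Some(latest);
    while let Some(i) = cur {
        heads.extend(chain[i].view_heads.iter().cloned());
        cur = chain[i].prev;
    }
    heads
}

// GC as reparenting: cut the chain at `cutoff` so older operations' heads
// are no longer reachable, letting "normal" git gc collect them.
fn reparent(chain: &mut [OpCommit], cutoff: usize) {
    chain[cutoff].prev = None;
}

fn main() {
    let mut chain = vec![
        OpCommit { prev: None, view_heads: vec!["a".into()] },
        OpCommit { prev: Some(0), view_heads: vec!["b".into()] },
        OpCommit { prev: Some(1), view_heads: vec!["c".into()] },
    ];
    assert_eq!(reachable_heads(&chain, 2), vec!["c", "b", "a"]);
    reparent(&mut chain, 2);
    assert_eq!(reachable_heads(&chain, 2), vec!["c"]);
}
```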

Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states rather than the current 2: visible, hidden, or unavailable (AKA garbage-collected, though it could also be appropriate for partial/shallow clones).
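A minimal sketch of that third state, with a made-up enum name (jj's actual visibility tracking works differently):

```rust
// Hypothetical tri-state visibility, extending jj's current visible/hidden
// distinction with an "unavailable" state for GCed or never-fetched commits.
#[derive(Debug, PartialEq)]
enum CommitState {
    Visible,
    Hidden,
    // The commit id is known but its data is gone: garbage-collected,
    // or never present (partial/shallow clone).
    Unavailable,
}

fn main() {
    let state = CommitState::Unavailable;
    // Code walking history would need to stop (or degrade gracefully)
    // at unavailable commits instead of crashing.
    assert_ne!(state, CommitState::Visible);
    assert_ne!(state, CommitState::Hidden);
}
```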

jonathantanmy commented 10 months ago

> (🤔 out loud)

Also thinking out loud...

> Following the discussion in #2659, I thought of an approach to GC that seems a little naive and simplistic, but also seems conceptually nice (assuming it's correct). Unfortunately, it only makes the git log --all problem discussed in that PR worse.

> For each operation in the op log, we can have a Git merge commit. Its first parent will be the merge commit for the previous operation. The rest are the heads of the view for this operation (i.e., the commits needed to preserve this view from garbage collection). The merge commit for the latest operation gets a jj/keep ref.

> Now, garbage collection can be thought of as a rebase: pick one of these merge commits and rebase it so that the merge commit for the previous operation is no longer a parent. Now, "normal" git gc will do more or less the right thing.

This means that when we GC, we can no longer go back to all previous operations - don't we want to save some (say, from the past few weeks)? I guess we can solve this by making a commit that refers to the heads of all the operations we want to save (maybe using Martin's giant merge commit idea; if/when we develop a way to simplify a list of commits down to just the commits needed to reach all of them, we can apply that optimization when creating the giant merge commit). Then, when rebasing the latest merge commit, we not only remove one parent but also add this newly made commit as one of its parents.

> Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states rather than the current 2: visible, hidden, or unavailable (AKA garbage-collected, though it could also be appropriate for partial/shallow clones).

This sounds like it would solve it from the jj side, but on the Git side, the commits are still referred to and cannot be GCed.

Having said that, being able to store metadata about commits seems like a useful idea (e.g., going off the partial/shallow clone idea, we could record that we know about a commit because a child commit was fetched from such-and-such remote on such-and-such day). But we currently don't have a design for how to send/receive this data, nor for whether it needs to have a hash (not sure that's needed at such an early stage, though).

ilyagr commented 10 months ago

> This means that when we GC, we cannot go back to all previous operations anymore - don't we want to save some (say, from the past few weeks)?

Yes, but I didn't think we'd want to save all of them. In the model I proposed, choosing the cutoff point is simply choosing which operation's merge commit gets "rebased/reparented" (see also the note I added to my previous comment about reparenting; I just mean that the contents of the merge commit are irrelevant, only the parents matter).

I'll also mention that there's only a minor difference between Martin's huge merge commits and my tree of merge commits; my version simply creates less humongous (but still large) merge commits and makes some of them children of others to achieve the same effect that Martin achieves. OTOH, in Martin's version, only one humongous merge commit is relevant at any one time.

> > Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states

> This sounds like it would solve it from the jj side, but on the Git side, the commits are still referred to and cannot be GCed.

Good point. I guess in my model the GC of the op log has to correspond to the GC of the actual commits in those operations. I think that might be OK, especially in a first version. Alternatively, it could be worked around: we could remember some (or all) of the older operations without keeping the corresponding "merge commits" around, so the commits for those older operations can be GCed.