newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

[Feature request] Convert blobs into promisors #404

Closed by gab 2 years ago

gab commented 2 years ago

Partial clone with --filter=blob:none (or some size threshold) is starting to look like a serious alternative to LFS, but one of the big obstacles is that the local history will grow over time with no way to get rid of older revisions of blobs. The only cleanup solution so far is re-cloning, but doing so while avoiding the loss of local branches, tags, stashes, notes and so forth is a complicated proposition.
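To make that growth concrete, here is a minimal sketch that uses a throwaway local repository in place of a real server. The paths, the file name big.bin, and the committer identity are all hypothetical, and the two uploadpack settings are only there so the local "server" accepts filtered and on-demand fetches. It counts how many blobs are missing right after a blobless clone and again after looking at an old revision.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical "server" repository holding two revisions of one file.
git init -q "$work/server"
cd "$work/server"
git config user.name "Example"
git config user.email "example@example.invalid"
# Let clients request filtered clones and fetch individual objects on demand.
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true
echo "version 1" > big.bin
git add big.bin
git commit -q -m "first revision"
echo "version 2" > big.bin
git commit -q -am "second revision"

# Blobless partial clone: commits and trees are downloaded, blobs only when needed.
git clone -q --filter=blob:none "file://$work/server" "$work/client"
cd "$work/client"

# Objects that are reachable but not stored locally are printed with a leading "?".
# "|| true" keeps "set -e" from aborting when grep counts zero matches.
missing_before=$(git rev-list --objects --all --missing=print | grep -c '^?' || true)

# Reading an old revision fetches its blob, which then stays in the local store.
git cat-file -p HEAD~1:big.bin > /dev/null
missing_after=$(git rev-list --objects --all --missing=print | grep -c '^?' || true)

echo "missing blobs after clone: $missing_before"
echo "missing blobs after reading old revision: $missing_after"
```

Every old revision that gets inspected this way is downloaded and kept, which is the local growth described above.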

Introducing a new feature for that in git is likely to take a lot of iterations while they settle on the design, and there haven't been many developments around this so far. So git-filter-repo could do a lot by providing a way to essentially revert the repo to a freshly partial-cloned state, re-applying the filter to convert existing blobs back into promisors. Just like a partial clone, the operation would be non-destructive to the history, with commit hashes remaining the same.

Bonus points if you provide an option to avoid needlessly flushing blobs that are currently checked out, to avoid a re-download from the server 👍.

newren commented 2 years ago

This request has nothing to do with rewriting history; it's only about pruning downloaded blobs. So, it doesn't fit in the design space of filter-repo.

I think this belongs in core git. See https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@gmail.com/ for a proposal.

Also, one way to avoid the growth of history is to do a better job of limiting operations to the sparse specification in sparse checkouts. See https://lore.kernel.org/git/pull.1367.v3.git.1665269538608.gitgitgadget@gmail.com/ for that side of things.

gab commented 2 years ago

The first proposal you mention is already merged as far as I can tell. I did test fetch --refetch, too, but it doesn't clean existing blobs. It must be for switching to less strict filters only.

The second discussion you mention is unrelated to partial blobless clones, it's more about making sure everything related to excluded folders in sparse checkouts stays excluded from the local repo - as far as I can tell anyways.

Too bad for the "out of scope" answer. Your tool, your call, but I would've guessed that with the infrastructure you have in place, the operation would've been trivial.

newren commented 2 years ago

> The first proposal you mention is already merged as far as I can tell.

Nope, it's certainly not. You can verify that by going to a clone of git.git, fetching, and running git log -S--filter -p --all -- builtin/repack.c, which shows that the string --filter has never occurred in that file.
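For readers unfamiliar with the -S ("pickaxe") option used in that command, here is a small sketch on a throwaway repository. The stub file content and committer identity are made up for illustration; only the path builtin/repack.c and the search string --filter come from the command above.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT

git init -q "$demo"
cd "$demo"
git config user.name "Example"
git config user.email "example@example.invalid"

# Hypothetical stand-in for the real file; it never contains "--filter".
mkdir builtin
printf 'int cmd_repack(void);\n' > builtin/repack.c
git add builtin/repack.c
git commit -q -m "add repack stub"

# -S<string> lists commits that change how many times <string> occurs.
# A string the file has contained shows up; one it never contained does not.
hits_present=$(git log -Scmd_repack --all --oneline -- builtin/repack.c | wc -l | tr -d ' ')
hits_absent=$(git log -S--filter --all --oneline -- builtin/repack.c | wc -l | tr -d ' ')

echo "commits touching 'cmd_repack': $hits_present"
echo "commits touching '--filter': $hits_absent"
```

An empty result for the second search is what indicates that the string has never appeared in the file on any ref.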

Given that someone else has the exact same problem description as you did, and provided patches to implement it in git.git, doesn't that suggest there's already a solution for you? If you want it, perhaps try out those patches and report on how well they work for you (and whether they meet your needs or you have a slightly different use case they might want to consider also addressing)? Now's a good time to do that since the patches are still under consideration and have not been merged.

> I did test fetch --refetch, too, but it doesn't clean existing blobs. It must be for switching to less strict filters only.

Yeah, that's unrelated. That's for getting more information, without letting the server know about what you have already downloaded. The new repack --filter in the first proposal is the thing that implements what you're asking for.

> The second discussion you mention is unrelated to partial blobless clones, it's more about making sure everything related to excluded folders in sparse checkouts stays excluded from the local repo - as far as I can tell anyways.

Folks often use partial clones and sparse-checkouts together, and there's a desire to make them work even better together. While that discussion is about sparse-checkouts, it does specifically discuss blobless partial clones in more than one place. If you were using both, then the stuff in that document would be highly relevant to this request of yours.

But, you are right that the two features can be used independently and if you aren't using sparse-checkouts, then the stuff in that discussion isn't relevant to you.

> Too bad for the "out of scope" answer. Your tool, your call, but I would've guessed that with the infrastructure you have in place, the operation would've been trivial.

I'm not sure where you are getting that from. filter-repo does not directly write to the git history, it only does so via invoking git commands, particularly git fast-export and git fast-import. Many features and capabilities in filter-repo were only possible via first patching git itself to add new capabilities. And if I were to add something to do what you suggest here, it'd be done the same way -- meaning, git would first need to have some kind of capability I could invoke to do that. But, in this case, since the git feature is doing precisely the thing you ask for, I don't see what value there would be in having filter-repo wrap it. filter-repo exists because typical rewrites need a whole bunch of glue and programmability around their features to make a functional tool; filter-repo is that glue. In this case, no glue is needed whatsoever; the proposed patches provide a flag to git doing exactly what you are requesting as far as I can tell.
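As a rough illustration of the plumbing mentioned above, the sketch below pipes git fast-export into git fast-import between two throwaway repositories. This is not filter-repo's actual invocation, which passes more options and edits the stream in between; the repository paths, file name, and identity are hypothetical.

```shell
#!/bin/sh
set -eu

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical source repository with two commits.
git init -q "$tmp/src"
cd "$tmp/src"
git config user.name "Example"
git config user.email "example@example.invalid"
echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git commit -q -am "second"

# Empty destination repository that will receive the replayed history.
git init -q "$tmp/dst"

# History leaves the source as a text stream and is rebuilt from that stream.
# A rewriting tool would sit in the middle of this pipe and modify the stream.
git fast-export --all | git -C "$tmp/dst" fast-import --quiet

src_count=$(git rev-list --all --count)
dst_count=$(git -C "$tmp/dst" rev-list --all --count)
echo "commits in source: $src_count, commits in destination: $dst_count"
```

Nothing in this pipeline touches the object store directly, which is why a capability such as pruning downloaded blobs would first have to exist in git itself before a wrapper could call it.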

Anyway, hope that helps.

gab commented 2 years ago

Thanks for the detailed answer. It does shed light on the current state of things and the relationship of your tool with git.