newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

Option to not bake out shallow graft commit parent? #234

Open DougLeonard opened 3 years ago

DougLeonard commented 3 years ago

I don't know if filter-repo is just using what git gives it or how hard this would be, but I have a use case where I need shallow, as in .git/shallow , commits with unreachable parents to not have their parent sha's removed.

The use case is that I've implemented improvements on the subtree method that, among other things, avoid duplicates without squashing by re-incorporating the remote commits onto cleanly branched local history (without rewriting that history). To do this, when pulling from a subtree it matches up the pulled split subtree commits with split versions of commits from the local tree, and uses that to remap (replace-ref) the commits in the pulled branch back onto the local history, of course burning it all in with git-filter-repo. However, I don't want to force the user to first rebuild their local history and bake out pre-existing replace refs or grafts for that to work. So I need the pulled branch to also not bake those out or they won't match up, nor will their children. I can deal with standard replace refs just by stashing them while doing the rewrite. But I don't know how to bypass rewriting shallow grafts. A bonus would be dealing with .git/info/grafts too.

The solution must be automatable. Any time a branch includes one or more of these commits, it needs to just work, and I can't disable following all replace refs either.

A slightly more detailed explanation is here:

https://gitlab.com/douglas.s.leonard/alltrees/-/issues/8

present behavior:
when filtering a branch that includes shallow commits with unreachable parents, git-filter-repo turns them into standard parentless commits.

desired behavior:
Option to rebuild the history maintaining the unreachable parent ids and their entry in .git/shallow.

secondary goal:
Do the same for .git/info/grafts commits.

newren commented 3 years ago

Hello! I read your other email a while back and started reading over your project. Sorry for not responding. It sounds interesting, even if I don't have a personal use.

For this issue, filter-repo doesn't actually do the graft or replace ref handling; it's all done inside fast-export. As per the git-replace(1) manual, you can set the GIT_NO_REPLACE_OBJECTS environment variable to avoid replace refs being used. That might be easier than stashing the replace refs. I don't know if that affects .git/info/grafts, though. (It might...but does anyone even use the antiquated grafts mechanism anymore?)
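As a minimal sketch of the GIT_NO_REPLACE_OBJECTS suggestion: the wrapper below runs a single command with replace refs ignored and leaves refs/replace/ untouched. The helper name without_replace_refs is made up for illustration. Note that the variable disables every replace ref for that command, so it does not help when only some of them should be ignored.

```shell
# Run any command with replace refs ignored, without touching refs/replace/.
# (without_replace_refs is a hypothetical helper name, not part of git.)
without_replace_refs() {
    env GIT_NO_REPLACE_OBJECTS=1 "$@"
}

# Usage:
#   without_replace_refs git log --oneline
#   without_replace_refs git fast-export --all --reference-excluded-parents --no-data
```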

As for interaction with .git/shallow, I'd never even considered that possibility before. I played around, but it looks like this issue is entirely inside of git:

First, with the following setup:

git init -b main foobar
cd foobar
seq 1 10 >numbers
git add numbers 
git commit -m initial
seq 1 15 >numbers 
git add numbers 
git commit -m more
seq 1 20 >numbers 
git add numbers 
git commit -m "Even more"

cd ..
git clone --depth 1 file://$(pwd)/foobar/ foobaz
cd foobaz

I then see the following from git log:

$ git log --parents --oneline
d45f8a3 e613acc (HEAD -> main, origin/main, origin/HEAD) Even more
e613acc (grafted) more

In other words, git-log says that commit e613acc, our second commit, has NO parents. That's simply not true:

$ git cat-file -p HEAD~1
tree 1d8d95b09437010bf189e103c9ce48bfa318aa97
parent 4bc06d1d918efd7d3dc82e6ab4c9de0eb6508800
author Elijah Newren <newren@gmail.com> 1617410478 -0700
committer Elijah Newren <newren@gmail.com> 1617410478 -0700

more

That shows the parent right there -- 4bc06d1d9. The fast-export and log commands both use the same revision walking machinery, so naturally the log issue spills over into fast-export:

$ git fast-export --all --reference-excluded-parents --no-data
reset refs/heads/main
commit refs/heads/main
mark :1
author Elijah Newren <newren@gmail.com> 1617410478 -0700
committer Elijah Newren <newren@gmail.com> 1617410478 -0700
data 5
more
M 100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd numbers

commit refs/heads/main
mark :2
author Elijah Newren <newren@gmail.com> 1617410608 -0700
committer Elijah Newren <newren@gmail.com> 1617410608 -0700
data 10
Even more
from :1
M 100644 0ff3bbb9c8bba2291654cd64067fa417ff54c508 numbers

reset refs/remotes/origin/main
from :2

Note that the first commit in the stream has no "from" reference, which is how parents are specified in the fast-import stream.
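To spot this symptom mechanically, a stream like the one above can be scanned for commits that carry no "from" line. This is only a sketch: the function name commits_without_from is hypothetical, and it assumes every data payload ends with a newline, as the commit messages here do. As discussed later in the thread, a missing "from" alone does not prove the commit is a graft.

```shell
# Read a fast-export stream on stdin and print the mark of every commit
# that has no "from" line. (commits_without_from is a hypothetical name.)
commits_without_from() {
    LC_ALL=C awk '
        function report() {
            if (in_commit && !has_from) print mark
            in_commit = 0
        }
        # Skip the payload of "data N" so message text is never parsed as commands.
        skip > 0        { skip -= length($0) + 1; next }
        /^data [0-9]+$/ { skip = $2 + 0; next }
        /^commit /      { report(); in_commit = 1; has_from = 0; mark = "(no mark)"; next }
        /^reset /       { report(); next }
        in_commit && /^mark / { mark = $2 }
        in_commit && /^from / { has_from = 1 }
        END             { report() }
    '
}

# Usage:
#   git fast-export --all --reference-excluded-parents --no-data | commits_without_from
```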

There is a tangential similarity here to how fast-export used to handle negative revision specifications (see commits https://git.kernel.org/pub/scm/git/git.git/commit/?id=530ca19c02 and https://git.kernel.org/pub/scm/git/git.git/commit/?id=af2abd870b), but those cases differed in that the log machinery would have shown the right thing; it was just that fast-export hadn't exported the "parent" object and thus couldn't use an integer "mark" reference to something earlier in the stream. In this case, fast-export is probably being lied to about what the parents are. The fact that the log output has the "grafted" annotation, though, suggests there is a way we can append and use extra info somehow.

So, ultimately, this is something that is going to need to be fixed in git. I'm way oversubscribed on the number of upstream git things on my plate, but if you wanted to report it to the git mailing list and/or work on a fix, that'd be appreciated. Otherwise, I'll just leave this issue open and might eventually get a chance to come back to it.

DougLeonard commented 3 years ago

Thanks for the very detailed reply. No worries about responses. I'm not paying much for your tool. I guess the situation is about what I suspected, as implied by "dealing with whatever git gives it."

This maybe just isn't worth the trouble in the end, but maybe I'll look a step deeper, we'll see. The whole point of a subtree-like system, and the reason I'm using it, is that regular developers don't even need to know it exists, so it's not a really strong constraint that the maintainer who does sync the trees needs a full ungrafted repo, or well, not usually, ... I think. It's not for me and my small projects anyway. I was just trying to tie off a loose end.

One possible work-around I thought of is to try to detect these commits (you can detect them just from their .git/shallow entry) and use your --depth option to cut off rewriting at their children. I'm not sure if that works. Since there can be multiple children of said commits at different depths from the filter reference, hacking up a solution like that would be pretty ugly at best.
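The detection half of that idea could look roughly like the sketch below, which lists the commits reachable from a ref that also appear in .git/shallow. The function name shallow_boundaries_in is hypothetical, and the sketch assumes it is run from inside the repository so that the path printed by git rev-parse --git-path resolves correctly.

```shell
# Print every commit reachable from REF that is listed in .git/shallow.
# Exit status is 1 when no such commit is reachable (grep found nothing).
# (shallow_boundaries_in is a hypothetical helper name.)
shallow_boundaries_in() {
    shallow_file=$(git rev-parse --git-path shallow) || return
    [ -s "$shallow_file" ] || return 0    # no shallow file: nothing to report
    git rev-list "$1" | grep -Fx -f "$shallow_file"
}

# Usage:
#   shallow_boundaries_in main
```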

Anyway, you can call it "won't fix" or leave it open and add a "when pigs fly" milestone to it, or whatever keeps it from cluttering your todo list too much. The situation is understood, at least in generalities. Thanks for looking at it.

DougLeonard commented 3 years ago

Oh regarding the standard replace refs, that's a nice tip. However, I do need replace refs active during the re-write to repatriate the split commits. I build those in reverse with filter-repo first and then flip them. I just need to not have the pre-existing replace refs active, or only sometimes. That's not quite just a matter of stashing files either, but anyway, those tools are already in place.

DougLeonard commented 3 years ago

I did find this:

https://stackoverflow.com/questions/44112593/how-to-get-parent-of-specific-commit-in-git

The second answer, by Nayagam, specifically addresses shallow commits with this solution to find parents: git cat-file -p commit_id

This does seem, with a little parsing, to get all parents. So in principle if there is no "from" in the fast-export stream, it's possible to check for the presence of the present commit sha listing in .git/shallow, and then if that's also there, find the parents from this and inject them? I really have no familiarity with the details of how fast-export works or how you interact with it, so I don't know whether there is a simple opportunity to perform such a check and injection.
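The "little parsing" might be sketched as follows. Both function names are hypothetical. The parser stops at the first blank line so that message text starting with "parent " is never mistaken for a header.

```shell
# Read a raw commit object on stdin and print one parent id per line.
# Only the header (everything before the first blank line) is examined.
# (parents_from_commit_text and real_parents are hypothetical names.)
parents_from_commit_text() {
    sed -n -e '/^$/q' -e 's/^parent //p'
}

# Print the parents recorded in the commit object itself, which for a
# shallow commit includes the parent that git log does not show.
real_parents() {
    git cat-file -p "$1" | parents_from_commit_text
}

# Usage:
#   real_parents HEAD~1
```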

newren commented 3 years ago

> I did find this: ... git cat-file -p commit_id

Sure, note that I also included this exact command in my response above in addition to the "grafted" annotation that the log command uses.

> This does seem, with a little parsing, to get all parents. So in principle if there is no "from" in the fast-export stream, it's possible to check for the presence of the present commit sha listing in .git/shallow, and then if that's also there, find the parents from this and inject them? I really have no familiarity with the details of how fast-export works or how you interact with it, so I don't know whether there is a simple opportunity to perform such a check and injection.

That'd be a problematic way to handle it for a few reasons:

(1) No "from" is an effect of fast-export.c, which comes from it handling a commit with no parents. Why wait until the commit is handled and it has already printed the wrong information to try to handle it differently? Switching from effect to cause, fast-export would instead want to check the parents of a commit before handling it...except that's not quite right either because...

(2) No parents is casting an unnecessarily wide net; some commits without parents are not grafts. log-tree.c has a "for_each_commit_graft(add_graft_decoration, filter);" call showing how to mark just the graft commits; then we can check each commit to see "is this a graft commit" rather than "is this a no-parent commit" since the former is what we really want.

(3) This would still break history because it wouldn't get all the grafts. What if one of those shallow grafts is one of the parents of a merge commit? Your mechanism of looking for commits with no parent would continue to erroneously transform merges into non-merges. You'd instead want to look for all grafts and then look up the real parents.

So, I'd say you'd want to use log-tree.c's marking of decorations (for_each_commit_graft(add_graft_decoration, filter)), use the lookup_decoration() machinery (that I'm not that familiar with) when working on handle_commit() in fast-export.c to determine when a commit is a graft, then load the commit object like cat-file.c does (read_object_file()) and manually parse it since commit.c's parse_commit_buffer() will just throw away shallow grafts. That'll give you a list of raw parent names (strings) in addition to actual commit objects in commit->parents, but the raw parent names will have more names since it'll also include the graft. Then you'd just need to modify the loop over commit->parents in fast-export's handle_commit() to loop over the raw parent names, and when the raw parent name corresponds to a non-graft you can use the original logic but when it's a graft you need to emit either a "from " or "merge ". That'll get the parents right, but the diffs will still be wrong. Whenever the first parent is a graft, you'd need to modify the code to mimic the full_tree behavior (just look for "full_tree" in that function) to get the diffs right.

All of this, of course, should only be done when --reference-excluded-parents is passed to fast-export (though filter-repo always does).
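The real fix described above belongs in fast-export's C code. As a rough shell-level illustration of the same comparison (raw parent names versus the parents the revision walker reports), the sketch below prints the parents that a graft has hidden. The function name hidden_parents is hypothetical, and the sketch assumes git rev-list --parents reflects the same grafted view that git log --parents showed earlier.

```shell
# Print the parents that are recorded in the commit object but are not
# reported by the revision walker, i.e. the ones a graft has cut away.
# Exit status is 1 when nothing is hidden (grep printed nothing).
# (hidden_parents is a hypothetical helper name.)
hidden_parents() {
    # Parents as the walker sees them: drop the first field (the commit itself).
    walked=$(git rev-list --parents -n 1 "$1" | tr ' ' '\n' | sed 1d)
    # Raw parents from the object header, minus the walked ones.
    git cat-file -p "$1" |
        sed -n -e '/^$/q' -e 's/^parent //p' |
        grep -Fxv -e "$walked"
}

# Usage:
#   hidden_parents HEAD~1
```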

DougLeonard commented 3 years ago

Oops, yes, you did write that ;) I was actually thinking to check for shallow commits by looking for the sha in the .git/shallow file (which I wrote ;) ) , not just parentless commits, and yes probably the check for no from is redundant then and overly broad. I was also thinking to look for multiple parents (for merges), but I get your point about getting fast-export to just do it right when asked. I haven't digested all the details yet. This is all more plumbing level than I am quite familiar with yet, but you gave me a good bit to go on there, so it may sink in. Thank you.

jahess commented 9 months ago

I was certainly lost in the grafting and subtree part of this discussion. However, "shallow" is what led me here and it was about the only hit that I thought might be relevant.

I have a processing chain built around git-filter-repo. Works great. Thank you newren!

I need to add a new feature that deals with shallow clones. As a quick trial of the git underpinnings I:

1) git cloned a repository with --depth 1.
2) git fast-exported that clone to a file.
3) then I fast-imported that file into a newly initialized repository.

This resulted in a correct reproduction of the files but the one commit hash doesn't match the original. This is a non-starter for me -- I need commit hashes to match. I'm assuming it is because the "from" commit that was recorded in the original code isn't present and rather than erroring on a missing "from," fast-import is just doing the best it can.
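The mismatch follows from how a commit id is computed: it is a hash over the commit object's full text, including every "parent" line, so a commit re-created without its parent necessarily gets a different id. The sketch below recomputes an object id by hand. The function name git_object_id is hypothetical, and the sketch assumes a repository using the SHA-1 object format.

```shell
# Compute the id git would assign to an object of TYPE whose content is
# read from stdin: SHA-1 over "TYPE SIZE\0" followed by the content.
# (git_object_id is a hypothetical helper; assumes the SHA-1 object format.)
git_object_id() {
    body=$(mktemp) || return
    cat >"$body"
    size=$(( $(wc -c <"$body") ))
    { printf '%s %d\000' "$1" "$size"; cat "$body"; } |
        if command -v sha1sum >/dev/null 2>&1; then sha1sum; else shasum -a 1; fi |
        cut -d ' ' -f 1
    rm -f "$body"
}

# Usage: in a SHA-1 repository with no replace refs in play, this should
# print the same id as "git rev-parse HEAD":
#   git cat-file commit HEAD | git_object_id commit
# Removing the "parent" line from the text before hashing yields a different id.
```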

I'm assuming I need to "fake out" the "from"(s) on the commit. Any advice on whether inserting some fast-import command(s) could help me with this? Alias seems like it might be a candidate but... blind trial and error can be so time consuming for this sort of thing.

Also, clients are very hard to get updated so I have only tried with git 2.31 and such.

Thanks in advance!

jahess commented 9 months ago

Started the trial and error...

To the commit, added the missing:

from <hash>

Result:

fatal: Not a valid commit: <hash>

Tried changing the new "from" to "merge" and same result.

Before the "reset... commit..." added:

alias mark :99999 to <hash>

Then added "from :99999" after the commit message and same:

fatal: Not a valid commit: <hash>

Replaced "from" with "merge" and same result.

Needing another idea to try.

jahess commented 9 months ago

Trying fast-import with a --import-marks file of:

:99999 <hash>

Result:

fatal: object not found <hash>

jahess commented 9 months ago

Discovered that shallow copies can't be pushed to bare repositories. That's a problem for my use case.

I then tried letting git fast-import recreate the shallow copy (which would no longer be marked shallow), but that results in a mismatching hash. Then I tried a replace ref to the correct desired hash. Looked like it would work until I tried pushing the refs and the repo to a bare repo and... unpack failure. Seems like that should have worked but maybe I'm trying to use replace in an upside down sort of way.