newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

Is there a way to remove duplicate commits? #535

Closed DoubleCouponDay closed 1 month ago

DoubleCouponDay commented 10 months ago

I used the BFG Repo cleaner to remove large files but forgot to clone fresh copies after pushing. Now my main trunk is full of duplicate commits. Is there a git-filter-repo command that can remove them?

newren commented 3 months ago

So, I'm guessing that you did a git pull, which merged the two different versions of history.

You'll want to find three different commits using git log: the tip of the BFG-rewritten history (${FINAL_COMMIT_OF_BFG_REWRITTEN_HISTORY}), the merge commit that joined the two versions of history (${MERGE_COMMIT}), and the first commit you made after that merge (${FIRST_NEW_COMMIT}).
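In case it helps to see what that looks like, here's a throwaway-repo sketch (the repo contents and branch names are made up purely for illustration; the real hashes come from your own git log) showing how the accidental merge and its two parents show up:

```shell
set -e
cd "$(mktemp -d)"                   # throwaway repo for illustration
git init -q -b main
git config user.email you@example.com
git config user.name you

echo one > file; git add file; git commit -qm 'shared base'
git checkout -qb rewritten          # stand-in for the BFG-rewritten history
echo two > file; git commit -qam 'rewritten tip'
git checkout -q main
echo three > other; git add other; git commit -qm 'old tip'
git merge -q -m 'accidental pull merge' rewritten   # the ${MERGE_COMMIT}

# The merge is the commit with two parents in the graph:
git log --oneline --graph --all

# Its first parent is the old tip, its second the rewritten tip:
MERGE_COMMIT=$(git rev-parse HEAD)
git rev-parse "${MERGE_COMMIT}^1" "${MERGE_COMMIT}^2"
```

In a real repo the two sides won't be a tidy branch pair like this, but the merge commit's `^1`/`^2` parents are still how you recover the two tips.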

Solution 1

If none of the commits in your history since that merge are themselves merge commits, then you could try rebasing your commits on top of the good history. Something like:

    git rebase --onto ${FINAL_COMMIT_OF_BFG_REWRITTEN_HISTORY} ${MERGE_COMMIT}

If you have any merge commits in your history since ${MERGE_COMMIT}, though, this would just mess things up.
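One quick way to check that precondition is git log --merges over the range; empty output means the range is linear and safe to rebase. A throwaway-repo sketch:

```shell
set -e
cd "$(mktemp -d)"                   # throwaway repo for illustration
git init -q -b main
git config user.email you@example.com
git config user.name you

echo a > f; git add f; git commit -qm 'stand-in for MERGE_COMMIT'
MERGE_COMMIT=$(git rev-parse HEAD)
echo b >> f; git commit -qam 'ordinary commit on top'

# Lists only merge commits in the range; empty output means
# everything since ${MERGE_COMMIT} is linear.
git log --merges --oneline "${MERGE_COMMIT}..HEAD"
```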

Solution 2

Create a replace object that is a new commit like ${FIRST_NEW_COMMIT} but which has ${FINAL_COMMIT_OF_BFG_REWRITTEN_HISTORY} as its parent instead of having ${MERGE_COMMIT} as its parent. Then use filter-repo to rewrite the history:

    git replace --graft ${FIRST_NEW_COMMIT} ${FINAL_COMMIT_OF_BFG_REWRITTEN_HISTORY}
    git filter-repo --proceed

A word of caution: if you have multiple commits that have ${MERGE_COMMIT} as a parent, you'll need to create new graft commits for all of them; with N such commits, you'll need to run git replace --graft ... N times. You only need to run git filter-repo --proceed once, but it needs to come after all N git replace --graft ... calls.
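If you want to enumerate those N children mechanically, git rev-list --children can list them; the awk parsing below is just one convenient way I'd pull the children out (a sketch in a throwaway repo, so eyeball the resulting list before grafting in your real repo):

```shell
set -e
cd "$(mktemp -d)"                   # throwaway repo for illustration
git init -q -b main
git config user.email you@example.com
git config user.name you

# Stand-in history: one commit playing the role of ${MERGE_COMMIT},
# two commits that each have it as a parent, and an unrelated commit
# playing the role of the good, BFG-rewritten tip.
echo a > f; git add f; git commit -qm 'stand-in for MERGE_COMMIT'
MERGE_COMMIT=$(git rev-parse HEAD)
git checkout -qb child1; echo 1 > g; git add g; git commit -qm 'first child'
git checkout -q main
git checkout -qb child2; echo 2 > h; git add h; git commit -qm 'second child'
git checkout -q --orphan good
echo good > f; git add f; git commit -qm 'stand-in for the good tip'
FINAL_COMMIT_OF_BFG_REWRITTEN_HISTORY=$(git rev-parse HEAD)

# Graft every commit that lists ${MERGE_COMMIT} as a parent onto the
# good history: one git replace --graft call per child.
for child in $(git rev-list --children --all |
    awk -v m="$MERGE_COMMIT" '$1 == m { for (i = 2; i <= NF; i++) print $i }'); do
  git replace --graft "$child" "$FINAL_COMMIT_OF_BFG_REWRITTEN_HISTORY"
done

git replace -l                      # one replacement ref per grafted child
# ...then run git filter-repo --proceed once to make the grafts permanent.
```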

Solution 3

This one I can't give you any pseudo-code for. If you can perform a filtering operation that again modifies the old commits to match the new commits, but is simultaneously a no-op on the new commits, and removes the now-degenerate merge commit, that would also solve this problem.

I don't remember the details of what additional modifications BFG makes (like [formerly OLDHASH], Former-commit-id:, and .REMOVED.git-id) or whether it has added more or changed them, and you really need to be careful to filter in precisely the same way it did or you'd end up with even more variants. While this could theoretically be done with git-filter-repo, since it has to be filtered in precisely the same way, it'd probably be easier to just run bfg again. Even then, I'm not sure running bfg again would really satisfy the constraint of being exactly the same for old commits while being a no-op for the previously-filtered or new commits. But if you can nail it exactly, then this method would remove your duplicate commits by mapping multiple commits to one. Hopefully, mapping multiple to one wouldn't trigger any weird bugs in BFG.

Summary

Anyway, between the three, I suspect solution #2 is the most robust and easiest. Does that help?

celinesin commented 3 weeks ago

I think I have a related problem, and I got a bit lost in the manual trying to figure out how to use git filter-repo:

I used git filter-branch to remap the email address from an autogenerated email to a "correct" one. I pushed this to my repo and all appeared fine. Then, I also didn't clone fresh copies on all my machines (oops). Now I have a repo with many duplicate commits (one with the wrong email address, and one with the new correct email address).

I was thinking I could just use git filter-repo --commit-callback to try to clean up the duplicated entries (running once for each "wrong" email address).

I tried

    git filter-repo --commit-callback '
      if commit.author_email == b"WRONGEMAIL":
        commit.ignore = True
    '

But after that, git log still contains entries with WRONGEMAIL -- is there a 2nd step that I've missed?

Thanks in advance!