newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)
Other
8.21k stars 698 forks source link

Commit hash is not rewritten when referenced in other commit message #108

Open jojastahl opened 4 years ago

jojastahl commented 4 years ago

Hi, you build a great tool which saved me much time!

I have the issue, that multiple commit hash references in commit message are not updated. These references get listed in suboptimal-issues but, the commit that is referenced still exists in the filtered repo, however with a new hash. That new hash is correctly listed in the commit-map file. Example:

Commit message:

...
(cherry picked from commit c046c372)
...

suboptimal-issues:

The following commits were filtered out, but referenced in another
commit message.  The reference to the now-nonexistent commit hash
(or a substring thereof) was left as-is in any commit messages:
  b'c046c372'

commit-map:

c046c372ff5309fd9019f6d23e330850d1706113 68700bc14143fcbbdbbd3f28ab4dd3491931a5fd

I ran the command:

git filter-repo --force --replace-refs delete-no-add

I cannot publish the repo, but the the basic thing I'm doing is:

Is this a known problem?

jojastahl commented 4 years ago

This seems to have to do with the order in which fast-export outputs the commits. I found a workaround by calling fast-export myself and giving --date-order --all as what to output.

The command line used:

git fast-export --show-original-ids --signed-tags=strip --tag-of-filtered-object=rewrite --fake-missing-tagger --reference-excluded-parents --no-data --mark-tags --use-done-feature --date-order --all|git filter-repo --stdin --force --replace-refs delete-no-add

Proposal: Add a --date-order option to filter-repo, or maybe even making that the default.

newren commented 4 years ago

Very interesting problem. I'm not sure how to create a repository that duplicates this. The only way I could imagine this triggering is if one commit was output multiple times by fast-export (which violates some basic assumptions that I thought fast-export worked under)...and further, that with one of the times it handled that commit it ended up modifying it, while with the other time it decided to prune it.

Can we check if something like that is indeed happening? If you go to your original repo and run

  git fast-export --show-original-ids --no-data --all | \
    grep ^original-oid.c046c372ff5309fd9019f6d23e330850d1706113$

does it show multiple lines of output? Maybe it'd be good to check for other objects being repeated too; does

  git fast-export --show-original-ids --all | grep ^original-oid | sort | uniq -cd

show any output from that repo? (You might need to add some of the fast-export options dealing with tags to those command if the fast-export fails mid-run on a tag-related issue.)

Funnily enough, my first commit ever for git.git was 784f8affe4 ("fast-export: ensure we traverse commits in topological order", 2009-02-10), which came from the fact using date order would actually result in some bad munging of the contents of the repository. Perhaps some other change to fast-export in the mean time has fixed that, but I'm pretty leery of using it. Even if it was safe, though, I think it'd only fix your issue through luck; if the timestamps on some commits were swapped then I can't see how you could possibly not just run into the same problem.

jojastahl commented 4 years ago

I'll do the checks you asked for, probably tomorrow. This commit was not the only one, I had more behaving like this. As far as I remember it was not pruned. 

As I understand the git docs, topo order will output one branch after the other, as long there is no commit depending on the other, but date order will keep the date order as long as a commit is not used as parent and its tree must therefore be output first. 

So alphabet order giving date order, you would get :

Topo output order:

A---C---D--------------F
   \-------------B---E--/

Date output order:

A---------C---D--------F
   \--B-------------E--/

So if D mentions B in its commit message, in topo order you cannot resolve that reference to the new Sha of B as you did not see B yet.

jojastahl commented 4 years ago

As I modified my conversion system a little bit I do not have the same hash any more, but a comparable case:

Before conversion:

git log --grep 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1 --all
commit 3878c13c9d9cb23605232d507b66070ef9962472
...
(cherry picked from commit 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1)

After conversion:

git log --grep 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1 --all
commit 34eb9eb41da21f8ceb3337c049766d4acf25a09b
...
(cherry picked from commit 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1)

commit map:

3878c13c9d9cb23605232d507b66070ef9962472 34eb9eb41da21f8ceb3337c049766d4acf25a09b
5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1 9e010cad1216d7bf076f2ea999024d4c5a237846

suboptimal issues:

The following commits were filtered out, but referenced in another
commit message.  The reference to the now-nonexistent commit hash
(or a substring thereof) was left as-is in any commit messages:
  b'443bdb1c'
  b'5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1'
  b'f3c9370d'

Your questions:

git fast-export --show-original-ids --no-data --all | \
    grep ^original-oid.5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1$
original-oid 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1
# I added --no-data
git fast-export --show-original-ids --no-data --all | grep ^original-
oid | sort | uniq -cd
# No output

To show you the impact of topo / date ordering:

# Topo order
git fast-export --show-original-ids --no-data --all | grep "^original-oid.\(5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1\|3878c13c9d9cb23605232d507b66070ef9962472\)"
original-oid 3878c13c9d9cb23605232d507b66070ef9962472
original-oid 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1

# Date order
git fast-export --show-original-ids --no-data --date-order --all | grep "^original-oid.\(5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1\|3878c13c9d9cb23605232d507b66070ef9962472\)"
original-oid 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1
original-oid 3878c13c9d9cb23605232d507b66070ef9962472
newren commented 4 years ago

Here's a simple way to reproduce that triggers the problem for me:

#!/bin/bash

rm -rf stupid
git init stupid
cd stupid

seq 1 10 >numbers
git add numbers
git commit -m Initial

git branch A
git branch B

git checkout A
seq 11 15 >>numbers
git commit -am more

git checkout B
seq 0 10 >numbers
git commit -am EarlierStart

git cherry-pick -x A

Now, after running that script to set up a repo, I find that

git fast-export --show-original-ids --no-data --all

and

git fast-export --show-original-ids --no-data --date-order --all

both output the commits in the same order, and will show the commits from branch B (which includes the cherry-pick) before the commits from branch A (which includes the commit which was cherry-picked), so that we don't have the data we need to rewrite the "cherry picked from..." message to refer to the new commit id.

In other words, --date-order doesn't help at all. The fact that it does help in your case is great but cannot be relied upon. And even if it did function as a workaround, --date-order will slow fast-export down, so a better solution would still be desirable.

I think the fix needs to go into fast-export; it probably needs an extra flag (named something like --commit-reference-order) which will make fast-export look at the commit message and take an commit references and treat them similar to parents in one way: don't show a commit until after its parents/references have first been shown.

jojastahl commented 4 years ago

This relates me to what I think of is the biggest flaw in git:

cherry-pick isn't recordered properly. In the standard case it isn't recorded at all, or it is only recorded in the commit message.

I would prefer a system where a cherry-pick would create a pointer to source revision in a way that a merge links to the parent. Opposed to a merge parent link which includes the complete history, the cherry-pick link would include (or mention) only the referenced commit. Then there would be a standard (and probably fast) way for tools to show which commits from one branch have been picked into another, even if the diff is not exactly the same - which it isn't in many cases when you need to pick bug fixes or even features that a customer requests in his project.

That is THE feature that I miss from SVN.

discnl commented 3 years ago

both output the commits in the same order, and will show the commits from branch B (which includes the cherry-pick) before the commits from branch A (which includes the commit which was cherry-picked), so that we don't have the data we need to rewrite the "cherry picked from..." message to refer to the new commit id.

It's likely that the order is wrong because the timestamp of each commit is the same (it is here, with a 1s resolution). After following each git commit by a sleep 1 I'm seeing sensible orders, however there's no difference between them.

Here's an example which does have differences between the two orders. It makes use of git-svn clone to create the git repository because I couldn't quickly reproduce this with git directly, nor does it occur (in this example anyway) with e.g. svn-all-fast-export.

#!/bin/bash

rm -rf repo.svn co repo.git

svnadmin create repo.svn --compatible-version "1.7" # git-svn clone currently doesn't handle FS format 8
svn co file://`pwd`/repo.svn co

pushd co

mkdir -p trunk branches/branch
echo foo > trunk/file
cp trunk/file branches/branch/
svn add trunk branches
svn commit -m "1) Initial layout"
sleep 1 # For sanity (in case SVN is faster elsewhere and would yield same commit timestamps)

echo branch > branches/branch/file
svn commit -m "2) First change branch..."
sleep 1

echo trunk > trunk/file
svn commit -m "3) ... and then change trunk. This message refers to earlier r2."

popd

git svn clone -s file://`pwd`/repo.svn repo.git

cd repo.git
git fast-export --show-original-ids --no-data --topo-order --all >topo
git fast-export --show-original-ids --no-data --date-order --all >date
diff -u date topo | wc -l

This is of course a contrived example and while it has to make use of the way git-svn clone decides to work here (introducing an extra commit) the end result is the same as what can happen with svn to git migrations: referencing a commit in a message, before having seen the commit that's referred to. In the past using date order always fixed these commit message conversion issues for me and others as well. It (still) requires patching git-filter-branch though. Thanks to git-filter-repo that's no longer needed and more importantly a conversion's bottleneck once again is creating the git repository, and not 'merely' post-processing its commit mesages 👍 (I kept my promise to upgrade!).