Open jojastahl opened 4 years ago
This seems to have to do with the order in which fast-export outputs the commits. I found a workaround by calling fast-export myself and giving --date-order --all
as what to output.
The command line used:
git fast-export --show-original-ids --signed-tags=strip --tag-of-filtered-object=rewrite --fake-missing-tagger --reference-excluded-parents --no-data --mark-tags --use-done-feature --date-order --all|git filter-repo --stdin --force --replace-refs delete-no-add
Proposal:
Add a --date-order
option to filter-repo, or maybe even making that the default.
Very interesting problem. I'm not sure how to create a repository that duplicates this. The only way I could imagine this triggering is if one commit was output multiple times by fast-export (which violates some basic assumptions that I thought fast-export worked under)...and further, that with one of the times it handled that commit it ended up modifying it, while with the other time it decided to prune it.
Can we check if something like that is indeed happening? If you go to your original repo and run
git fast-export --show-original-ids --no-data --all | \
grep ^original-oid.c046c372ff5309fd9019f6d23e330850d1706113$
does it show multiple lines of output? Maybe it'd be good to check for other objects being repeated too; does
git fast-export --show-original-ids --all | grep ^original-oid | sort | uniq -cd
show any output from that repo? (You might need to add some of the fast-export options dealing with tags to those command if the fast-export fails mid-run on a tag-related issue.)
Funnily enough, my first commit ever for git.git was 784f8affe4 ("fast-export: ensure we traverse commits in topological order", 2009-02-10), which came from the fact using date order would actually result in some bad munging of the contents of the repository. Perhaps some other change to fast-export in the mean time has fixed that, but I'm pretty leery of using it. Even if it was safe, though, I think it'd only fix your issue through luck; if the timestamps on some commits were swapped then I can't see how you could possibly not just run into the same problem.
I'll do the checks you asked for, probably tomorrow. This commit was not the only one, I had more behaving like this. As far as I remember it was not pruned.
As I understand the git docs, topo order will output one branch after the other, as long there is no commit depending on the other, but date order will keep the date order as long as a commit is not used as parent and its tree must therefore be output first.
So alphabet order giving date order, you would get :
Topo output order:
A---C---D--------------F
\-------------B---E--/
Date output order:
A---------C---D--------F
\--B-------------E--/
So if D mentions B in its commit message, in topo order you cannot resolve that reference to the new Sha of B as you did not see B yet.
As I modified my conversion system a little bit I do not have the same hash any more, but a comparable case:
Before conversion:
git log --grep 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1 --all
commit 3878c13c9d9cb23605232d507b66070ef9962472
...
(cherry picked from commit 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1)
After conversion:
git log --grep 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1 --all
commit 34eb9eb41da21f8ceb3337c049766d4acf25a09b
...
(cherry picked from commit 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1)
commit map:
3878c13c9d9cb23605232d507b66070ef9962472 34eb9eb41da21f8ceb3337c049766d4acf25a09b
5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1 9e010cad1216d7bf076f2ea999024d4c5a237846
suboptimal issues:
The following commits were filtered out, but referenced in another
commit message. The reference to the now-nonexistent commit hash
(or a substring thereof) was left as-is in any commit messages:
b'443bdb1c'
b'5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1'
b'f3c9370d'
Your questions:
git fast-export --show-original-ids --no-data --all | \
grep ^original-oid.5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1$
original-oid 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1
# I added --no-data
git fast-export --show-original-ids --no-data --all | grep ^original-
oid | sort | uniq -cd
# No output
To show you the impact of topo / date ordering:
# Topo order
git fast-export --show-original-ids --no-data --all | grep "^original-oid.\(5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1\|3878c13c9d9cb23605232d507b66070ef9962472\)"
original-oid 3878c13c9d9cb23605232d507b66070ef9962472
original-oid 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1
# Date order
git fast-export --show-original-ids --no-data --date-order --all | grep "^original-oid.\(5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1\|3878c13c9d9cb23605232d507b66070ef9962472\)"
original-oid 5e8d6c5bb469e54d1f45ebb1052de5c45c0dc6a1
original-oid 3878c13c9d9cb23605232d507b66070ef9962472
Here's a simple way to reproduce that triggers the problem for me:
#!/bin/bash
rm -rf stupid
git init stupid
cd stupid
seq 1 10 >numbers
git add numbers
git commit -m Initial
git branch A
git branch B
git checkout A
seq 11 15 >>numbers
git commit -am more
git checkout B
seq 0 10 >numbers
git commit -am EarlierStart
git cherry-pick -x A
Now, after running that script to set up a repo, I find that
git fast-export --show-original-ids --no-data --all
and
git fast-export --show-original-ids --no-data --date-order --all
both output the commits in the same order, and will show the commits from branch B (which includes the cherry-pick) before the commits from branch A (which includes the commit which was cherry-picked), so that we don't have the data we need to rewrite the "cherry picked from..." message to refer to the new commit id.
In other words, --date-order doesn't help at all. The fact that it does help in your case is great but cannot be relied upon. And even if it did function as a workaround, --date-order will slow fast-export down, so a better solution would still be desirable.
I think the fix needs to go into fast-export; it probably needs an extra flag (named something like --commit-reference-order) which will make fast-export look at the commit message and take an commit references and treat them similar to parents in one way: don't show a commit until after its parents/references have first been shown.
This relates me to what I think of is the biggest flaw in git:
cherry-pick isn't recordered properly. In the standard case it isn't recorded at all, or it is only recorded in the commit message.
I would prefer a system where a cherry-pick would create a pointer to source revision in a way that a merge links to the parent. Opposed to a merge parent link which includes the complete history, the cherry-pick link would include (or mention) only the referenced commit. Then there would be a standard (and probably fast) way for tools to show which commits from one branch have been picked into another, even if the diff is not exactly the same - which it isn't in many cases when you need to pick bug fixes or even features that a customer requests in his project.
That is THE feature that I miss from SVN.
both output the commits in the same order, and will show the commits from branch B (which includes the cherry-pick) before the commits from branch A (which includes the commit which was cherry-picked), so that we don't have the data we need to rewrite the "cherry picked from..." message to refer to the new commit id.
It's likely that the order is wrong because the timestamp of each commit is the same (it is here, with a 1s resolution). After following each git commit
by a sleep 1
I'm seeing sensible orders, however there's no difference between them.
Here's an example which does have differences between the two orders. It makes use of git-svn clone to create the git repository because I couldn't quickly reproduce this with git directly, nor does it occur (in this example anyway) with e.g. svn-all-fast-export.
#!/bin/bash
rm -rf repo.svn co repo.git
svnadmin create repo.svn --compatible-version "1.7" # git-svn clone currently doesn't handle FS format 8
svn co file://`pwd`/repo.svn co
pushd co
mkdir -p trunk branches/branch
echo foo > trunk/file
cp trunk/file branches/branch/
svn add trunk branches
svn commit -m "1) Initial layout"
sleep 1 # For sanity (in case SVN is faster elsewhere and would yield same commit timestamps)
echo branch > branches/branch/file
svn commit -m "2) First change branch..."
sleep 1
echo trunk > trunk/file
svn commit -m "3) ... and then change trunk. This message refers to earlier r2."
popd
git svn clone -s file://`pwd`/repo.svn repo.git
cd repo.git
git fast-export --show-original-ids --no-data --topo-order --all >topo
git fast-export --show-original-ids --no-data --date-order --all >date
diff -u date topo | wc -l
This is of course a contrived example and while it has to make use of the way git-svn clone decides to work here (introducing an extra commit) the end result is the same as what can happen with svn to git migrations: referencing a commit in a message, before having seen the commit that's referred to. In the past using date order always fixed these commit message conversion issues for me and others as well. It (still) requires patching git-filter-branch though. Thanks to git-filter-repo that's no longer needed and more importantly a conversion's bottleneck once again is creating the git repository, and not 'merely' post-processing its commit mesages 👍 (I kept my promise to upgrade!).
Hi, you build a great tool which saved me much time!
I have the issue, that multiple commit hash references in commit message are not updated. These references get listed in
suboptimal-issues
but, the commit that is referenced still exists in the filtered repo, however with a new hash. That new hash is correctly listed in thecommit-map
file. Example:Commit message:
suboptimal-issues:
commit-map:
I ran the command:
I cannot publish the repo, but the the basic thing I'm doing is:
Is this a known problem?