newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

preserve git notes #22

Open glensc opened 4 years ago

glensc commented 4 years ago

Given a simple "remove this file from the repo" operation, the git notes get discarded:

$ git filter-repo --invert-paths --path-glob 'Auth*.php'
glensc commented 4 years ago

for now, I've found that I can recover the notes from the replace refs:

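git for-each-ref refs/replace/ --format='%(objectname) %(objecttype) %(refname:lstrip=2)' \
| while read -r new type old; do
    # only commits are expected under refs/replace/; stop on anything else
    if [ "$type" != "commit" ]; then
        exit 1
    fi

    # move the note from the old commit id to its replacement
    git notes copy "$old" "$new"
    git notes remove "$old"
done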
newren commented 4 years ago

Right, the notes aren't discarded, they simply continue to refer to the old commit ids, i.e. they are not connected to the rewritten history. This is a known issue that I had been tracking in my personal todo list, which I probably should have opened a ticket for. Thanks for opening it, and for providing a workaround for others.

Challenges here (mostly notes to self):

Additionally:

Anyway, a proper fix probably requires a fair amount of git surgery to the notes, revision walking, fast-export, and fast-import areas of the code (and I'm not that familiar with the former two). And I'm not yet sure on the exact structure that should take.

galsi commented 4 years ago

Hi, I am new to Git scripting. How do I ignore commits with no notes?

glensc commented 4 years ago

@galsi depending on the error you get: if you're using my script from above, you can just ignore the error from git notes copy and skip git notes remove when the copy fails:

git notes copy $old $new && git notes remove $old
galsi commented 4 years ago

Hi, thanks for the quick answer. I am getting the following error: missing notes on source object 0023c2404229f355868857b3ae7bcdec33f7a9c6. Cannot copy. Is there an option to select only the commits with notes? This could save time, as I am working on a repo with more than 125000 commits.
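
One way to touch only commits that actually have notes is to start from `git notes list`, which with no arguments prints one `<note-object> <annotated-object>` pair per note, and then look up a replace ref for each annotated commit. This is a minimal sketch, assuming the replace refs written by filter-repo are still present (the workaround above assumes the same) and that the notes live in the default notes ref. The function name is made up for illustration.

```shell
# Hypothetical helper: move notes only for commits that have a note
# AND a replacement recorded under refs/replace/.
copy_notes_to_replacements() {
    git notes list | while read -r note old; do
        # Skip annotated commits that were not rewritten.
        new=$(git rev-parse --verify --quiet "refs/replace/$old") || continue
        # Remove the old note only if the copy succeeded.
        git notes copy "$old" "$new" && git notes remove "$old"
    done
}
```

This still runs two git commands per note, so it only helps when far fewer commits have notes than were rewritten.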

uqs commented 4 years ago

Hi folks, I'm investigating the use of this, but for a repo with about 350k commits that need their notes moved, the workaround is simply too slow. I reckon it would take about 24h to complete.

woopla commented 4 years ago

@uqs same here. It took around 24h for the same amount of commits, so your estimate is correct. I thought I'd be smart and xargs -P8 the whole thing, but there are locks in git notes add that make this a pain to handle...

glensc commented 4 years ago

Maybe a quicker solution (though still a workaround) would be for git-filter-repo to create a map file that could be applied in bulk, instead of two git command invocations for each replaced commit. 💸

newren commented 4 years ago

I started some work on an actual solution about a month ago, involving changes to both core git and git-filter-repo. The majority of the work is in core git, and the current work in progress can be seen on my fast-export-notes branch of the newren/git repo on github (and by work in progress I mean it won't work for anyone yet, it just has some useful changes). As filter-repo is almost completely maintained in my free time, I'll finish it when I finish it; probably in a month or two and make it part of the 2.29.0 release (i.e. not the upcoming 2.28.0 release).

That all said, git-filter-repo already creates a map file at the end of its run, and it recently became documented: see #117.

woopla commented 4 years ago

@glensc nice suggestion! I used the following to go from 24h to 1h runtime:

tail -n +2 .git/filter-repo/commit-map | git notes --ref=cvs copy --force --stdin
tail -n +2 .git/filter-repo/commit-map | awk '{print $1}' | git notes --ref=cvs remove --ignore-missing --stdin

Some of that time might actually be the 'removing note for object' messages that can be improved by redirecting to /dev/null.

ymartin59 commented 3 years ago

I'm having trouble with the latest git filter-repo and git 2.29. A simple "--path" run has discarded refs/notes/commits and its related tree. Is there a reason for this behaviour, and is there a workaround?

glensc commented 3 years ago

@ymartin59 did you even read the comments?

the second comment explains the recovery method: https://github.com/newren/git-filter-repo/issues/22#issuecomment-558693518

Fantabrain commented 3 years ago

@ymartin59 did you even read the comments?

the second comment explains the recovery method:

* [#22 (comment)](https://github.com/newren/git-filter-repo/issues/22#issuecomment-558693518)

@glensc You've brushed off the complaint from @ymartin59 without understanding the issue.

Your workaround assumes that the notes are still present under refs/notes/commits, just targeting the wrong commit IDs. However, this will not be the case if filter-repo was run with --path to include only certain paths in the filtered repo. The reason is that notes live in their own commit history under refs/notes/commits, where each note is stored as a file whose name is the ID of the annotated commit. Such files are excluded by --path, so the note commits are discarded entirely.

I just ran git filter-repo --path someFile and there is literally no refs/notes/commits ref at all in the new repo.

Being that I have only a few notes, I thought I could correct this by redoing the filtration (from a new clone of the original) to include the relevant commit IDs as paths, such as:

git filter-repo --path 20e60aa8caf74c9ca4a4207ad7924f6aec0989b9 --path someFile

However, this for some reason did not work and there is still no refs/notes/commits chain in the new repo. [Edit: this was because git clone didn't even copy it; see next comment]

As such, there is no point to even run your script [where these issues are not handled first] because it will inevitably fail because the notes are actually totally gone from the repo.

I have no idea how to fix this, other than maybe filter paths by exclusion instead of inclusion, which is tedious when you're trying to make a new repo out of only one file from a parent repo with many files, deleted files, etc.

Fantabrain commented 3 years ago

Further to what I noted to @glensc about refs/notes/commits potentially being missing due to filter-repo being run with --path (and without --invert-paths), there is another reason the note commits may be totally missing.

git clone normally does not copy notes. It will if you use --mirror, but this also implies --bare, which is often not what you want. To overcome this, you can run the following before filter-repo (the refspec is quoted so the shell does not try to expand the globs):

git fetch origin 'refs/notes/*:refs/notes/*'

Unfortunately, this causes the fresh clone check to fail, so then you have to use --force to filter-repo, which creates a potential for data loss from user error. I've noted this as a separate issue #254.

Once you have a clone that does contain the refs/notes/commits chain, then you can use the script from @glensc to transfer the notes onto the rewritten commits.

glensc commented 3 years ago

You can add this to your global git config. I believe it fetches notes and leaves the clone "pristine".

run git config -e --global and insert this:

[remote "origin"]
    fetch = +refs/notes/*:refs/notes/*
    fetch = +refs/pull/*/head:refs/remotes/origin/pr/*
    fetch = +refs/merge-requests/*/head:refs/remotes/origin/mr/*

As an aside, how does git-filter-repo decide that a repo is clean? Does it inspect the reflog? Perhaps do the reverse and clear the reflog?

uqs commented 3 years ago

@Fantabrain the notes objects might be under a subdir hierarchy. Please check with git ls-tree refs/notes/commits, or just try all of the following paths.

20e60aa8caf74c9ca4a4207ad7924f6aec0989b9
20/e60aa8caf74c9ca4a4207ad7924f6aec0989b9
20/e6/0aa8caf74c9ca4a4207ad7924f6aec0989b9
Fantabrain commented 3 years ago

@uqs thank you for the suggestion. What I ended up doing was just getting a list of all files from filter-repo --analyze, removing from it the desired files including all the notes, and then passing it to filter-repo as the list of files to remove.

I did this before discovering that the clones didn't even contain refs/notes/commits, so I was already using subtractive path mode before anything would have worked. But I highly suspect that using this for each note would have also worked:

--path 20e60aa8caf74c9ca4a4207ad7924f6aec0989b9

I really don't think you'd need notation such as 20/e6/0aa8... or indeed that this would even work.

If you look at the paths in refs/notes/commits even just via gitk --all, you see they're just the 40 digit hashes at the root level. They don't appear to use the aa/bb/ccdd... scheme that the storage back-end uses to store objects physically. These are basically user file paths within the repo (stored the same way user files are stored), so that scheme probably has no benefit given how Git deals with such paths. That scheme is mainly intended to overcome limitations and inefficiencies of some physical filesystems, which wouldn't apply here.

uqs commented 3 years ago

That just means you don't have many notes objects. See this for what happens when you have 400k notes or so: https://cgit.freebsd.org/src/tree/?h=refs/notes/commits

The code that creates these additional levels is here: https://sourcegraph.com/github.com/git/git/-/blob/notes.c#L497-535

But I guess it's all moot anyway.
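
To make the fanout discussion concrete, here is a small sketch that prints the three candidate paths listed earlier for a given commit ID, i.e. fanout levels 0, 1 and 2. The helper name is hypothetical, and very large notes trees may fan out further than two levels, so this is illustrative rather than exhaustive.

```shell
# Hypothetical helper: print the paths under which the note for commit $1
# may be stored, for fanout levels 0, 1 and 2.
note_path_candidates() {
    id=$1
    a=$(printf '%s' "$id" | cut -c1-2)   # first two hex digits
    b=$(printf '%s' "$id" | cut -c3-4)   # next two hex digits
    printf '%s\n' "$id"
    printf '%s/%s\n' "$a" "$(printf '%s' "$id" | cut -c3-)"
    printf '%s/%s/%s\n' "$a" "$b" "$(printf '%s' "$id" | cut -c5-)"
}
```

Rather than guessing, the paths actually in use can be listed with `git ls-tree -r --name-only refs/notes/commits`.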

Fantabrain commented 3 years ago

@uqs Ah, ok, I would not have expected that, but I guess maybe Git's internal handling of repo paths does have its own bottlenecks on very large directories. Or else I'm not sure why they would have done that, given the added complexity. Since these paths are in permanent commits in refs/notes/commits, I don't even want to think about what happens when the "fanout" changes; either a commit that moves all the paths, or having to account for different notes being on different fanouts.

newren commented 3 years ago

Yeah, notes are a bit of a problem; they really should be handled differently as special objects because they have paths that look like hashes with a semi-random number of slashes inserted to fan them out into directories.

I suspect the easiest workaround to the --path issue would be to (1) do filtering as normal, then (2) from the filtered repository manually fetch the notes from the original repository (git fetch original_url refs/notes/*:refs/notes/*; this should be safe from the normal risks of mixing old and new history due to the fact that notes have a completely independent commit history from normal commits; the only tie between notes and the normal history is that the filenames in the notes history refer to commits from the normal history), and then finally (3) use one of the workarounds suggested above by @glensc or @woopla to move the notes onto the rewritten commits.
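
Steps (2) and (3) can be sketched as a single function, to be run inside the already-filtered repository. It assumes that filter-repo left its map at `filter-repo/commit-map` inside the git directory and that the notes live in the default notes ref. The function name `restore_notes` and its argument are invented for illustration, and the copy and remove commands are the bulk ones posted earlier in this thread.

```shell
# Hypothetical helper: reattach notes after filtering.
# $1 is the original (unfiltered) repository path or URL.
restore_notes() {
    original=$1
    map="$(git rev-parse --git-dir)/filter-repo/commit-map"

    # (2) Fetch the notes history; it is independent of the normal history.
    git fetch "$original" 'refs/notes/*:refs/notes/*' || return 1

    # (3) Copy each note from the old commit ID to the new one in bulk
    # (the first line of the map is a header, hence tail -n +2)...
    tail -n +2 "$map" | git notes copy --force --stdin || return 1

    # ...then drop notes still keyed by old IDs, leaving unchanged commits alone.
    tail -n +2 "$map" | awk '$1 != $2 {print $1}' |
        git notes remove --ignore-missing --stdin
}
```

Usage would be something like `restore_notes ../orig.bare` after `git filter-repo` has finished.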

Sorry for taking so long to get back to this issue. Merge machinery, rename detection, directory traversal stuff, etc., etc. have all taken way more time than expected...

GrantEdwards commented 10 months ago

I've read through all of the comments, and I'm still stumped. I can't figure out if there is a solution to this problem or not. Much of it was a little over my head, so feel free to point me to the right documentation.

I'm doing a

git filter-repo --subdirectory-filter $Dir

That appears to work as expected:

+ git filter-repo --subdirectory-filter dirname1
Parsed 1201 commits
New history written in 0.91 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at b2da92c Remove library when cleaning
Enumerating objects: 251, done.
Counting objects: 100% (251/251), done.
Delta compression using up to 8 threads
Compressing objects: 100% (129/129), done.
Writing objects: 100% (251/251), done.
Total 251 (delta 122), reused 183 (delta 117), pack-reused 0
Completely finished after 1.07 seconds.

If I do a git log immediately before doing the filter-repo, every commit has a note as expected.

After the filter-repo, git log shows no notes.

I've tried the original work-around script, and all it does is print errors like this for every iteration of the loop:

error: missing notes on source object 8f278b90f20b19f40edc0a74fc7543757a89c622. Cannot copy.
Object 8f278b90f20b19f40edc0a74fc7543757a89c622 has no note

I've also tried this

git fetch $SrcRepo refs/notes/*:refs/notes/*

And that appears to work OK

+ git fetch ../orig.bare 'refs/notes/*:refs/notes/*'
remote: Enumerating objects: 2310, done.
remote: Counting objects: 100% (552/552), done.
remote: Compressing objects: 100% (414/414), done.
remote: Total 2310 (delta 285), reused 0 (delta 0), pack-reused 1758
Receiving objects: 100% (2310/2310), 221.52 KiB | 14.77 MiB/s, done.
Resolving deltas: 100% (1309/1309), done.
From ../apps-common.bare
 * [new ref]         refs/notes/commits -> refs/notes/commits

But there are still no notes when I do a git log, because I assume I'm missing the suggested step "(3) using the workarounds suggested above by @glensc or @woopla", but I don't know what workarounds are referred to in step 3.

GrantEdwards commented 10 months ago

Of course minutes after posting, upon reading through the comments again, it became glaringly obvious that I needed to do both: fetch the refs/notes, then run the recovery script. I don't know how I didn't grok that the first couple times I read things...

hawicz commented 10 months ago

Fyi, here's an improvement to @woopla's commands from 2020, above, which avoids dropping notes for commits that didn't get re-written:

tail -n +2 filter-repo/commit-map | git notes copy --force --stdin
tail -n +2 filter-repo/commit-map | awk '$1 != $2 {print $1}' | git notes remove --ignore-missing --stdin
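
Before running the bulk copy, it can help to preview which pairs will actually move a note. The sketch below prints only the rows where the ID changed, and it also skips rows whose new ID is all zeros, on the assumption that filter-repo records pruned commits that way. I have not checked how `git notes copy --stdin` treats such rows, so treat that filter as a precaution. The helper name is hypothetical.

```shell
# Hypothetical helper: print "old new" pairs from a commit-map ($1) that are
# worth feeding to `git notes copy --stdin`.
rewritten_pairs() {
    # Skip the header line, unchanged commits, and all-zero (pruned) targets.
    tail -n +2 "$1" | awk '$1 != $2 && $2 !~ /^0+$/ {print $1, $2}'
}
```

For example, `rewritten_pairs filter-repo/commit-map | git notes copy --force --stdin` would then copy notes only for commits that were rewritten and kept.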