newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)
Other
8.36k stars 701 forks source link

Manually handle stashes (git-filter-repo deletes them currently) #115

Closed polarathene closed 2 months ago

polarathene commented 4 years ago

I had an existing repo and needed to correct a mistake for the author of commits:

git-filter-repo --name-callback 'return name.replace(b"OldName", b"NewName")'

That raised a helpful error:

Aborting: Refusing to overwrite repo history since this does not
look like a fresh clone.
  (expected freshly packed repo)
To override, use --force.

Not evident what the consequences are, I have some unstaged changes stashed but should otherwise be fine, so I run the command again with --force. Some of the output:

Parsed 11 commits
New history written in 0.19 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects

I guess my stashed changes were deemed "old uneeded objects"? Is there a way to run the operation without removing my stash(es)?

I had a backup so no data loss, but either the original warning should caution that the operation will be destructive elsewhere(I don't know if only stashes are at risk here), or this is a bug and my stash should have been preserved?

polarathene commented 4 years ago

The docs seem to touch on this right at the start that it is destructive. The safeguard error is then mentioned as a feature right after that.

The error in the terminal should probably re-iterate that the action is destructive, and will lose any state that isn't consistent with a fresh clone.

polarathene commented 4 years ago

Just noticed it removed my remotes as well.

newren commented 4 years ago

filter-repo has multiple checks for whether something is a fresh clone. Checking for the existence of any stashes is one of them, though in your case an earlier check triggered the warning first.

Anyway, as you noticed, the docs start out at the very beginning by pointing out it's a destructive operation, and the DISCUSSION section points out multiple reasons why it's generally a really bad idea to push the rewritten history back to the original repository. Just in case folks don't read the docs, filter-repo has multiple ways of helping alert folks to problems: (1) it has a nice little warning if you aren't running in a fresh clone to warn you that its operation is destructive: Refusing to *overwrite* repo history... (emphasis added), and (2) it removes the origin remote to help signal that it's a bad idea to push back to the original repository. Hopefully these two steps would lead folks who don't understand the implications of a history rewrite to check out the docs before attempting to do things that might make a bigger mess (such as rewrite history they don't have a backup of, or pushing a rewritten history alongside old history and watching in horror when someone merges the old and new history together resulting in not only the old bad commits still existing but duplicate copies of nearly all commits and a much harder mess to clean up). So, as far as I can tell, it's all working as expected.

It's possible you were intended to rewrite only the history of one branch or something. filter-repo, as its name implies, is designed by default to rewrite the entire repository's history, though there are flags like --partial and --refs if you didn't want a full rewrite. However, that does run the risk of mixing old and new history (see the many places "mixing" appears as warnings in the manual), so proceed with caution if you decide to go that route.

polarathene commented 4 years ago

(1) it has a nice little warning if you aren't running in a fresh clone to warn you that its operation is destructive: Refusing to overwrite repo history... (emphasis added)

In the context of the operation (I came across git-filter-repo from a stack overflow thread advising it as a better git-filter-branch and that it was officially advised by the git docs), the assumption was that yes I am rewriting the repo history, I want to update the author of all the commits due to a mistake. That's the only destructive history that was expected and implied.

I probably should have read the docs, the warning couldn't have pointed out they existed or linked to them as well. I later discovered them in this repo under the Documentation folder, but previously had only gone over the README(which only mentions the -h flag). I had tried the wiki tab and found that did nothing.

I couldn't find any mention / example of the presumably common use-case I had here and when I did later find the docs after running the command, searching up the name-callback references didn't have much about caution of that damage. The docs you provide are a large plain-text txt file instead of markdown(it's similar, but some syntax rendering issues), I don't know how many users get to that point while trying to learn how to do the operation and want to read through the docs. The initial paragraphs do touch on it, but it's off-putting at a glance by this point. Perhaps making it bold/emphasized to stand out as more important?

This is a destructive operation which should not be used lightly; it writes new commits, trees, tags, and blobs corresponding to (but filtered from) the original objects in the repository, then deletes the original history and leaves only the new.

Then this section if converted to markdown could link to the referenced section, even this could just be bold instead of the above?:

See <> for more details on the ramifications of using this tool.


I really feel that the warning in the CLI could better communicate the risk of loss than it presently does, as I honestly wasn't expecting it to remove all that it did by using --force from the warning itself. The docs are more clear of the impact to the user:

Also, it's worth noting that there is an important safety mechanism:

  • abort if run from a repo that is not a fresh clone (to prevent accidental data loss from rewriting local history that doesn't exist anywhere else). See <>.

The safety mechanism clearly wasn't that effective, it did trigger, but amounted to more of an "are you sure y/n?" style warning to me.

However, as a service to users, we would like to provide an additional safety check beyond the documentation.

That's appreciated, and I hope you consider improving the CLI warning. Be more clear that local repo data unrelated to the operation is at risk, when suggesting --force, advise a backup is made, or someway to generate an undo.

I tried a --dry-run and it indicated nothing about the damage it would do such as removing a stash(which seems to count as two additional commits parsed), or the removal of remotes. It mostly just showed how many commits were parsed, nothing about particularly useful in this case about what was going to happen, or how many commits were going to be touched etc.

In short, it's perfectly fine to use --force to override the safety checks as long as you're okay with filter-repo irreversibly rewriting the contents of the current repository.

Again, while true, the assumption here was the history being rewritten is specifically what the operation describes, no side effects elsewhere. I understand modifying the commits would modify their hash which puts it out of sync with my remotes, but I wasn't expecting my remotes to be removed or other local repo data loss such as my stashes(I did backup in my case, but had to manually port those over).

Also, it is definitely NOT okay to recommend --force on forums, Q&A sites, or in emails to other users without first carefully explaining that --force means putting your repositories' data at risk.

--force in the CLI message warning, should likewise be clear that local repository data is at risk, then reference doc section to lookup.


(2) it removes the origin remote to help signal that it's a bad idea to push back to the original repository.

While I understand the reasoning for that. There was no mention of such behaviour on this projects README or in the CLI warning or execution, so it was unexpected.

For me, it was a recent personal project with only me using it and Gitlab with a copy of it but no other services like CI setup. Naturally the change I wanted to do was one I wanted to keep and to sync back to the origin remote. Some indication of the removal would be good for UX, additionally a way to prevent removal via an option/flag would be handy.


Hopefully these two steps would lead folks who don't understand the implications of a history rewrite to check out the docs before attempting to do things that might make a bigger mess

This issue serves to prove that's not really the case. The current CLI warning was insufficient at indicating more than the expected change would be involved in a rewrite of history.

For example, I can do an interactive rebase and squash some commits while having a stash and experience no loss of the stash, only the expected changes are applied.

As mentioned earlier, I wasn't aware of the docs, when I did find them after the damage. I did a search for "stash", but it wasn't mentioned at all in the documentation. I then looked for matches on the name-callback operation to see if there was any warnings there, nothing iirc. At that point I looked for destructive I think and went over the first paragraphs in docs.

If the CLI warning instead raised awareness of my repo data unrelated to the expected change being at risk too and to consult the docs for more information, I'd definitely have done so prior to the --force suggestion (since it seemed safe based on what I was doing, if I rm -rf a location, I expect that would be removed, not anything else on my filesystem..).


It's possible you were intended to rewrite only the history of one branch or something.

No, entire repo, across all my branches was expected and fine. But removal of my stashes and my remotes was unexpected, leaving me to wonder what else was potentially touched outside of what should have been regarding the operation.

It makes me a little less trusting of the project at that point, but I'll accept I'm in the wrong for not discovering the docs sooner and going through them. I do feel the CLI UX could be improved on this matter though, hence the issue.


--partial does seem to do what I would have wanted, keeping the remotes and stashes(albeit on a test run, I have two stashes, the new fast-import(the new local master branch) and original On master:(remote tracked master branch, without modifications)). Popping the original stash, seems to pop the latest(0 index) stash on the newly modified branch instead(I created a new stash with different files and it got popped).

Presumably a bug since I don't think you can have a stash on a remote branch..?(EDIT: I think the local master branch may have kept the prior commits locally still, they were just detached, reading through the docs probably explains what happened and maybe how to fix it).

polarathene commented 4 years ago

With the duplicate stash issue(via --partial), if I delete the stash from the older branch commits locally, it releases the commits.

In one experiment I checked out the older commit series that was hanging around(right on the image), and the changes by git-filter-repo appeared detached(left on the image), something I had not seen before, I was advised to fix it by deleting the attached stash, which released those commits(or at least the GUI removed them from the visualization, no clue about internal history.

IMG_20200621_142108

newren commented 4 years ago

fast-export and fast-import don't provide a way to iterate over the reflog (and the stash is the only place where it might make sense to do so), so it's basically off-limits to filter-repo in terms of what it rewrites. That means the stashes will be connected to an old copy of the commits that were not rewritten, while branches and tags will point to the new rewritten commits. Even though the old and new have similar commit messages, other changes (such as author/committer names in your case) have occurred which changes the hashes and thus mean that the two have no history in common. Folks can end up merging the two independent histories together, and then end up with a royal mess that is much, much harder to clean up. filter-repo tries hard to prevent that kind of mistake, but --partial and --refs make it a lot easier to trigger it, as you're noticing.

newren commented 4 years ago

Anyway, thanks for the calm and well reasoned arguments about doing something more in a few places. I reopened the ticket on Saturday and started working on it, but still have some more things to do. Since I only tend to have time for filter-repo work on Saturdays, and I'll be going on vacation soon, unless I can finish it this Saturday it may be a while before I update this issue.

polarathene commented 4 years ago

so it's basically off-limits to filter-repo in terms of what it rewrites.

When I did the author replacement operation, it also affected the author of the stash... so I'm not sure what you mean by off-limits there?

That means the stashes will be connected to an old copy of the commits that were not rewritten, while branches and tags will point to the new rewritten commits.

So is it a bug in the visualizer? (GitKraken in this case) I didn't test with any tags, but the branch was updated to point to the new rewritten commits(not in the screenshot as I checked out the older commit while experimenting).

The stash though as the screenshot shows is visible on the rewritten commits on the left. In the case of changing the author, the stashes were the same on both left(new) and right(old) commits, except for the author, and the stash name(eg, On Master:(original stash) became fast-import(new stash copy)).

I just deleted a file and stashed that changed, then ran a --partial of the --name-callback as shown in my opening message(with --force). This reproduced the issue with duplicate stash, the UI has an issue with the name fast-import for the stash afaik so it only shows WIP on. I used the CLI with git stash list however and got this error:

fatal: replace depth too high for object 5a2a104aa9f73c812be9d63b18b1ff3500d028d9

Screenshot_20200623_122828

I deleted the master stash, still had the error. Then I deleted the fast-import stash, no more errors, no output produced as expected. Deleted the file again and stashed it, git stash list produces the same error again for some reason(GitKraken shows the stash as the usual On master: now though)... I'm not sure what has caused that, perhaps the history is still messed. Prior to running the git-filter-repo command, the original stash worked fine with git stash list:

stash@{0}: On master:

Here is the state of the repo visualized with the altered author commits only and a new stash made after dropping the prior two(the old commits no longer visualize after deleting it's associated stash, presumably they might still be in the repo history).

Screenshot_20200623_123356

Just tried the command again, but in reverse, got the same error with --force --partial, --force and neither. I guess the partial rewrite broke the repo in some way.

git-filter-repo --name-callback 'return name.replace(b"newName", b"oldName")'
fatal: replace depth too high for object 5a2a104aa9f73c812be9d63b18b1ff3500d028d9
Error: fast-export failed; see above

I deleted the stash. Then ran git stash list, no output as expected. Tried the git-filter-repo command again, same error, same hash. That hash is the 2nd commit "Initial Commit".


filter-repo tries hard to prevent that kind of mistake, but --partial and --refs make it a lot easier to trigger it, as you're noticing.

Well I guess that's the answer then :)

polarathene commented 4 years ago

Since I only tend to have time for filter-repo work on Saturdays, and I'll be going on vacation soon, unless I can finish it this Saturday it may be a while before I update this issue.

No worries, no rush for it. I am aware of the issue myself as documented here :sweat_smile: Just wanted to give a heads up for helping other naughty users that don't read the start of the documentation before using, so they can hopefully avoid surprises/confusion.

Marcono1234 commented 3 years ago

For reference, the current error for non-fresh clones is:

Aborting: Refusing to destructively overwrite repo history since this does not look like a fresh clone. (expected freshly packed repo) Please operate on a fresh clone instead. If you want to proceed anyway, use --force.

I also misunderstood this and assumed that by "destructive" it means that hashes of edited commits will change, not that unrelated things are deleted. It would be good to explicitly list the destructive actions performed to prevent any confusion, and to only recommend that --force should be used for cases where the status of the repository was incorrectly detected.

jeiea commented 2 years ago

I faced this alert today. It makes me to want to know why my repo is not a fresh clone. I think the error message should have more detail, especially what commits can be lost.

repo> git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
repo> git branch -a
* main
  puppeteer
  remotes/origin/main
  remotes/origin/puppeteer
repo> git log --oneline --graph --decorate --all | select -first 7
* 08c464c (origin/puppeteer, puppeteer) stash:
| * fac417e (HEAD -> main, origin/main) feat:
| * b4c63e2 feat:
| * b3d3180 feat:
| * bc44c3f feat:
|/
* d266d56 feat:
repo> git stash list
repo> git filter-repo --path deno
Aborting: Refusing to destructively overwrite repo history since
this does not look like a fresh clone.
  (expected freshly packed repo)
Please operate on a fresh clone instead.  If you want to proceed
anyway, use --force.
Clumsy-Coder commented 2 years ago

@polarathene There is a way to recover stashes (that seems to be not visible from git stash).

I needed to rewrite all my commit history author email using the command git filter-repo --force --mailmap my-mailmap Got it from https://stackoverflow.com/a/58263677/3053548

After I ran it, my stashes were gone. I needed to recover them.

in the repo, theres a file .git/filter-repo/ref-map. That file contains commit ID before you ran git filter-repo command and commit ID after you ran it.

the format as goes <old commit ID> <new commit ID> /refs/stash

You can still access the detached stash using the new commit ID. git stash show <new commit ID>

Since stashes are a type of commit. It should still be there.

newren commented 2 years ago

So, a possible idea here:

Could also potentially do this for other reflogs...

newren commented 2 months ago

So, a possible idea here:

  • Parse the $GIT_DIR/log/refs/stash file, before filtering.
  • Add all refs (2nd column) to the arguments passed to fast-export
  • After filtering, look at the commit mapping and rewrite all the commits in columns 1 and & of $GIT_DIR/log/refs/stash

Implemented in 5cfd52f02514 (filter-repo: rewrite the stash too, 2024-07-31).

Could also potentially do this for other reflogs...

Didn't do that; I think handling it for stash is enough.