newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

Remove all files from Git repo history with path having escape \ in filename with git filter-repo #427

Closed klorinczi closed 1 year ago

klorinczi commented 1 year ago

Hi,

I am trying to use git filter-repo to remove from the Git repo history all files whose path contains a backslash (\).

I have special filenames with escape \ characters stored in Git repository on Debian 10 Linux.

Problem: on Windows, git checkout fails for files whose names contain incompatible characters.

Problem reproducing steps:

    # Clone repository, to be executed on a safe repo:
    git clone --no-local /source/repo/path/ /target/path/to/repo/clone/
    # Cloning into '/target/path/to/repo/clone'...
    # remote: Enumerating objects: 9534, done.
    # remote: Counting objects: 100% (9534/9534), done.
    # remote: Compressing objects: 100% (4776/4776), done.
    # remote: Total 9534 (delta 4215), reused 8043 (delta 3136), pack-reused 0
    # Receiving objects: 100% (9534/9534), 7.41 MiB | 16.78 MiB/s, done.
    # Resolving deltas: 100% (4215/4215), done.

    cd /target/path/to/repo/clone/

    # List the files with escape \ from repo history into a list file:
    git log --all --name-only -m --pretty= '*\\*' | sort -u >/opt/git_repo_files_w_escape.txt

    # Remove the files with escape \ from repo history:
    git filter-repo --invert-paths --paths-from-file /opt/git_repo_files_w_escape.txt
    # Parsed 592 commits
    # New history written in 0.25 seconds; now repacking/cleaning...
    # Repacking your repo and cleaning out old unneeded objects
    # HEAD is now at 71128f3 .gitignore: ADD snap-git to be ignored
    # Enumerating objects: 9354, done.
    # Counting objects: 100% (9354/9354), done.
    # Delta compression using up to 8 threads
    # Compressing objects: 100% (3694/3694), done.
    # Writing objects: 100% (9354/9354), done.
    # Total 9354 (delta 4085), reused 9354 (delta 4085), pack-reused 0
    # Completely finished after 0.55 seconds.

    # List files with escape \ to check result:
    git log --format="reference" --name-status --diff-filter=A '*\\*'
    # "systemd/system/default.target.wants/snap-git\\x2dfilter\\x2drepo-7.mount"
    # "systemd/system/multi-user.target.wants/snap-git\\x2dfilter\\x2drepo-7.mount"
    # "systemd/system/snap-git\\x2dfilter\\x2drepo-7.mount"

    # Unfortunately, although filter-repo seems
    # to have run, the log still lists filenames
    # with escape \ :-(

Is it possible that this is a bug?

newren commented 1 year ago

Nope, not a bug. You fed bad input into filter-repo, based on a common but incorrect assumption about how git log works.

Look at your own output:

    $ git log --format="reference" --name-status --diff-filter=A '*\\*'
    "systemd/system/default.target.wants/snap-git\\x2dfilter\\x2drepo-7.mount"
    "systemd/system/multi-user.target.wants/snap-git\\x2dfilter\\x2drepo-7.mount"
    "systemd/system/snap-git\\x2dfilter\\x2drepo-7.mount"

Let's look at the first line as an example. If you were to store that in a file, which you pass to --paths-from-file, then git-filter-repo is going to be looking for a file named "systemd/system/default.target.wants/snap-git\\x2dfilter\\x2drepo-7.mount" to remove. You have no such file in your repository. Instead you have one named systemd/system/default.target.wants/snap-git\x2dfilter\x2drepo-7.mount. (Note that I have removed both " characters and two of the \ characters.)

The problem here is that you assumed git log would list filenames as-is, which it won't do whenever there are special characters. You can often get around this by setting core.quotepath=false (this particularly helps when you have non-ascii characters), but even that is ignored when you have backslashes.
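
To make the quoting concrete, here is a minimal Python sketch of how a quoted line from git log maps back to the real filename. The helper name unquote_git_path is hypothetical (it is not part of git or git-filter-repo), and the sketch assumes git's C-style quoting: surrounding double quotes, single-letter escapes such as \\ and \", and three-digit octal escapes for other bytes.

```python
# Hypothetical helper, not part of git or git-filter-repo.
# Assumes git's C-style path quoting: "..." with \\, \", \n, ... and \NNN octal.
_SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A, "v": 0x0B,
    "f": 0x0C, "r": 0x0D, '"': 0x22, "\\": 0x5C,
}


def unquote_git_path(line: str) -> bytes:
    """Return the real path bytes for one line of git log path output."""
    if len(line) < 2 or not (line.startswith('"') and line.endswith('"')):
        return line.encode("utf-8")  # not quoted: already the real name
    body = line[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
            continue
        nxt = body[i + 1]
        if nxt in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[nxt])  # e.g. \\ becomes one backslash
            i += 2
        else:
            out.append(int(body[i + 1:i + 4], 8))  # \NNN octal byte
            i += 4
    return bytes(out)


quoted = r'"systemd/system/snap-git\\x2dfilter\\x2drepo-7.mount"'
real = unquote_git_path(quoted)
print(real.decode())  # systemd/system/snap-git\x2dfilter\x2drepo-7.mount
```

The quoted form and the real name differ by the two double quotes and two of the backslashes, which is why a paths file containing the quoted form matches nothing.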

Here's something that might work better for you for generating the list of filenames to exclude:

    git log -z --all --name-only -m --pretty= '*\\*' | tr '\0' '\n' | sort -u >/opt/git_repo_files_w_escape.txt

but it assumes you do not have filenames with newline characters. (If you do have files with newline characters, though, then --paths-from-file won't work for you.)
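
The same post-processing can be sketched in Python, which makes the newline caveat explicit. This is only an illustration: the function names are hypothetical, and it assumes the -z output is a stream of NUL-separated paths, possibly with empty entries between commits.

```python
import io


def paths_from_nul_output(raw: bytes) -> list:
    """Split NUL-separated git output into sorted, unique, non-empty paths."""
    return sorted({p for p in raw.split(b"\0") if p})


def write_paths_file(paths, out) -> None:
    """Write one path per line, refusing names a line-based file cannot hold."""
    for path in paths:
        if b"\n" in path:
            raise ValueError(f"cannot list {path!r}: contains a newline")
        out.write(path + b"\n")


# Sample bytes standing in for the output of `git log -z --name-only ...`.
sample = (
    b"dir/snap-git\\x2dfilter.mount\0\0"
    b"dir/other\\name\0"
    b"dir/snap-git\\x2dfilter.mount\0"
)
buf = io.BytesIO()
write_paths_file(paths_from_nul_output(sample), buf)
print(buf.getvalue().decode(), end="")
```

In real use the raw bytes would come from running the git log -z command above (for example via subprocess.run with capture_output=True), and out would be the paths file opened in binary mode.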

Does that help?

klorinczi commented 1 year ago

@newren Thank you very much for pointing me to the right solution! Your solution works perfectly: it removed all files that have a backslash in the filename. You are right, it is not a bug; the git log output was just not in the right format to use as input for git filter-repo.

I also opened a bountied question for this problem on Stackoverflow: https://stackoverflow.com/questions/75150145/remove-all-files-from-git-repo-history-with-path-having-escape-in-filename-wit If you are on Stackoverflow and you post the Solution reproducing steps, I would be happy to give you the bounty.

Solution reproducing steps:

    # Clone repository, to be executed on a safe repo:
    git clone --no-local /source/repo/path/ /target/path/to/repo/clone/
    # Cloning into '/target/path/to/repo/clone'...
    # remote: Enumerating objects: 9364, done.
    # remote: Counting objects: 100% (9364/9364), done.
    # remote: Compressing objects: 100% (3706/3706), done.
    # remote: Total 9364 (delta 4088), reused 9346 (delta 4082), pack-reused 0
    # Receiving objects: 100% (9364/9364), 7.44 MiB | 22.29 MiB/s, done.
    # Resolving deltas: 100% (4088/4088), done.

    cd /target/path/to/repo/clone/

    # List the files with backslash from repo history into a list file:
    git log -z --all --name-only -m --pretty= '*\\*' | tr '\0' '\n' | sort -u >../git_repo_files_w_escape.txt

    # check the output file content
    nano ../git_repo_files_w_escape.txt

    # Remove the files with backslash from repo history:
    git filter-repo --invert-paths --paths-from-file ../git_repo_files_w_escape.txt
    # New history written in 0.60 seconds; now repacking/cleaning...
    # Repacking your repo and cleaning out old unneeded objects
    # HEAD is now at 91d7141 
    # Enumerating objects: 9362, done.
    # Counting objects: 100% (9362/9362), done.
    # Delta compression using up to 8 threads
    # Compressing objects: 100% (3739/3739), done.
    # Writing objects: 100% (9362/9362), done.
    # Total 9362 (delta 4087), reused 9305 (delta 4047), pack-reused 0
    # Completely finished after 1.22 seconds.

    # List files with backslash to check result:
    git log -z --all --name-only -m --pretty= '*\\*' | tr '\0' '\n' | sort -u
    # empty result, so history rewrite was successful!

I'm grateful for the solution, thanks again!

newren commented 1 year ago

I had created a Stack Overflow account, but it wouldn't let me comment on various answers, saying I didn't have enough reputation, even when commenting on posts touching areas where I was an expert, or even the expert. Frustrated, I just never bothered answering any questions again. But you inspired me to try to dig out my old account info and post my answer.

newren commented 1 year ago

Oh, also, it might even be easier to avoid generating the list of filenames entirely, since you can just check programmatically. Something like:

    git filter-repo --filename-callback 'return None if b"\\" in filename else filename'

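
The callback's logic can be tried out as plain Python before rewriting any history. The sketch below assumes, as I understand the git-filter-repo documentation, that the callback body runs inside a function that receives filename as a bytes object and that returning None drops the file. The wrapper name filename_callback and the sample names are for illustration only.

```python
def filename_callback(filename: bytes):
    """Same body as the --filename-callback one-liner above."""
    # b"\\" is a single backslash byte; None means "remove this path".
    return None if b"\\" in filename else filename


# Illustrative names: one with backslashes (from this thread), one without.
names = [
    rb"systemd/system/snap-git\x2dfilter\x2drepo-7.mount",
    b".gitignore",
]
kept = [n for n in names if filename_callback(n) is not None]
print(kept)  # [b'.gitignore']
```

Because the check runs on the raw bytes of each path, no quoting or unquoting of git log output is involved.
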
klorinczi commented 1 year ago

Excellent!

Thank you for the solution without exporting filenames, just programmatically replace the characters!

You could also add it to your answer on Stack Overflow.

I would welcome a question upvote, because they downvoted it 🙂