newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

Contents miss after --subdirectory-filter on huge repo #502

superyyrrzz closed 1 year ago

superyyrrzz commented 1 year ago

I am on Windows 11 + Python 3.11 + git 2.41. --subdirectory-filter works well for an average-size repo:

python3 d:\git-filter-repo --subdirectory-filter {folder}
Parsed 10289 commits
New history written in 4.46 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 4261bf1f Merge pull request #xxxx from xxxx
Enumerating objects: 3161, done.
Counting objects: 100% (3161/3161), done.
Delta compression using up to 8 threads
Compressing objects: 100% (1986/1986), done.
Writing objects: 100% (3161/3161), done.
Total 3161 (delta 1309), reused 2870 (delta 1090), pack-reused 0
Completely finished after 10.65 seconds.

However, when I ran this on a huge repo (like https://github.com/MicrosoftDocs/azure-docs, ~30GB), it failed to rewrite the git history, even though the output said "Completely finished":

python3 d:\git-filter-repo --subdirectory-filter articles\api-center
Parsed 1252380 commits
New history written in 2505.65 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Updating files: 100% (88087/88087), done.
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Writing objects: 100% (1/1), done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Completely finished after 2577.93 seconds.

After that, my repository folder was empty, with only .git remaining. The repo size dropped from ~30GB to ~100MB because all contents were missing.

Comparing these two logs, something clearly went wrong, since the second one shows Enumerating objects: 1, done.

newren commented 1 year ago

python3 d:\git-filter-repo --subdirectory-filter articles\api-center

Odds are relatively high that Git Bash, or whatever shell you are using, is going to pass --subdirectory-filter and articlesapi-center (note the missing separator between articles and api-center) to git-filter-repo, so git-filter-repo has no way of knowing what you actually intended. You could verify by adding --debug to the command.
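To illustrate the backslash handling, here is a rough sketch using Python's shlex.split, which follows POSIX shell quoting rules. It only approximates what a Bash-like shell does with the argument; the shell actually in use may behave differently.

```python
import shlex

# Command line as typed, with an unquoted backslash in the path.
typed = r"git-filter-repo --subdirectory-filter articles\api-center"

# Under POSIX rules an unquoted backslash escapes the next character
# and is then dropped, so the separator disappears.
args = shlex.split(typed)
print(args[-1])  # articlesapi-center

# Single quotes keep the backslash, but the path would still not match
# anything in git, which stores paths with forward slashes.
quoted = shlex.split(r"git-filter-repo --subdirectory-filter 'articles\api-center'")
print(quoted[-1])  # articles\api-center
```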

Even if the shell you are using did pass articles\api-center, this repository does not have a directory with that name. Instead, it has a directory named articles which has a subdirectory named api-center, thus a directory named articles/api-center. Paths need to be named the way git stores them internally, which means with a / rather than a \ (much the same way they are shown with git log --name-status or git log -p).
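If the path comes from a Windows-style string, one way to get the form git expects is to convert it before passing it along. A minimal sketch, assuming a relative path; to_git_path is a hypothetical helper name, not part of git-filter-repo:

```python
from pathlib import PureWindowsPath


def to_git_path(path: str) -> str:
    """Convert a Windows-style relative path to git's forward-slash form.

    Hypothetical helper for illustration only.
    """
    # PureWindowsPath parses backslash separators on any OS;
    # as_posix() renders the same path with forward slashes.
    return PureWindowsPath(path).as_posix()


print(to_git_path(r"articles\api-center"))  # articles/api-center
```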

Anyway, whether the shell passed articles\api-center or articlesapi-center, git-filter-repo went and removed all directories except for that one, and since that one doesn't exist, you end up with an empty repository.
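A quick sanity check before filtering can catch this kind of mistake. The sketch below assumes you already have a list of tracked paths (for example, from git ls-tree -r --name-only HEAD); subdirectory_exists and the sample file names are made up for illustration.

```python
def subdirectory_exists(tracked_paths, subdir):
    """Return True if any tracked path lives under subdir.

    Hypothetical helper: paths are compared the way git stores them,
    i.e. with forward slashes.
    """
    prefix = subdir.rstrip("/") + "/"
    return any(path.startswith(prefix) for path in tracked_paths)


# Sample listing with invented file names, standing in for real git output.
tracked = ["articles/api-center/overview.md", "articles/index.md"]

print(subdirectory_exists(tracked, "articles/api-center"))   # True
print(subdirectory_exists(tracked, r"articles\api-center"))  # False
print(subdirectory_exists(tracked, "articlesapi-center"))    # False
```

If the check returns False for the path you are about to pass, filtering on it would leave nothing behind.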

Luckily, it's an easy fix -- just reclone and replace the \ with a /.