newren / git-filter-repo

Quickly rewrite git repository history (filter-branch replacement)

[Question] Repo size not shrinking after using --invert-paths #573

Closed KoningLeon closed 1 month ago

KoningLeon commented 5 months ago

In the past, my team stored reports, including their data, in our Azure DevOps Git repo, which grew it to 13.2 GB. Thankfully we've seen the light and changed our ways last year, so the repo hasn't contained any reports with data for a while now. I wanted to use your tool to also remove the history of those files, for the sake of both repo size and security. Unfortunately, I haven't been able to reduce the size of the pack files so far. I must admit I am far from a Git guru, so assume my knowledge is very limited :)

What I've done:

  1. Clone the repo from Azure DevOps to a local machine using git clone --mirror.
  2. Run python $gfr --invert-paths --path-glob '*/cache.abf' and python $gfr --invert-paths --path-glob '*.pbix' (these are the file types that hold the data).
  3. Clone the bare repo locally to a new folder so I can inspect the results. When I browse the Git history, the commits with the mentioned files do seem to be gone, but the pack file is still 13.2 GB.
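
One way to compare the pack size before and after a rewrite is to total the .pack files under objects/pack. The following is only a minimal sketch: the function name total_pack_bytes is hypothetical, and it assumes you pass the Git directory itself (the bare or mirror clone's folder, or the .git folder of a normal clone).

```python
from pathlib import Path


def total_pack_bytes(git_dir):
    """Sum the sizes of all *.pack files under <git_dir>/objects/pack.

    git_dir is assumed to be the Git directory itself: the folder of a
    bare/mirror clone, or the .git folder of a regular clone.
    """
    pack_dir = Path(git_dir) / "objects" / "pack"
    if not pack_dir.is_dir():
        return 0  # no pack directory found at this location
    # Only count the pack data files, not the accompanying .idx files.
    return sum(p.stat().st_size for p in pack_dir.glob("*.pack"))
```

Calling it on the mirror clone before and after running git-filter-repo, for example total_pack_bytes("myrepo.git") with a hypothetical folder name, shows whether the rewrite changed the packs in that clone at all.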

The logging from your tool gives me the impression that any old and unneeded files are cleaned up before repacking, but maybe I've missed some flag or Git command I'm supposed to run.


newren commented 4 months ago

It sounds like you took a guess at what was taking up space and removed some files, but a lot of your space is in other files. Run python $gfr --analyze from your project, and look at the files in the created $GIT_DIR/filter_repo/analysis report directory after the run. It should tell you what is large.

KoningLeon commented 4 months ago

I knew which files were the problem because they are the only ones that hold data. The repo now only takes up 80 MB after we removed the troublesome files. Somehow, though, that isn't reflected in the .pack file shrinking.

I did, however, take your advice and ran the analyze command, and I might have found something that could explain why the pack isn't shrinking. Some large files still show up in path-all-sizes as <present> even though the files and folder are no longer part of the repo.

And the same goes for directories-all-sizes. The marked folders are no longer part of the repo, yet they are still marked as <present>.
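
To pull the rows still marked as present out of a path-all-sizes report, a small parser along these lines may help. It is a sketch under a stated assumption about the layout: each data line is taken to have four whitespace-separated fields (unpacked size, packed size, deletion date or <present>, path). Check that against your own report before relying on it. The function name present_paths is hypothetical.

```python
def present_paths(report_text, min_packed_bytes=0):
    """Return (packed_size, path) pairs for rows marked '<present>'.

    ASSUMED layout per data line (verify against your own report):
        <unpacked size> <packed size> <date deleted or '<present>'> <path>
    Lines that do not fit this shape (headers, blanks) are skipped.
    """
    rows = []
    for line in report_text.splitlines():
        # Split into at most four fields so paths containing spaces survive.
        fields = line.strip().split(None, 3)
        if len(fields) != 4:
            continue
        unpacked, packed, deleted, path = fields
        if not (unpacked.isdigit() and packed.isdigit()):
            continue  # header or other non-data line
        if deleted == "<present>" and int(packed) >= min_packed_bytes:
            rows.append((int(packed), path))
    # Largest packed size first.
    return sorted(rows, reverse=True)
```

Reading the report file into a string and calling present_paths(text, 10_000_000) would list only the paths that still exist in some commit and take at least about 10 MB packed, which is the set worth comparing against the globs passed to --path-glob.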

KoningLeon commented 4 months ago

Managed to get the desired result by doing:

  1. git clone --depth 2000
  2. Delete the entire Repo/PowerBi folder
  3. Run the git-filter-repo as per my original post
  4. Place back the Repo/PowerBI folder

This brought our repo from 13.2 GB down to 150 MB. It means losing the entire history for that specific folder, but that is a sacrifice we are willing to make.

newren commented 4 months ago

Any chance you were using CMD to run your commands? If so, the problem may be that you used single quotes (') instead of double quotes ("). If you changed your command from:

python $gfr --invert-paths --path-glob '*/cache.abf' --path-glob '*.pbix' 

to

python $gfr --invert-paths --path-glob "*/cache.abf" --path-glob "*.pbix" 

that might have fixed things for you. Apparently (as I learned in #435), the former will cause CMD to tell git-filter-repo that you want to remove files matching '*/cache.abf' and '*.pbix', which you obviously don't have any of, while the latter correctly tells git-filter-repo that you want to remove files matching */cache.abf and *.pbix.

To my knowledge, this is unique to CMD; single quotes work fine in any other shell and don't do this crazy weirdness.
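
The effect of literal quote characters on a glob can be illustrated with a short sketch. It uses Python's fnmatch module as a stand-in for glob matching; whether git-filter-repo matches in exactly this way is an assumption here, and the paths and the helper name paths_matching are made up for illustration.

```python
from fnmatch import fnmatchcase


def paths_matching(paths, pattern):
    """Return the paths that match a glob pattern (case-sensitive)."""
    return [p for p in paths if fnmatchcase(p, pattern)]


# Hypothetical paths, for illustration only.
paths = [
    "Repo/PowerBI/sales.pbix",
    "Repo/PowerBI/.pbi/cache.abf",
    "README.md",
]

# Pattern as received when the shell strips the quotes.
print(paths_matching(paths, "*.pbix"))       # ['Repo/PowerBI/sales.pbix']
print(paths_matching(paths, "*/cache.abf"))  # ['Repo/PowerBI/.pbi/cache.abf']

# Pattern as received when the quotes are passed through literally.
print(paths_matching(paths, "'*.pbix'"))     # []
```

With the quotes left in, the pattern only matches names that themselves begin and end with a single quote, so nothing is selected for removal and the rewrite leaves the large files in place.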

KoningLeon commented 4 months ago

No, I was using the PowerShell terminal from within VS Code.

newren commented 4 months ago

Well, in that case, I'd suggest adding a --debug flag to your command so we can see what git-filter-repo actually saw; I have no idea whether VS Code did some weird interpretation of its own. It'd also be nice to see the large paths from the --analyze report both before and after you run git-filter-repo with the --debug flag.

That said, it sounds like you did find a solution, so if you don't want to debug further that's fine. But if you'd like to know what happened, the --debug output is the next piece of output I'd need.

newren commented 1 month ago

No further response, so I'll close this out. I'm glad you found a solution. If you would like to dig further, feel free to reopen and provide the other bits of info I requested.