rtyley / bfg-repo-cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala
https://rtyley.github.io/bfg-repo-cleaner/
GNU General Public License v3.0

Not all files removed #221

Open richardgavel opened 7 years ago

richardgavel commented 7 years ago

I am running BFG with a specific filename (the real use case involves multiple wildcards, but I'm trying to narrow down the issue). My understanding of how BFG works is that all references to the blob are removed from historical tree objects (I know it goes through commits, but those commits just point to tree objects). Those blob objects then become unreachable and eligible for garbage collection.

When I run BFG, the report lists 73 unique blobs, but when I run git fsck --unreachable, there are only 25 unreachable blobs (along with the unreachable tags, commits, trees that are the original versions of those objects that were rewritten).

So why is BFG saying it deleted a file when it's still reachable?
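The comparison described above (blobs listed in the BFG report versus blobs that actually became unreachable) can be checked mechanically. The following is a minimal sketch, assuming the output of git fsck --unreachable has been captured as text with lines of the form "unreachable <type> <id>"; the function name and the sample IDs are hypothetical.

```python
from collections import Counter


def count_unreachable(fsck_output):
    """Tally object types from captured `git fsck --unreachable` output.

    Only lines shaped like 'unreachable <type> <id>' are counted;
    anything else (warnings, progress text) is ignored.
    """
    counts = Counter()
    for line in fsck_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "unreachable":
            counts[parts[1]] += 1
    return counts


# Hypothetical sample output, for illustration only.
sample = (
    "unreachable blob 1111111111111111111111111111111111111111\n"
    "unreachable commit 2222222222222222222222222222222222222222\n"
    "unreachable blob 3333333333333333333333333333333333333333\n"
)
counts = count_unreachable(sample)
```

If counts["blob"] comes out lower than the number of blobs in the BFG report, some ref is still keeping the remaining blobs reachable.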

javabrett commented 7 years ago

Is the problem by any chance reproducible on a public repo?

Can you give any BFG output/logging, and commentary on files/blobs you expected to be deleted but remained reachable?

richardgavel commented 7 years ago

I think I finally figured out the issue. Part of the pain was that the repo shrank properly when BFG was run against a mirror clone, but not when it was run against the original repository on the file system.

I basically ran git rev-list --objects for each ref under /refs and redirected the output to a file. Then I searched through all of the output for the objects I thought should no longer be there. The common thread is that all the refs that can still reach those objects are in /refs/pull-requests.
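The search described here could be scripted roughly as follows. This is a sketch, assuming each ref's rev-list output has already been captured as text, where every line starts with an object ID and is optionally followed by a path. The function names, the short placeholder IDs, and refs/heads/master are hypothetical.

```python
def objects_from_rev_list(output):
    """Collect object IDs from captured rev-list output.

    Each non-empty line is '<id>' (commits) or '<id> <path>' (trees, blobs).
    """
    return {
        line.split(None, 1)[0]
        for line in output.splitlines()
        if line.strip()
    }


def refs_reaching(targets, listings):
    """Return {ref: set of target IDs still reachable from that ref}.

    `listings` maps a ref name to its captured rev-list output.
    """
    wanted = set(targets)
    hits = {}
    for ref, output in listings.items():
        found = objects_from_rev_list(output) & wanted
        if found:
            hits[ref] = found
    return hits


# Hypothetical data: 'bbbb' stands in for a blob that should be gone.
listings = {
    "refs/heads/master": "aaaa\ncccc src/Main.scala\n",
    "refs/pull-requests/1395/merge": "dddd\nbbbb big/file.bin\n",
}
hits = refs_reaching(["bbbb"], listings)
```

Here hits names only the pull-request ref, which matches the pattern found in the real repository.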

However, that didn't explain why they got purged against the mirror clone, until I noticed that while those refs were in packed-refs in the clone, they weren't in the packed-refs for the original file system. Is it possible that the refs you're getting via the jgit library are going straight to the packed refs and ignoring the individual /refs files?

Another possibility, too. I took a look at one of the ref files in question (/refs/pull-requests/1395/merge), and its content is "ref: stash-refs/pull-requests/1395/merge". A ref that points to another file. That's another difference from the clone: the clone follows the pointer and packs the actual commit ID.
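To make the two observations concrete (refs present as individual files but absent from packed-refs, and ref files that point at another ref), here is a small sketch that works on file contents only. It assumes the usual packed-refs layout: "#" header lines, "<id> <refname>" entries, and "^<id>" lines carrying peeled tag targets. The function names and the placeholder IDs are hypothetical.

```python
def parse_packed_refs(text):
    """Map ref name -> object ID from packed-refs content.

    Header lines ('#') and peeled-tag lines ('^') are skipped.
    """
    refs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        object_id, name = line.split(" ", 1)
        refs[name] = object_id
    return refs


def classify_loose_ref(content):
    """Return ('symbolic', target) for 'ref: ...' files, else ('direct', id)."""
    content = content.strip()
    if content.startswith("ref:"):
        return ("symbolic", content[len("ref:"):].strip())
    return ("direct", content)


# Placeholder IDs; the loose ref content is the one quoted in the comment.
packed_sample = (
    "# pack-refs with: peeled fully-peeled sorted\n"
    + "a" * 40 + " refs/heads/master\n"
    + "b" * 40 + " refs/tags/v1\n"
    + "^" + "c" * 40 + "\n"
)
loose = {
    "refs/pull-requests/1395/merge": "ref: stash-refs/pull-requests/1395/merge\n",
}

packed = parse_packed_refs(packed_sample)
only_loose = sorted(set(loose) - set(packed))
kind, target = classify_loose_ref(loose["refs/pull-requests/1395/merge"])
```

Refs listed in only_loose are the ones a tool reading only packed-refs would never see, and a "symbolic" result marks a ref whose commit ID has to be looked up through its target.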

javabrett commented 7 years ago

> Is it possible that the refs you're getting via the jgit library are going straight to the packed refs and ignoring the individual /refs files?

Yes, IIRC some BFG JGit calls do this; that is, they traverse the packs, for efficiency. I suppose that's why the instructions specify performing a mirror clone first, which ensures everything is already packed. Probably not enough weight is placed on this being required for correctness. You could also gc or repack your local repo.

> A ref that points to another file.

Issues with pointer refs and things like notes wouldn't surprise me; I think these have been noted previously. Ref structures created by management systems, such as pull-request refs, can also cause issues by retaining history.

richardgavel commented 7 years ago

Yeah, we've got the entire history of pull requests in that refs folder. We could manually rewrite them all using the log output of BFG, but I think we're just going to save the old repo for historical purposes and use the shrunken repo to start fresh.
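For anyone who does want to try the manual rewrite, the idea could look roughly like this. It is only a sketch: it assumes the BFG log output can be reduced to a two-column text mapping of old commit ID to new commit ID (that layout is an assumption here, not something confirmed in this thread), and every name and ID below is hypothetical.

```python
def parse_id_map(text):
    """Parse lines of '<old-id> <new-id>' into a dict (assumed layout)."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


def remap_refs(refs, id_map):
    """Return refs with rewritten IDs; refs whose commits were untouched stay as-is."""
    return {name: id_map.get(object_id, object_id) for name, object_id in refs.items()}


# Hypothetical mapping and refs, using short placeholder IDs.
id_map = parse_id_map("old1 new1\nold2 new2\n")
refs = {
    "refs/pull-requests/1395/merge": "old1",
    "refs/pull-requests/1396/merge": "keep",
}
remapped = remap_refs(refs, id_map)
```

The result only computes where each ref should point. Actually updating the refs, and resolving any refs that point at other refs first, would still have to be done in the repository itself.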
