rtyley / bfg-repo-cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala
https://rtyley.github.io/bfg-repo-cleaner/
GNU General Public License v3.0
10.83k stars 535 forks

New to BFG, running into issues #463

Open jms1voalte opened 2 years ago

jms1voalte commented 2 years ago

I just discovered BFG yesterday. I'm in the middle of migrating 450+ repos from Bitbucket to GitHub, and I need BFG because some of the repos contain huge files.

My process looks basically like this:

git clone --mirror git@bitbucket.org:acct/repo
cd repo.git
bfg --convert-to-git-lfs "*.{cfs,pdf,xlsx}" --no-blob-protection
git reflog expire --expire=now --all && git gc --prune=now --aggressive

gh repo create org/repo
git remote set-url origin git@github.com:org/repo
git lfs install
git push origin

Issues:

If it makes any difference, I'm using bfg 1.14.0 on macOS 12.3.1, installed using Homebrew.

jms1voalte commented 2 years ago

ALSO ... it would be nice if there were an option to NOT add anything to the repo. I'm perfectly capable of creating or updating a .gitattributes file afterward (especially one that will actually work, since git doesn't recognize glob patterns with {...} in them), and I don't really need a copy of the report within the repo itself, especially if they're going to be added to some random branch rather than to the repo's primary branch.

jms1voalte commented 2 years ago

I'm experimenting with the git lfs migrate command (part of the git-lfs package) ... apparently the issue with the repo's "primary branch" being changed has to do with how GitHub handles initial pushes from bare repos into an empty repo, because ...

MattMicheletti commented 1 year ago

For anyone else who comes across this issue, I thought I would address some of the points, since I had to use BFG Repo-Cleaner to migrate some repositories' history to Git LFS due to severe performance issues with the standard tooling provided by Git LFS. Some of the issues raised by the OP are legitimate issues with BFG Repo-Cleaner, but they all have acceptable workarounds for anyone who needs to use it. Other issues raised are either documented Git functionality or otherwise unrelated to BFG Repo-Cleaner. I wanted to clear up any misconceptions or misunderstandings future readers may have when coming across this issue.

The BFG documentation doesn't mention running git lfs install before pushing data into the newly created GitHub repo; however, if you don't do this, the git push command won't push anything to the LFS server, and GitHub will complain about (or fail because of) large files, as if BFG had never been run at all.

This is not the fault of BFG Repo-Cleaner. This is a standard step required for setting up Git LFS to integrate seamlessly with Git. The Git LFS download/home page covers this.

If I use git clone to make a local clone of the bare repo directory after BFG does its thing, that clone ends up being on some random branch, rather than the correct "primary" branch (i.e. master, main, etc.). The files that BFG added will be in this branch, and not in the correct "primary" branch.

This is unrelated to BFG Repo-Cleaner. Git's clone command has a flag for specifying which branch to check out (-b; see https://git-scm.com/docs/git-clone), and its documentation explicitly covers this exact point (i.e. Git's clone functionality does not guarantee that the master or main branch is checked out by default). Bare repositories also have their own semantics and eccentricities, so I would recommend reading up on those separately. As well, any time we do anything with Git, we want to make sure we have manually checked out the correct branch before doing any operation.
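To make the point about bare repositories concrete, here is a minimal sketch using only plain Git (no BFG involved). It builds a throwaway repository whose HEAD points at a branch other than main, mirror-clones it, and then repoints HEAD in the bare clone with git symbolic-ref. The directory names, the branch names develop and main, and the demo identity are all hypothetical, chosen only for illustration.

```shell
#!/bin/sh
set -eu

# Throwaway working area; every name below is hypothetical.
work=$(mktemp -d)
commit() {
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "$1"
}

# Source repo whose HEAD is "develop", with "main" one commit behind it.
git init -q "$work/src"
cd "$work/src"
git symbolic-ref HEAD refs/heads/develop
commit "first"
git branch main
commit "second"

# A mirror clone is bare: there is no working tree, but HEAD still exists
# and is copied from the source repository.
git clone -q --mirror "$work/src" "$work/repo.git"
cd "$work/repo.git"
is_bare=$(git rev-parse --is-bare-repository)
before=$(git symbolic-ref HEAD)

# Repoint HEAD at the intended primary branch.
git symbolic-ref HEAD refs/heads/main
after=$(git symbolic-ref HEAD)

echo "bare:   $is_bare"
echo "before: $before"
echo "after:  $after"
```

Checking git symbolic-ref HEAD in the mirror before running any history rewrite or push is a cheap way to see which branch Git currently treats as the default there. Whether a hosting service picks up the same branch on a first push is a separate question that this sketch does not test.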

If I create an empty GitHub repo and git push into it from the same bare repo directory after BFG does its thing, GitHub uses the same random branch as the "primary" branch, and anybody who clones the repo from there also ends up with the same bogus primary branch.

This is also unrelated to BFG Repo-Cleaner. This is standard Git functionality as git push depends on the current branch checked out. Please see https://git-scm.com/docs/git-push for more details regarding the API for git push.

The .gitattributes file that BFG creates, seems to be artificially injected into the content of an existing commit, somewhere within the parent chain of whatever "primary branch" BFG chose. Sometimes the selected commit also happens to be in the parent chain of the "correct" primary branch, but sometimes it's not.

Injecting a .gitattributes file containing the expression provided to BFG Repo-Cleaner is the intended functionality, insofar as I can tell from the code for the Git LFS support in BFG Repo-Cleaner. Granted, the expression may not be a valid .gitattributes pattern, but the functionality is operating as designed. Clearly, it needs refinement, since the expression provided can be, and often will be, far more expressive than the patterns .gitattributes accepts as valid.

As for the branch the commit is made on, it should be based on the branch of the commit. Otherwise, that is standard Git functionality when modifying commits.

Using a complex pattern like --convert-to-git-lfs '*.{csv,pdf,xlsx}' works for BFG (in that all of those files end up being converted to LFS objects), but the pattern doesn't work in a .gitattributes file. The BFG command should probably support multiple patterns in the --convert-to-git-lfs option (i.e. *.csv,*.pdf,*.xlsx), or support multiple instances of the --convert-to-git-lfs option on the command line. (And if it already does one of these, or has some other mechanism to allow multiple patterns, the documentation should be updated to explain how to do this.)

This is indeed a bug in BFG Repo-Cleaner's implementation for modifying history to support Git LFS. The expression provided to BFG Repo-Cleaner is going to be far more powerful and expressive than those supported by .gitattributes. A workaround is to run BFG Repo-Cleaner again after the Git LFS conversion, this time with the --delete-files ".gitattributes" option. That removes the problematic .gitattributes file(s) from history, so that a correct one can be inserted by hand where desired. For the insertion I recommend git filter-repo, as it's extraordinarily fast and performance oriented. See https://github.com/git-lfs/git-lfs/issues/3543#issuecomment-1019633715 for an example command that inserts a manually created .gitattributes file into the root commit(s) of your repository: git filter-repo --force --commit-callback "if not commit.parents: commit.file_changes.append(FileChange(b'M', b'.gitattributes', b'$(git hash-object -w .gitattributes)', b'100644'))"
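The mismatch between the two pattern syntaxes can be checked with plain Git. The sketch below uses a throwaway repository and a hypothetical file name (data.csv), and asks git check-attr which filter applies, first with the brace pattern and then with one line per extension. The attribute string is the one git lfs track normally writes. Nothing here runs BFG or Git LFS itself.

```shell
#!/bin/sh
set -eu

# Throwaway repo; data.csv is a hypothetical path and need not exist.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"

# Brace pattern as passed to BFG: .gitattributes has no {a,b} alternation,
# so this line only matches a file literally named "x.{csv,pdf,xlsx}".
printf '%s\n' '*.{csv,pdf,xlsx} filter=lfs diff=lfs merge=lfs -text' > .gitattributes
brace_result=$(git check-attr filter -- data.csv)

# Hand-written replacement: one line per extension, each newline-terminated.
printf '%s filter=lfs diff=lfs merge=lfs -text\n' '*.csv' '*.pdf' '*.xlsx' > .gitattributes
plain_result=$(git check-attr filter -- data.csv)

echo "$brace_result"
echo "$plain_result"
```

A file written the second way is the kind of manually constructed .gitattributes that could then be inserted into history with the workaround described above.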

The .gitattributes file doesn't have a newline at the end. I'm not sure if this alone will cause problems for git commands, but it will definitely cause problems if a later script does something like echo xxx >> .gitattributes. I'm also not sure what BFG does if a repo already contains a .gitattributes file in the root of the repo (i.e. does it overwrite the existing file, add its line to the end of the file, or ... ?)

Again, this is indeed due to how the .gitattributes file(s) is created/modified by BFG Repo-Cleaner. However, using the aforementioned workaround of inserting a manually constructed .gitattributes file should negate this problem by definition.
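For scripts that have to append to a .gitattributes file that may lack a trailing newline, a guarded append avoids the problem. This is a minimal sketch on temporary files, not on a real repository. The two rule strings are just examples of what such a script might write.

```shell
#!/bin/sh
set -eu

rule_pdf='*.pdf filter=lfs diff=lfs merge=lfs -text'
rule_csv='*.csv filter=lfs diff=lfs merge=lfs -text'

# Two stand-ins for a .gitattributes file that lacks a trailing newline.
naive=$(mktemp)
fixed=$(mktemp)
printf '%s' "$rule_pdf" > "$naive"
printf '%s' "$rule_pdf" > "$fixed"

# Naive append: the new rule is glued onto the end of the existing line.
echo "$rule_csv" >> "$naive"

# Guarded append: add the missing newline first. Command substitution strips
# a trailing newline, so the result is non-empty only when the last byte is
# some other character.
if [ -n "$(tail -c 1 "$fixed")" ]; then
  echo >> "$fixed"
fi
echo "$rule_csv" >> "$fixed"

naive_lines=$(grep -c '' "$naive")
fixed_lines=$(grep -c '' "$fixed")
echo "naive: $naive_lines line(s), fixed: $fixed_lines line(s)"
```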

The files in the repo.bfg-report/ directory also appear to have been artificially injected into an existing commit, but it's a different commit than where the .gitattributes file was committed.

I have not run into this problem myself, but the final bfg-report directory can be deleted manually once a BFG Repo-Cleaner operation completes. If there are commits with the bfg-report directory in them, then I would recommend using the --delete-folders "bfg-report" option after running a BFG Repo-Cleaner operation to remove them, as a workaround similar to the --delete-files ".gitattributes" mentioned earlier.

I hope these tips help anyone else running across the same kinds of issues. I would try git lfs migrate import ... first, if you can, and see if it works for your repository; otherwise, BFG Repo-Cleaner is the best fallback option. At least it was in my case. Best of luck to everyone :)

jms1voalte commented 1 year ago

Thanks for responding.

krachynski commented 1 year ago
  • For the "random branch" thing ... using git clone --mirror doesn't allow you to specify which branch to check out, because if the copy you're making is into a bare repo, there is no "current branch" because there is no working directory. In this case, BFG seems to choose a random "tree" of commits and inject its changes into the root of just that tree, rather than into all root commits (i.e. commits without any parents, as created using git checkout --orphan.)

However, the HEAD file should be consistent with what would be checked out if this weren't a bare repository. Does that at least match up with where you saw work being done?

  • It would be nice if the "final report" was written to normal files outside the repo, rather than being injected into the repo contents ... or at the very least, offer an option to do this.

As it is, the documentation is geared towards running bfg from outside your git repository by default so the reports are a peer of the repository instead of embedded within it.

krachynski commented 1 year ago
  • Using a complex pattern like --convert-to-git-lfs '*.{csv,pdf,xlsx}' works for BFG (in that all of those files end up being converted to LFS objects), but the pattern doesn't work in a .gitattributes file. The BFG command should probably support multiple patterns in the --convert-to-git-lfs option (i.e. *.csv,*.pdf,*.xlsx), or support multiple instances of the --convert-to-git-lfs option on the command line. (And if it already does one of these, or has some other mechanism to allow multiple patterns, the documentation should be updated to explain how to do this.)

Definitely could use better documentation on this feature. I wasn't even sure how to try multiple file types before I found this issue.

What I'm seeing here is that the wildcards aren't actually working for me unless I specify only one type of file, e.g. --convert-to-git-lfs "*.dll". Plus, once I've run this with one file type, running it again with a second file type (*.jar) claims there is nothing to process, which is wrong since I already know that I have both file types in my repository.

If it makes any difference, I'm using bfg 1.14.0 on macOS 12.3.1, installed using Homebrew.

I'm using bfg 1.14.0 on Windows 10 in PowerShell 7 (which might explain my wildcard behaviour).

AraHaan commented 1 year ago

I wish I knew how to use this to clear out any .exe, .dll, .pyd, or .c files that snuck into my Python project's GitHub history (https://github.com/DecoraterBot-devs/DecoraterBot).

This is because it has become a pain to manually copy the .git directory over to a VPS, since it is a long wait for it to fully transfer (for 513 commits).
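Before rewriting anything, it can help to list which paths with those extensions are still reachable in history. The sketch below does that with plain Git on a throwaway repository. The file names helper.dll and bot.py are hypothetical stand-ins. The BFG invocation in the final comment is an assumption, modelled on the brace-style glob that the BFG documentation shows for --delete-files, and it is not executed here.

```shell
#!/bin/sh
set -eu

# Throwaway repo standing in for the real project; file names are hypothetical.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
commit() {
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}

echo binary > helper.dll
echo 'print("hi")' > bot.py
git add .
commit "add files"
git rm -q helper.dll
commit "remove dll"

# helper.dll is gone from the working tree but still reachable in history.
# List every path reachable from any ref, filtered by extension.
stale=$(git rev-list --objects --all \
  | grep -E '\.(exe|dll|pyd|c)$' \
  | cut -d' ' -f2- \
  | sort -u)
echo "$stale"

# Assumed follow-up, run against a fresh mirror clone (not executed here):
#   bfg --delete-files '*.{exe,dll,pyd,c}' repo.git
```

As far as the BFG documentation describes, BFG protects the files in the latest commit by default, so any matching files still present at the branch tip would have to be removed in an ordinary commit first.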