Open jms1voalte opened 2 years ago
ALSO ... it would be nice if there were an option to NOT add anything to the repo. I'm perfectly capable of creating or updating a .gitattributes file afterward (especially one that will actually work, since git doesn't recognize glob patterns with {...} in them), and I don't really need a copy of the report within the repo itself, especially if they're going to be added to some random branch rather than to the repo's primary branch.
I'm experimenting with the git lfs migrate command (part of the git-lfs package) ... apparently the issue with the repo's "primary branch" being changed has to do with how GitHub handles initial pushes from bare repos into an empty repo, because ... if I use git clone bare-dir full-dir to make a local "full clone" from the newly converted repo, the local "full clone" has the correct primary branch name (in this case master). The repo's /settings/branches page on GitHub showed the random branch name as the primary. I was able to fix it by running "gh repo edit --default-branch master org/repo" before cloning the repo, and there were no .gitattributes or BFG reports in the repo, so having them magically appear in the wrong branch wasn't a problem.
For anyone else who comes across this issue, I thought I would address some of the points, as I had to use BFG Repo-Cleaner to migrate some repositories' history to Git LFS due to severe performance issues with the standard tooling provided by Git LFS. Some of the issues raised by the OP are legitimate issues with BFG Repo-Cleaner, but they all have acceptable workarounds for anyone who needs to use BFG Repo-Cleaner. Other issues raised are either documented Git functionality or otherwise unrelated to BFG Repo-Cleaner. I wanted to clear up any misconceptions or misunderstandings future readers may have when coming across this issue.
The BFG documentation doesn't mention running git lfs install before pushing data into the newly created GitHub repo, however if you don't do this, the git push command won't push anything to the LFS server, and GitHub will complain about (or fail because of) large files, as if BFG had never been run at all.
This is not the fault of BFG Repo-Cleaner. This is a standard step required for setting up Git LFS to integrate seamlessly with Git. The Git LFS download/home page covers this.
If I use git clone to make a local clone of the bare repo directory after BFG does its thing, that clone ends up being on some random branch, rather than the correct "primary" branch (i.e. master, main, etc.). The files that BFG added will be in this branch, and not in the correct "primary" branch.
This is unrelated to BFG Repo-Cleaner. Git's clone command has a flag for specifying the branch to check out (-b, see: https://git-scm.com/docs/git-clone). The documentation covers this exact point: by default, a clone checks out the branch that the source repository's HEAD points to, so there is no guarantee that master or main is checked out. Bare repositories also have their own semantics and eccentricities, so I would recommend reading up on those separately. As well, any time we do anything with Git, we want to make sure we have checked out the correct branch manually before doing any operation.
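A minimal sketch of the -b flag in action. Everything here is an assumption made up for illustration: the throwaway sandbox, the branch names master and other, and the demo identity passed to git commit. bare-dir and full-dir are just the placeholder names used earlier in this thread.

```shell
#!/bin/sh
set -e

# Throwaway sandbox; every path and branch name here is illustrative.
work=$(mktemp -d)
cd "$work"

# Build a tiny source repo with two branches, 'master' and 'other'.
git init -q src
git -C src symbolic-ref HEAD refs/heads/master
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git -C src branch other

# Make a bare copy and point its HEAD at 'other', imitating a bare repo
# whose HEAD is not the branch you consider primary.
git clone -q --bare src bare-dir
git --git-dir=bare-dir symbolic-ref HEAD refs/heads/other

# A plain clone follows the source repository's HEAD ...
git clone -q bare-dir default-clone
default_branch=$(git -C default-clone rev-parse --abbrev-ref HEAD)

# ... while -b picks the branch explicitly.
git clone -q -b master bare-dir full-dir
chosen_branch=$(git -C full-dir rev-parse --abbrev-ref HEAD)

echo "plain clone: $default_branch, clone -b master: $chosen_branch"
```

The point of the sketch is only that the checked-out branch of a clone is decided by the source's HEAD unless -b overrides it; it says nothing about how BFG itself picks a branch.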
If I create an empty GitHub repo and git push into it from the same bare repo directory after BFG does its thing, GitHub uses the same random branch as the "primary" branch, and anybody who clones the repo from there also ends up with the same bogus primary branch.
This is also unrelated to BFG Repo-Cleaner. This is standard Git functionality, as git push depends on the currently checked-out branch. Please see https://git-scm.com/docs/git-push for more details regarding the API for git push.
The .gitattributes file that BFG creates, seems to be artificially injected into the content of an existing commit, somewhere within the parent chain of whatever "primary branch" BFG chose. Sometimes the selected commit also happens to be in the parent chain of the "correct" primary branch, but sometimes it's not.
Injecting a .gitattributes file with the contents of the expression provided to BFG Repo-Cleaner is the intended functionality, insofar as I can find in the code for the Git LFS support in BFG Repo-Cleaner. Granted, the expression may not be a valid .gitattributes expression, but the functionality is operating as designed. Clearly, it needs refinement, given that the expression provided can be, and often will be, far more expressive than the expressions accepted by .gitattributes as valid.
As for the branch the commit is made on, it should be based on the branch of the commit. Otherwise, that is standard Git functionality when modifying commits.
Using a complex pattern like --convert-to-git-lfs '*.{csv,pdf,xlsx}' works for BFG (in that all of those files end up being converted to LFS objects), but the pattern doesn't work in a .gitattributes file. The BFG command should probably support multiple patterns in the --convert-to-git-lfs option (i.e. *.csv,*.pdf,*.xlsx), or support multiple instances of the --convert-to-git-lfs option on the command line. (And if it already does one of these, or has some other mechanism to allow multiple patterns, the documentation should be updated to explain how to do this.)
This is indeed a bug in BFG Repo-Cleaner's implementation for modifying history to support Git LFS. The expression provided to BFG Repo-Cleaner is going to be far more powerful and expressive than those supported by .gitattributes. A workaround is to run BFG Repo-Cleaner again after running it for Git LFS conversions and use the --delete-files ".gitattributes" option instead. This way the problematic .gitattributes file(s) can be removed from history and manually inserted where desired by hand. I recommend using git filter-repo, as it's extraordinarily fast and performance oriented (see: https://github.com/git-lfs/git-lfs/issues/3543#issuecomment-1019633715 for an example command to insert a manually created .gitattributes file into the root commit(s) of your repository: git filter-repo --force --commit-callback "if not commit.parents: commit.file_changes.append(FileChange(b'M', b'.gitattributes', b'$(git hash-object -w .gitattributes)', b'100644'))").
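For the "manually created .gitattributes" part, here is a small sketch of one way to build the file by hand: one line per extension instead of a single brace pattern. The helper name write_lfs_attributes and the scratch directory are hypothetical; the attribute line itself is the same form that git lfs track writes.

```shell
#!/bin/sh
set -e

# Write one Git LFS tracking line per extension to the file named by $1.
# printf ends every line with \n, so the file also ends with a newline.
write_lfs_attributes() {
    out=$1
    shift
    : > "$out"
    for ext in "$@"; do
        printf '*.%s filter=lfs diff=lfs merge=lfs -text\n' "$ext" >> "$out"
    done
}

# Illustrative run in a scratch directory with the extensions from this thread.
scratch=$(mktemp -d)
write_lfs_attributes "$scratch/.gitattributes" csv pdf xlsx
cat "$scratch/.gitattributes"
```

Because each extension gets its own plain glob line, the result avoids the {...} syntax that .gitattributes does not understand, and a later echo xxx >> .gitattributes appends cleanly.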
The .gitattributes file doesn't have a newline at the end. I'm not sure if this alone will cause problems for git commands, but it will definitely cause problems if a later script does something like echo xxx >> .gitattributes. I'm also not sure what BFG does if a repo already contains a .gitattributes file in the root of the repo (i.e. does it overwrite the existing file, add its line to the end of the file, or ... ?)
Again, this is indeed due to how the .gitattributes file(s) is created/modified by BFG Repo-Cleaner. However, using the aforementioned workaround of inserting a manually constructed .gitattributes file should negate this problem by definition.
The files in the repo.bfg-report/ directory also appear to have been artificially injected into an existing commit, but it's a different commit than where the .gitattributes file was committed.
I have not run into this problem myself, but the final bfg-report directory can be deleted manually once a BFG Repo-Cleaner operation completes. If there are commits with the bfg-report directory in them, then I would recommend using the --delete-folders "bfg-report" option after running a BFG Repo-Cleaner operation to remove them, as a workaround similar to the --delete-files ".gitattributes" option mentioned earlier.
I hope these tips help anyone else running across the same kinds of issues. I would try git lfs migrate import ... first if you can, and see if it works for your repository; otherwise, BFG Repo-Cleaner is the best fallback option. At least it was in my case. Best of luck to everyone :)
Thanks for responding.
For the "random branch" thing ... using git clone --mirror doesn't allow you to specify which branch to check out, because if the copy you're making is into a bare repo, there is no "current branch" because there is no working directory. In this case, BFG seems to choose a random "tree" of commits and inject its changes into the root of just that tree, rather than into all root commits (i.e. commits without any parents, as created using git checkout --orphan).
My thought is, whatever changes BFG is injecting into its chosen commit (I'm still not clear on how it chooses which commit to modify), should instead be applied to ALL commits which have no parents, so the changes will be "visible" in all branches.
Whatever changes BFG makes to a .gitattributes file, SHOULD BE SYNTACTICALLY VALID for normal git commands. I don't know what the best solution is, however my thought is that BFG should only allow patterns which are valid in a .gitattributes file, and should allow multiple patterns, resulting in multiple lines being added to the .gitattributes file. (Also, the final .gitattributes file should end with a newline.)
I know that the current behaviour is "as designed"; my thought is that the design should be updated. BFG shouldn't be producing .gitattributes files which cause normal git commands to throw errors.
It would be nice if the "final report" was written to normal files outside the repo, rather than being injected into the repo contents ... or at the very least, offer an option to do this.
- For the "random branch" thing ... using git clone --mirror doesn't allow you to specify which branch to check out, because if the copy you're making is into a bare repo, there is no "current branch" because there is no working directory. In this case, BFG seems to choose a random "tree" of commits and inject its changes into the root of just that tree, rather than into all root commits (i.e. commits without any parents, as created using git checkout --orphan).
However, the HEAD file should be consistent on what would be checked out if this wasn't a bare repository. Does that at least match up with where you saw work being done?
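One way to check this is to ask the bare repository which branch its HEAD names. The sketch below is an illustration under assumptions: repo.git, the scratch repo, the branch names, and the demo identity are all made up, and it only exercises git symbolic-ref, not BFG.

```shell
#!/bin/sh
set -e

# Throwaway sandbox; repo.git and the branch names are illustrative.
work=$(mktemp -d)
cd "$work"

# A bare repo with two branches, created via a small scratch repo.
git init -q scratch
git -C scratch symbolic-ref HEAD refs/heads/master
git -C scratch -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git -C scratch branch other
git clone -q --bare scratch repo.git

# Which branch does the bare repo's HEAD name? This is what a later
# clone checks out by default.
head_before=$(git --git-dir=repo.git symbolic-ref HEAD)

# Repoint HEAD at another branch; no branch or commit is modified.
git --git-dir=repo.git symbolic-ref HEAD refs/heads/other
head_after=$(git --git-dir=repo.git symbolic-ref HEAD)

echo "before: $head_before"
echo "after:  $head_after"
echo "HEAD file: $(cat repo.git/HEAD)"
```

Running the first symbolic-ref command against the mirror before and after a BFG run would show whether HEAD lines up with the branch where the injected files appeared.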
- It would be nice if the "final report" was written to normal files outside the repo, rather than being injected into the repo contents ... or at the very least, offer an option to do this.
As it is, the documentation is geared towards running bfg from outside your git repository by default so the reports are a peer of the repository instead of embedded within it.
- Using a complex pattern like --convert-to-git-lfs '*.{csv,pdf,xlsx}' works for BFG (in that all of those files end up being converted to LFS objects), but the pattern doesn't work in a .gitattributes file. The BFG command should probably support multiple patterns in the --convert-to-git-lfs option (i.e. *.csv,*.pdf,*.xlsx), or support multiple instances of the --convert-to-git-lfs option on the command line. (And if it already does one of these, or has some other mechanism to allow multiple patterns, the documentation should be updated to explain how to do this.)
Definitely could use better documentation on this feature. I wasn't even sure how to try multiple file types before I found this issue.
What I'm seeing here is that the wildcards aren't actually working for me unless I specify only one type of file, e.g. --convert-to-git-lfs "*.dll". Plus, once I've run this with one file type, running again with a second file type (*.jar) claims there is nothing to process, which is wrong since I already know that I have both file types in my repository.
If it makes any difference, I'm using bfg 1.14.0 on macOS 12.3.1, installed using Homebrew.
I'm using bfg 1.14.0 on Windows 10 in PowerShell 7 (which might explain my wildcard behaviour).
I wish I knew how to use this to clear any .exe, .dll, .pyd, .c files that snuck into my Python project's GitHub history (https://github.com/DecoraterBot-devs/DecoraterBot).
This is because it has become a pain to manually copy the .git directory over to a VPS, since it is a long wait for it to fully transfer (for 513 commits in it).
I just discovered BFG yesterday. I'm in the middle of migrating 450+ repos from Bitbucket to GitHub, and I need BFG because some of the repos contain huge files.
My process looks basically like this:
Issues:
- The BFG documentation doesn't mention running git lfs install before pushing data into the newly created GitHub repo, however if you don't do this, the git push command won't push anything to the LFS server, and GitHub will complain about (or fail because of) large files, as if BFG had never been run at all.
- If I use git clone to make a local clone of the bare repo directory after BFG does its thing, that clone ends up being on some random branch, rather than the correct "primary" branch (i.e. master, main, etc.). The files that BFG added will be in this branch, and not in the correct "primary" branch.
- If I create an empty GitHub repo and git push into it from the same bare repo directory after BFG does its thing, GitHub uses the same random branch as the "primary" branch, and anybody who clones the repo from there also ends up with the same bogus primary branch.
- The .gitattributes file that BFG creates, seems to be artificially injected into the content of an existing commit, somewhere within the parent chain of whatever "primary branch" BFG chose. Sometimes the selected commit also happens to be in the parent chain of the "correct" primary branch, but sometimes it's not.
- Using a complex pattern like --convert-to-git-lfs '*.{csv,pdf,xlsx}' works for BFG (in that all of those files end up being converted to LFS objects), but the pattern doesn't work in a .gitattributes file. The BFG command should probably support multiple patterns in the --convert-to-git-lfs option (i.e. *.csv,*.pdf,*.xlsx), or support multiple instances of the --convert-to-git-lfs option on the command line. (And if it already does one of these, or has some other mechanism to allow multiple patterns, the documentation should be updated to explain how to do this.)
- The .gitattributes file doesn't have a newline at the end. I'm not sure if this alone will cause problems for git commands, but it will definitely cause problems if a later script does something like echo xxx >> .gitattributes. I'm also not sure what BFG does if a repo already contains a .gitattributes file in the root of the repo (i.e. does it overwrite the existing file, add its line to the end of the file, or ... ?)
- The files in the repo.bfg-report/ directory also appear to have been artificially injected into an existing commit, but it's a different commit than where the .gitattributes file was committed.
If it makes any difference, I'm using bfg 1.14.0 on macOS 12.3.1, installed using Homebrew.