Open vorburger opened 10 years ago
So, to summarise your issue:
This kind of question has come up before - eg in https://github.com/rtyley/bfg-repo-cleaner/issues/49#issuecomment-47591961 and the answer is a little subtle:
y.txt) when you protect a commit - you're protecting folders. If a folder changes in any way (ie a different file changes), that is enough to remove the protection from earlier versions of that folder.
I hope that explanation makes sense. It's slightly more nuanced than I wanted to put onto the main documentation page.
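To make the folder point concrete, here is a minimal sketch in a throwaway repository. The names folder, x.txt and y.txt are hypothetical, chosen only for illustration. It does not run The BFG; it only demonstrates the underlying Git fact that a folder's tree id changes whenever anything inside the folder changes, even if the file you care about is untouched.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Demo"
git -C "$repo" config user.email "demo@example.com"

mkdir "$repo/folder"
echo "big file stand-in" > "$repo/folder/y.txt"
echo "v1" > "$repo/folder/x.txt"
git -C "$repo" add folder
git -C "$repo" commit -q -m "first"

# Change only x.txt; y.txt stays byte-for-byte the same.
echo "v2" > "$repo/folder/x.txt"
git -C "$repo" commit -q -am "second"

# Compare the ids of the folder tree and of y.txt across the two commits.
old_tree=$(git -C "$repo" rev-parse HEAD~1:folder)
new_tree=$(git -C "$repo" rev-parse HEAD:folder)
old_blob=$(git -C "$repo" rev-parse HEAD~1:folder/y.txt)
new_blob=$(git -C "$repo" rev-parse HEAD:folder/y.txt)

echo "folder tree, first commit:  $old_tree"
echo "folder tree, second commit: $new_tree"
echo "y.txt blob, first commit:   $old_blob"
echo "y.txt blob, second commit:  $new_blob"
```

The two tree ids differ while the two blob ids match. Going by the explanation above, the earlier version of the folder is therefore a different object from the one in the protected commit, which would be why it does not inherit that protection.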
I had the same problem as @vorburger and the solution I came up with was that I produced a list of blob ids I wanted to remove (about 10,000 of them in the end) and asked BFG to remove said blobs. This approach worked but I would not actually recommend it as it requires a respectable amount of scripting and manual labour. I have discussed this approach previously on #51.
@rtyley tx for your answer, I think I (kind of) "get it" now. @suuntala tx for chiming in, very useful & good to know I'm clearly not the only one hitting this Q; we may consider the option of using --strip-blobs-with-ids instead of --strip-blobs-bigger-than (depending on the effort it would be for us to create the "magic shell/git scripts" to produce such a list CORRECTLY.. hm) - or we'll just accept and live with this during our SVN to Git migration.
I too was misled by the documentation:
If something questionable - like a 10MB file, when you're telling The BFG to strip out everything over 5MB - is in a protected commit, it won't be removed, and because it's still there, there's no point deleting it from earlier commits either. If you want the BFG to delete something you need to make sure your current commits are clean.
I misread "there's no point" as "there's no point and so it won't do it".
I understand the implementation details may preclude this behavior, but I would have expected a file from the protected tree to be kept in earlier commits as well.
I understand that can be a little fuzzy. In other words, git log --follow my-file would have the same history after running BFG (except for changed SHA-1s).
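One way to check that expectation could be to compare the followed history of a file in the original repository and in the cleaned one. The sketch below is only an illustration: the helper follow_subjects and the directory names before and after are hypothetical, and the "after" repository is a plain clone standing in for a BFG-cleaned copy. It compares commit subjects rather than SHA-1s, since those are expected to change.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the subjects of the commits that touched a
# path, following renames. $1 is the repository, $2 is the path.
follow_subjects() {
  git -C "$1" log --follow --format=%s -- "$2"
}

# Toy setup: "before" stands in for the original repository.
work=$(mktemp -d)
git init -q "$work/before"
git -C "$work/before" config user.name "Demo"
git -C "$work/before" config user.email "demo@example.com"
echo one > "$work/before/my-file"
git -C "$work/before" add my-file
git -C "$work/before" commit -q -m "add my-file"
echo two > "$work/before/my-file"
git -C "$work/before" commit -q -am "edit my-file"

# "after" is just a clone here; in practice it would be the cleaned copy.
git clone -q "$work/before" "$work/after"

before_log=$(follow_subjects "$work/before" my-file)
after_log=$(follow_subjects "$work/after" my-file)

if [ "$before_log" = "$after_log" ]; then
  same=yes
else
  same=no
fi
echo "history preserved: $same"
```

With a real cleaned repository in place of the clone, a result of "no" would indicate the kind of lost history described in this issue.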
@rtyley, this doesn't exactly match my earlier suggestion, but this is close.
This determines the ids of large blobs, except for blobs present in HEAD:
(This uses bash and unix utilities. The max size is specified by 1024 * 1024.)
comm -23 \
<(git rev-list --objects --all | git cat-file --batch-check="%(objecttype) %(objectname) %(objectsize) %(rest)" | grep ^blob | awk '$3 > 1024 * 1024 { print $2 }' | sort) \
<(git ls-tree -r HEAD | cut -f 1 | cut -d ' ' -f 3 | sort) \
> /tmp/large-blobs.list
java -jar bfg-1.12.0.jar -bi /tmp/large-blobs.list
I list all blobs, filter to those more than 1MB, subtract the blobs on HEAD, and output the ids to large-blobs.list. Then I use BFG to remove those blobs.
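The subtraction step relies on comm -23, which prints only the lines unique to the first of two sorted inputs. A tiny sketch with made-up ids (aaa, bbb, ccc and zzz are placeholders, not real blob ids, and the file names are hypothetical) shows the effect:

```shell
#!/bin/sh
set -eu

tmp=$(mktemp -d)

# Placeholder ids standing in for "all large blobs" and "blobs on HEAD".
# comm requires both inputs to be sorted in the same collation order.
printf '%s\n' ccc aaa bbb | LC_ALL=C sort > "$tmp/large.list"
printf '%s\n' bbb zzz | LC_ALL=C sort > "$tmp/head.list"

# -2 drops lines unique to head.list and -3 drops lines common to both,
# leaving only the large blobs that are NOT on HEAD.
to_remove=$(LC_ALL=C comm -23 "$tmp/large.list" "$tmp/head.list")
echo "$to_remove"
```

Here bbb is dropped because it also appears in the HEAD list, and zzz never shows up because it is not in the list of large blobs.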
Delete all files please
Hello @rtyley, first of all, once again thanks for this amazing tool. Here's feedback on something I'm struggling with - unless I misunderstand, files from protected commits lose their history and show up as if they were in the last commit only? Apologies if this terminology isn't 100% accurate, here's what I mean:
The use case is purging old unused "big" (mostly binary) files from an originally big (4 GB-ish) repo resulting from a git svn clone import from Subversion. So I do something like:
java -jar ../bin/bfg*.jar --private -b 512K .
- works great, super fast.
As there are some files >512k on HEAD, and because "BFG assumes that your latest commit is a good one, with none of the dirty files you want removing from your history still in it." (great, tx), I obviously get some:
Scanning packfile for large blobs: 387045
Scanning packfile for large blobs completed in 2,230 ms.
Found 1089 blob ids for large blobs - biggest=653983912 smallest=262726
Total size (unpacked)=5219004150
Found 24785 objects to protect
Found 3 commit-pointing refs : HEAD, refs/heads/master, refs/remotes/git-svn
Protected commits
These are your protected commits, and so their contents will NOT be altered:
What's... "sub-optimal" is that e.g. the badaboum file in the repo now appears to have been (deleted first and then) created in the last commit - its history appears to have been lost! :( I'm sure this is for a good technical reason of the current implementation - but is there any way to fix / improve this, or any advice/trick/workaround you may have? To illustrate:
git show | grep folder/badaboum
diff --git a/folder/badaboum b/folder/badaboum
+++ b/folder/badaboum
diff --git a/folder/badaboum.REMOVED.git-id b/folder/badaboum.REMOVED.git-id
--- a/folder/badaboum.REMOVED.git-id
diff --git a/folder/badaboum_template b/folder/badaboum_template
+++ b/folder/badaboum_template
diff --git a/folder/badaboum_template.REMOVED.git-id b/folder/badaboum_template.REMOVED.git-id
--- a/folder/badaboum_template.REMOVED.git-id
Ideally, I would have hoped that files like badaboum just... stay wherever they are in the history. Possible?