rtyley / bfg-repo-cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala
https://rtyley.github.io/bfg-repo-cleaner/
GNU General Public License v3.0
10.95k stars, 540 forks

Potential bug: more files are purged than expected #346

Open nishaatr opened 4 years ago

nishaatr commented 4 years ago

I ran:

    java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids allblobs.txt test-repo

The allblobs.txt file contained a single blob SHA-1, the one for testing.nupkg (e69de29bb2d1d6434b8b29ae775ad8c2e48c5391). That file was added once, never updated, and is 0 bytes in size. However, the output below shows that text113.txt was also deleted:

    Filename        Git id
    ------------------------------
    testing.nupkg | e69de29b (0 B)
    text113.txt   | e69de29b (0 B)

The end result was that many commits were rewritten. If I instead run BFG with --delete-files:

    java -jar bfg.jar --delete-files testing.nupkg test-repo --no-blob-protection

Then it only deletes the specified file:

    Filename         Git id
    -------------------------------
    testing2.nupkg | e69de29b (0 B)

So it looks like --strip-blobs-with-ids behaves differently, and it's concerning that BFG removes the extra file without any warning. One likely reason is that Git stores both files as the same blob, since both were 0 bytes, and BFG matches on blob ID without checking filenames. That said, I'm not sure how BFG could know which files to keep and which to delete based on blob ID alone. Nonetheless, given that e69de29b was removed, I'm not sure what happens to text113.txt.
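To illustrate the blob-sharing theory: Git derives a blob's ID from its content only (the SHA-1 of a short `blob <size>` header plus the bytes), never from the filename. The sketch below reproduces that calculation; `git_blob_id` is a hypothetical helper name for illustration, not part of BFG or Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git would assign to a blob holding `content`.

    Hypothetical helper for illustration. Git hashes the header
    b"blob <size>\\0" followed by the raw bytes; the filename is not included.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Every zero-byte file hashes to the same ID, whatever it is called,
# so testing.nupkg and text113.txt would share one blob if both are empty.
empty_id = git_blob_id(b"")
print(empty_id)
```

If that is what happened here, any ID listed in allblobs.txt would match every path that points at the same content, which would explain why both files show up in the output.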

Great tool btw and thank you for providing it!

nishaatr commented 4 years ago

Note: the reason for using SHAs rather than filenames is that BFG does not support file paths with --delete-files. Unfortunately, this has forced me to write a more complicated script, which takes much longer to run because it makes multiple git.exe calls to get all the SHA IDs for a given file. This is not a criticism of BFG, but I'm hoping file paths will be supported at some point.
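One possible way to cut down the number of git.exe calls is to ask `git log` for the raw diff of a single path across all refs and parse the blob IDs out of that one output. This is only a sketch under assumptions: `parse_raw_blob_ids` and `blob_ids_for_path` are hypothetical names, the path is treated as a plain pathspec, and I have not measured it against a large repository.

```python
import subprocess
from typing import Set


def parse_raw_blob_ids(raw_output: str) -> Set[str]:
    """Extract object IDs from `git log --raw --no-abbrev` output.

    Each raw line looks like:
    :<old mode> <new mode> <old id> <new id> <status>\\t<path>
    """
    ids: Set[str] = set()
    for line in raw_output.splitlines():
        if not line.startswith(":"):
            continue  # skip blank lines between commits
        fields = line.split("\t", 1)[0][1:].split()
        if len(fields) != 5:
            continue  # not a two-sided raw diff line
        _old_mode, _new_mode, old_id, new_id, _status = fields
        for object_id in (old_id, new_id):
            if set(object_id) != {"0"}:  # all zeros means "no blob" (add/delete)
                ids.add(object_id)
    return ids


def blob_ids_for_path(repo_dir: str, path: str) -> Set[str]:
    """Collect every blob ID the given path has had, using one git call."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--all", "-m", "--raw", "--no-abbrev",
         "--no-renames", "--format=", "--", path],
        check=True, capture_output=True, text=True,
    )
    return parse_raw_blob_ids(result.stdout)
```

Something like `blob_ids_for_path("test-repo", "some/dir/testing.nupkg")` could then feed the file passed to --strip-blobs-with-ids. It would still hit the shared-blob problem described above, so it may be worth checking whether any collected ID is also used by other paths before stripping it.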