rtyley / bfg-repo-cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala
https://rtyley.github.io/bfg-repo-cleaner/
GNU General Public License v3.0
11.11k stars, 549 forks

Surprising behavior - history is not preserved correctly. #267

Closed by jhdub23 6 years ago

jhdub23 commented 6 years ago

Let's say you have the following git repo:

top
    foo
        a_file
        some_file
    bar
        some_file

You want to remove ONLY foo/some_file, so you run git rm foo/some_file, commit, and then run the BFG. You want to keep bar/some_file, and you rely on current files being "sacred". After the run, your HEAD looks correct:

top
    foo
        a_file
    bar
        some_file

However, bar/some_file ONLY exists on HEAD. Performing git checkout HEAD~1 shows that bar/some_file is gone (and the log will show that bar/some_file was ADDED on the HEAD). I would expect the entire history of bar/some_file to be preserved, but this is not the case.
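The effect described above can be reproduced with a toy model. This is only a sketch, not the BFG's actual implementation: it assumes deletion matches on the file name rather than the full path, and that only the newest commit is protected. The `history` data and the `delete_by_name` helper are hypothetical names made up for illustration.

```python
from posixpath import basename

# Hypothetical history, oldest commit first. Each commit is the set of paths it contains.
history = [
    {"foo/a_file", "foo/some_file", "bar/some_file"},  # HEAD~1
    {"foo/a_file", "bar/some_file"},                   # HEAD, after git rm foo/some_file
]

def delete_by_name(history, name):
    """Remove every path whose file name equals `name` from all commits
    except the last one, which is treated as protected."""
    *older, head = history
    cleaned = [{p for p in tree if basename(p) != name} for tree in older]
    return cleaned + [set(head)]

rewritten = delete_by_name(history, "some_file")

# bar/some_file survives only in the protected HEAD commit.
for label, tree in zip(["HEAD~1", "HEAD"], rewritten):
    print(label, sorted(tree))
```

Because the match is on the name alone, bar/some_file is stripped from HEAD~1 together with foo/some_file, which is why it appears to be newly added on HEAD.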

rtyley commented 6 years ago

"Your current files are sacred" in that the files in your current commit are not changed by the BFG: by default it will not change the file tree of your current commit. That's as far as that guarantee goes.

As for using the BFG to edit history based on path, see https://stackoverflow.com/a/21172871/438886

jhdub23 commented 6 years ago

This limitation of the BFG makes it unusable on our repository. foo/some_file is the very large file that was accidentally pushed, and it is a completely separate entity from bar/some_file. The only thing they have in common is that they happen to have the same name.

This makes using the BFG on a large repository very dangerous, due to the possibility of name collisions in different directories. The larger the repo, the greater the chance of a name collision. Even worse, you don't know that you've accidentally invalidated some commits in your history. Do you really know that the name "some_file" was never used anywhere else in your entire history? It may not even exist in your HEAD because of renames.
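One way to check for this risk before rewriting is to look for file names that occur in more than one directory. The sketch below assumes you have already collected every path that ever appeared in history into a list (for example from the output of `git log --all --name-only --pretty=format:`). The `name_collisions` helper and the sample `all_paths` list are hypothetical.

```python
from collections import defaultdict
from posixpath import basename, dirname

def name_collisions(paths):
    """Map each file name to the directories it appears in,
    keeping only names that occur in more than one directory."""
    dirs_by_name = defaultdict(set)
    for path in paths:
        dirs_by_name[basename(path)].add(dirname(path))
    return {name: sorted(dirs) for name, dirs in dirs_by_name.items() if len(dirs) > 1}

# Hypothetical list of every path seen anywhere in history.
all_paths = ["foo/a_file", "foo/some_file", "bar/some_file"]

collisions = name_collisions(all_paths)
print(collisions)
```

Any name that shows up in the result would be affected in every listed directory by a deletion that matches on file name only.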

This should be clearly spelled out in the documentation. I assumed "sacred" meant the current commit plus its history. I just happened to catch the corrupt history, and went back to plain old "git filter-branch". I'll take the run-time hit in order to guarantee correctness.