rtyley / bfg-repo-cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala
https://rtyley.github.io/bfg-repo-cleaner/
GNU General Public License v3.0
11.02k stars 545 forks source link

wrong dirty files count in protected commit #192

Open bam80 opened 7 years ago

bam80 commented 7 years ago

I'm confused with how such huge amount of dirty files contained in protected commit counted - it even bigger than files in total the commit originally had!

abutirsky@sml-pc-066:~/hi3798mv100/HiSTBLinuxV100R003C00SPC060$ java -jar ~/bfg-1.12.14.jar -b 1M

Using repo : /home/abutirsky/hi3798mv100/HiSTBLinuxV100R003C00SPC060/.git

Scanning packfile for large blobs: 94271
Scanning packfile for large blobs completed in 3,763 ms.
Found 147 blob ids for large blobs - biggest=724343678 smallest=1050624
Total size (unpacked)=2250136105
Found 85703 objects to protect
Found 3 commit-pointing refs : HEAD, refs/heads/hisilicon, refs/heads/master

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 32fcf8d0 (protected by 'HEAD') - contains 126 dirty files : 
        - sample/castbluetooth/myalsatest (1.4 MB)
        - sample/castbluetooth/sample_castbluetooth (1.4 MB)
        - ...

So it say commit 32fcf8d0 contains 126 dirty files, but actually it contained 5 files in total, and only one of which is durty:

$ git show 32fcf8d0 --stat
commit 32fcf8d0195ce0ac455e16cd40762b00b2e83610
Author: Andrey Butirsky <abutirsky@smartlabs.tv>
Date:   Fri Dec 23 18:12:57 2016 +0300

    hisilicon_patch/RTC_Patch_20161214/SDK_patch/HiSTBLinuxV100R003C00SPC061_QFP_HW_TEST_patch_20161214/

 readme.txt                                                               |   4 +-
 source/component/hiplayer/lib/full/arm-histbv310-linux/libavformat.so.53 | Bin 1088070 -> 1088070 bytes
 source/component/loader/app/upgrd_config.c                               | 443 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------------------
 source/kernel/linux-3.18.y/drivers/net/ethernet/hieth/phy.c              |   4 +-
 source/msp/drv/vdec/drv_vdec_pts_recv.c                                  |   7 +++
 5 files changed, 249 insertions(+), 209 deletions(-)

After the command execution, the commit became to contain all these dirty files from previous commits, instead of just deleting them:

$ git show  --stat
commit dff5aba5ff99ccc13acf8dd38e0d3b432b09f243
Author: Andrey Butirsky <abutirsky@smartlabs.tv>
Date:   Fri Dec 23 18:12:57 2016 +0300

    hisilicon_patch/RTC_Patch_20161214/SDK_patch/HiSTBLinuxV100R003C00SPC061_QFP_HW_TEST_patch_20161214/

    Former-commit-id: 32fcf8d0195ce0ac455e16cd40762b00b2e83610

 readme.txt                                                                                                                       |      4 +-
 sample/castbluetooth/myalsatest                                                                                                  |    Bin 0 -> 1478990 bytes
 sample/castbluetooth/myalsatest.REMOVED.git-id                                                                                   |      1 -
 sample/castbluetooth/sample_castbluetooth                                                                                        |    Bin 0 -> 1478997 bytes
 sample/castbluetooth/sample_castbluetooth.REMOVED.git-id                                                                         |      1 -
 sample/dolby/res/DroidSansFallbackLegacy.ttf                                                                                     |    Bin 0 -> 3081908 bytes
 sample/dolby/res/DroidSansFallbackLegacy.ttf.REMOVED.git-id                                                                      |      1 -
 sample/factory_detect/res/DroidSansFallbackLegacy.ttf                                                                            |    Bin 0 -> 3081908 bytes
 sample/factory_detect/res/DroidSansFallbackLegacy.ttf.REMOVED.git-id                                                             |      1 -
 sample/gpu/mali700/es20/sphere/sphere_res/stpeters_cross_mipmap_256.dds                                                          |    Bin 0 -> 2097272 bytes
 sample/gpu/mali700/es20/sphere/sphere_res/stpeters_cross_mipmap_256.dds.REMOVED.git-id                                           |      1 -
 sample/higo/res/DroidSansFallbackLegacy.ttf                                                                                      |    Bin 0 -> 3081908 bytes
 sample/higo/res/DroidSansFallbackLegacy.ttf.REMOVED.git-id                                                                       |      1 -
 sample/higo/res/performance.bmp                                                                                                  |    Bin 0 -> 6220854 bytes
 sample/higo/res/performance.bmp.REMOVED.git-id                                                                                   |      1 -
 sample/png/res/Img201006300005_N.png                                                                                             |    Bin 0 -> 1357695 bytes
 sample/png/res/Img201006300005_N.png.REMOVED.git-id                                                                              |      1 -
 sample/pvr/res/DroidSansFallbackLegacy.ttf                                                                                       |    Bin 0 -> 3081908 bytes
 sample/pvr/res/DroidSansFallbackLegacy.ttf.REMOVED.git-id                                                                        |      1 -
 sample/tde/res/Img201006300005_N.bits                                                                                            |    Bin 0 -> 2797584 bytes
...............

Either is something wrong here, or I totally misunderstood how the program works..

javabrett commented 7 years ago

This does not look like a bug. You might try asking on stackoverflow tagged with bfg-repo-cleaner.

bam80 commented 7 years ago

Yes, I realized later it was a feature. Think it should be explained better in the documentation, as it hard to understand at first..

javabrett commented 7 years ago

Your problem here is that you are considering the output of git show --stat, which shows the changes made in that commit, as representing all of the files present in your repository tree at the point-in-time of the commit.

BFG protects a commit by protecting the tree pointed-to by that commit, which is all directories and files at that point-in-time, not just the changes made by that commit.

Please close this issue.

bam80 commented 7 years ago

Yes, I understand now, but obviously it's not only my problem - people seems face with it often. This is the evidence that the documentation should be improved, or at least corresponding issue raised. Only then I would close this issue. @rtyley what do you think?

javabrett commented 7 years ago

Just wanted to check if you have read https://rtyley.github.io/bfg-repo-cleaner/#protected-commits . Do you have perhaps a pull-request or edits you would recommend there?

bam80 commented 7 years ago

Maybe explain more clearly what dirty files is (they are not only files committed in protected commit), and that those dirty files eventually will be "moved" to that last commit, if it actually "contains" them. This behavior is not obvious, and shocks first time you face it

javabrett commented 7 years ago

Actually dirt-files are only protected in a protected commit - but to accept that, you need to accept that a commit in Git points to a tree which is a graph of the files at the point-in-time of that commit. This is how Git works and is irrefutable. You need to dismiss the notion that a Git commit is just a diff from the previous commit.

What is somewhat surprising to new users is that BFG appears to "shift" dirt files to appear as though they were just added in the (now-rewritten) commit which protected them. This has been discussed in numerous issues but is not mentioned heavily in docs. This is caused by BFG protecting only the tree of the protected commit, therefore the dirt is removed from ancestor trees and so the resulting rewritten repo appears to re-add the dirt on the final, protected commit. It appears to be added because Git shows it in a log/diff to the previous commit, whereas in reality the tree has not changed. The change appears because the file's history in ancestor commits was removed.

There is an attempt to address this in #149 , which proposes to protect dirt-blobs (appearing in the tree of a protected commit) in all trees where they are found, not just the protected trees. This would at least push the appearance of the "add" of the dirt file back to the last commit in which it was modified, or if never modified, the commit where it was added (no change).