rtyley / bfg-repo-cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala
https://rtyley.github.io/bfg-repo-cleaner/
GNU General Public License v3.0
10.95k stars 540 forks

`bfg --replace-text` created new commits without sensitive data but left the previous commit #302

Open ggrelet opened 5 years ago

ggrelet commented 5 years ago

I ran the following on a --mirror clone of my repository:

$ bfg --replace-text passwords.txt  my-repo.git
$ cd my-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push

So BFG created new commits, with the proper timestamps, for each commit that contained sensitive passwords. But it didn't replace the original commits; it just created clean duplicates alongside the original bad commits. So I now have twice as many commits as I had before...

I can list the SHAs of all the bad commits using:

$ git grep "my-password" $(git rev-list --all) | cut -c1-40

How can I permanently remove those commits from my remote repository? Is that good practice?

Thanks for your help

javabrett commented 5 years ago
ggrelet commented 5 years ago
  • How is your remote repository hosted (server etc.)?

It's hosted on a private company GitLab server.

  • Do you know how to force a remote gc?

No.

  • Did git push issue any errors?

No it didn't.

  • Via which branches/refs are your old/new commits available?

I'm not sure I understand your question. Every commit containing a sensitive "password" now has a doppelgänger without the password, containing the string ***REMOVED*** instead. This is true for every branch. Do you want me to check whether the files have the same ref?

Thanks for your help on this.

javabrett commented 5 years ago

You'll need to work out what works for you in GitLab. Many Git servers hold long-lived references to history, e.g. as pull/merge requests. You might find these comments helpful.

If everything worked properly, all your local-clone branches and their matching remote branches (e.g. master) will now reference the new histories sans passwords. If the old commits are still present, then either they are now dangling and can eventually be garbage-collected, or (unfortunately more likely) some hidden/management ref in your remote still references them. For that reason, a lot of repos on managed remote servers seem to need replacing rather than a simple force-push.
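
To see whether a remote holds such extra refs, one option is to list everything it advertises and filter out ordinary branches and tags. The following is only a sketch: it builds a throwaway local bare repository as a stand-in for the server, the function name list_extra_refs is made up, and refs/merge-requests/1/head is an assumed example of a management ref rather than something reported in this thread.

```shell
#!/bin/sh
set -e
# Identity for the demo commit (hypothetical values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Print refs advertised by a remote that are neither branches nor tags.
# $1 is a remote name, URL, or local path.
list_extra_refs() {
    git ls-remote "$1" |
        awk '$2 != "HEAD" && $2 !~ /^refs\/(heads|tags)\// { print $2 }'
}

# Demo setup: a bare repo playing the role of the server.
demo=$(mktemp -d)
git init -q --bare "$demo/server.git"
git init -q "$demo/work"
cd "$demo/work"
echo "password=123456" > file.txt
git add file.txt
git commit -q -m "commit with a password"

# One normal branch plus one management-style ref (assumed name).
git push -q "$demo/server.git" HEAD:refs/heads/main HEAD:refs/merge-requests/1/head

extra_refs=$(list_extra_refs "$demo/server.git")
echo "$extra_refs"
```

Against a real server you would pass the remote name, e.g. list_extra_refs origin. A server may also keep refs that it does not advertise at all, in which case this listing will not show them.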

7yl4r commented 5 years ago

I am using GitHub and seeing the same thing. The linked workaround might work, but I think BFG is failing to clean something it needs to in the local repo.

# 1. clean the repo 
java -jar ~/bfg-1.13.0.jar --replace-text passwords.txt
git reflog expire --expire=now --all && git gc --prune=now --aggressive

# 2. update remote
git push -f origin master

# repeat 1 & 2 for each branch

# search for password usage in local repo
git grep "$MY_PASSWORD" $(git rev-list --all) | cut -c1-40
# (it's still there)

Even worse, a fresh clone of the repo has passwords still in it.

git clone https://github.com/username/my-repo/
git grep "$MY_PASSWORD" $(git rev-list --all) | cut -c1-40
# hey, there's my password in some dangling refs!

You can access these files via GitHub at https://github.com/username/my-repo/blob/REF_HASH_HERE/path/to/file.

Note that there are no passwords remaining in git rev-list --branches *

git grep "$MY_PASSWORD" $(git rev-list --branches *) | cut -c1-40
# empty!

Does this mean those revs have no branch but are for some reason not cleaned by git reflog expire --expire=now --all && git gc --prune=now --aggressive?
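
One way to approach this question is to ask Git which refs still contain a leaking commit. git rev-list --all walks every ref under refs/ plus HEAD, while --branches covers only refs/heads/, and git gc only prunes objects that no ref can reach. The toy reproduction below illustrates that difference under assumed names (refs/keep/old-tip is hypothetical, and the "rewrite" is simulated with plumbing instead of BFG); it does not claim to show what actually happened in the repository above.

```shell
#!/bin/sh
set -e
# Identity for the demo commits (hypothetical values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"

echo "password=123456" > file.txt
git add file.txt
git commit -q -m "commit with a password"
bad=$(git rev-parse HEAD)

# Keep the old commit alive through a ref outside refs/heads/ (assumed name).
git update-ref refs/keep/old-tip "$bad"

# Stand-in for the history rewrite: move the branch to a clean root commit.
echo "password=***REMOVED***" > file.txt
git add file.txt
clean=$(git commit-tree "$(git write-tree)" -m "clean commit")
git reset -q --hard "$clean"

git reflog expire --expire=now --all
git gc -q --prune=now

# Commits reachable from some ref, but from no branch.
non_branch=$(git rev-list --all --not --branches)
# Refs that keep the bad commit reachable, which is why gc leaves it alone.
holders=$(git for-each-ref --contains "$bad" --format='%(refname)')

echo "commits outside all branches: $non_branch"
echo "refs holding the bad commit:  $holders"
```

In a real clone, running the last two commands with one of the SHAs from the git grep output in place of $bad should name the refs (if any) that keep that commit alive.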

javabrett commented 5 years ago

@7yl4r Best to do some more problem isolation: in particular, check whether you have succeeded in eliminating the desired data locally after your step 1, before attempting any remote pushes, clones, or tests. That way you can tell whether BFG is not cleaning the commits properly or whether you are unable to get the new state into the remote. The latter is often the case with GitHub or other remotes with hidden/read-only pull-request refs.

Make sure you follow the docs carefully to mirror the remote before running BFG. Post all commands here if still having trouble.
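
A minimal sketch of the local check suggested above, assuming a POSIX shell. The function name count_leaking_commits and the sample secret are made up for illustration. It counts the commits reachable from any ref whose tree contains a fixed string, one commit at a time, which is slow on large histories but avoids passing thousands of SHAs on a single command line.

```shell
#!/bin/sh
set -e
# Identity for the demo commits (hypothetical values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Print the number of commits (reachable from any ref) whose tree contains
# the fixed string $1. Must be run inside a Git repository.
count_leaking_commits() {
    git rev-list --all | while read -r rev; do
        # -F: fixed string, -e: pattern may start with a dash, -q: exit status only
        if git grep -q -F -e "$1" "$rev"; then
            echo "$rev"
        fi
    done | wc -l | tr -d ' '
}

# Demo: the secret is removed from HEAD but remains in the first commit.
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
echo "password=123456" > file.txt
git add file.txt
git commit -q -m "+ test file"
echo "password=***REMOVED***" > file.txt
git commit -q -a -m "clean pw from HEAD"

leaks=$(count_leaking_commits "123456")
echo "commits still containing the secret: $leaks"
```

If the count is not zero right after the BFG run and the gc step, the problem is local; if it is zero locally but the data reappears in a fresh clone, the problem lies in getting the new state into the remote.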

OrderConcept commented 4 years ago

My repo is on GitHub and I have the same problem as the OP:

"So BFG created new commits, with the proper timestamps, for each commit that contained sensitive passwords. But it didn't replace the original commits; it just created clean duplicates alongside the original bad commits. So I now have twice as many commits as I had before..."

I ran the following commands:

$ git clone --mirror https://github.com/user/my-repo.git
$ java -jar bfg-1.13.0.jar --replace-text passwords.txt my-repo.git
$ cd my-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push

How can I permanently remove the "original bad commits (with passwords)" on my remote repository?

Thanks in advance

7yl4r commented 4 years ago

I have attempted to create a minimal bash script to reproduce this issue and was not successful:

# setup
mkdir tmp_repo
cd tmp_repo/
git init

# commit the pw to repo
echo "password=123456;" > file.txt
git add file.txt 
git commit -a -m '+ test file'

# remove pw from HEAD
sed -i 's/123456/*******/g' file.txt
git commit -a -m 'clean pw from HEAD'

# use bfg
echo "123456" > passwords.txt
java -jar ~/bfg-1.13.0.jar --replace-text passwords.txt
git reflog expire --expire=now --all
git gc --prune=now --aggressive

Now you can search for the password (123456), and see it is not there.

git grep 123456 $(git rev-list --all) | cut -c1-40
# no output

Any thoughts on how to expand this to reproduce the issue? Perhaps the issue has to do with the remote or the use of branches?
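
One possible extension, sketched under loud assumptions: add a bare "server" repository that carries a pull-request-style ref and refuses pushes to it. Here that refusal is emulated with Git's receive.hideRefs setting; whether GitHub or GitLab behave exactly this way is not confirmed anywhere in this thread. The ref name refs/pull/1/head is an assumed example, and the BFG step is replaced by plumbing commands that point every ref at a clean commit, so the script runs without Java.

```shell
#!/bin/sh
set -e
# Identity for the demo commits (hypothetical values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

top=$(mktemp -d)

# 1. Bare "server" repo with one branch and one pull-request-style ref.
git init -q --bare "$top/server.git"
git -C "$top/server.git" symbolic-ref HEAD refs/heads/master
git init -q "$top/seed"
cd "$top/seed"
echo "password=123456;" > file.txt
git add file.txt
git commit -q -m "+ test file"
bad=$(git rev-parse HEAD)
git push -q "$top/server.git" HEAD:refs/heads/master HEAD:refs/pull/1/head

# 2. Assumption: the server rejects pushes to refs/pull/*.
git -C "$top/server.git" config receive.hideRefs refs/pull

# 3. Mirror-clone, then stand in for BFG: point every ref at a clean commit.
git clone -q --mirror "$top/server.git" "$top/mirror.git"
cd "$top/mirror.git"
blob=$(printf 'password=***REMOVED***;\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tfile.txt\n' "$blob" | git mktree)
clean=$(git commit-tree "$tree" -m "+ test file")
git for-each-ref --format='%(refname)' | while read -r ref; do
    git update-ref "$ref" "$clean"
done

# 4. Push everything back. The hidden ref is rejected, so push exits non-zero.
git push --mirror || echo "push reported rejected refs"

# 5. Inspect the server: the branch is clean, the old commit is still referenced.
branch_tip=$(git -C "$top/server.git" rev-parse refs/heads/master)
holders=$(git -C "$top/server.git" for-each-ref --contains "$bad" --format='%(refname)')
echo "server branch tip is the clean commit: $branch_tip"
echo "refs still containing the old commit:  $holders"
```

In this setup the branch is rewritten on the server while the rejected ref keeps the original commit reachable, so a later mirror clone would fetch it again. Swapping step 3 for a real BFG run would show whether the same pattern appears with the actual tool.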

rrotondo commented 11 months ago

I can confirm the issue. Moreover, if you have issues on GitLab that refer to the previous commits, those commits are not deleted and the files can still be browsed. Maybe it's a cache problem, but the sensitive data is still reachable.