rtyley / bfg-repo-cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala
https://rtyley.github.io/bfg-repo-cleaner/
GNU General Public License v3.0

Feature request/idea: dry-run mode #17

Open · tfnico opened this issue 11 years ago

tfnico commented 11 years ago

In order to preview which files would be removed with -b, I first used a Perl script to see which files would be deleted. However, it would seem practical if the BFG could be run in a dry-run mode to see what the output would be, without actually making any changes to the repo.

Of course, it's also easy to just make another clone and do the test run on that first. But if a dry-run mode is easy to implement, why not?

rtyley commented 11 years ago

Hi there @tfnico - hmm, my reply to this got surprisingly long, which is weird given how simple the feature sounds (I guess this is probably an indication of how obsessive I am about this stuff at the moment).

Ok, to define the feature story:

As a user, I'd like to be able to get useful feedback on what the BFG would do if executed with the supplied settings, but without the BFG actually changing the state of the repo ...so that I feel more confident about experimenting with the BFG, and (ideally) enjoy a faster feedback loop than having to delete the result of my experiment and re-grab a copy of the original repo every time I try something out.

It's possible to do a small, imperfect chunk of this without any problem - if we're talking specifically about the -b switch (i.e. --strip-blobs-bigger-than) then we can already scan the Git packfile in advance to work out the hash-ids of objects bigger than that limit, and display them to the user - but that only gives you the hash-ids, not the file names or file paths, which is not very friendly.
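
For reference, that scan can be approximated today with stock git plumbing - a rough sketch, using an example 1 MiB threshold:

# list blobs bigger than 1 MiB, as "<blob-id> <size-in-bytes>"
$ git cat-file --batch-all-objects \
    --batch-check='%(objecttype) %(objectname) %(objectsize)' \
  | awk '$1 == "blob" && $3 > 1048576 {print $2, $3}'

# mapping a blob-id back to the path(s) it appears at needs a (slow) history walk
$ git rev-list --objects --all | grep '^<blob-id>'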

Once we're talking about evaluating the results of other operations, i.e. the --replace-text flag, we're basically saying we have to run The BFG for real, but we don't want it to update refs (easy) or indeed write cleaned objects to the repo object database (gives me pause... it's possible, so long as none of the cleaners ever want to examine the contents of previously cleaned objects... should be ok, I think). So we're probably going to end up with an execution run-time that is very, very close to just running the BFG for real, but it does at least mean the user doesn't have to wipe and re-copy their Git repo for every experiment.

Perhaps surprisingly, given the identical runtime, this means the user in some ways ends up with less diagnostic information than if they'd just run the BFG for real, because now they can't actually examine the cleaned commits... which means the diagnostic output from the BFG needs to be beefed up - although I'm quite proud of the diagnostic output that The BFG does supply, it could still do with improvement, i.e. some variant of the stuff in #14 (display diffs of changed content) and #15 (log detailed diagnostics to file) to make --dry-run genuinely useful.

tfnico commented 11 years ago

@rtyley Sounds great. It would certainly be cool to output diffs for replaced texts, and lists of files that have been deleted.

I think rewriting without changing the refs sounds like a fair compromise. Performance is smooth anyhow.

alistra commented 10 years ago

I second this feature; I'm just afraid to use it on a big company repo without a dry-run option.

rtyley commented 10 years ago

> I second this feature; I'm just afraid to use it on a big company repo without a dry-run option.

Hi @alistra - just so I can better understand the use-case, is there any reason you can't just do a git clone --mirror on the repo, and run The BFG on that local copy?
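
i.e. something like this - a sketch, with a placeholder URL and threshold:

$ git clone --mirror git@example.com:big/repo.git repo-test.git
$ java -jar bfg.jar --strip-blobs-bigger-than 1M repo-test.git
# inspect repo-test.git; if anything looks wrong, just delete it and re-clone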

alistra commented 10 years ago

It's harder to see the changes; I would have to manually browse around 6 branches (that are around 2 years old). It would be nicer to just go through a list of changes as branch/file pairs and check that we wouldn't accidentally delete something important.

Not all of the code in the repo is used all the time, so the mistake wouldn't be obvious right away.

rtyley commented 10 years ago

> I would have to manually browse around 6 branches (that are around 2 years old). It would be nicer to just go through a list of changes as branch/file pairs and check that we wouldn't accidentally delete something important.

Would you want to check every single commit on those branches (which could potentially be a lot of very repetitive information), or would you be interested in just the tips of those branches, i.e. the latest commit on each branch?
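
For what it's worth, the branch tips are already cheap to check with plain git - a rough sketch, again with an example 1 MiB threshold:

# list oversized blobs at the tip of every local branch: "<branch> <size> <path>"
# (paths containing spaces get truncated here - fine for a rough check)
for b in $(git for-each-ref --format='%(refname:short)' refs/heads); do
  git ls-tree -r --long "$b" \
    | awk -v b="$b" '$2 == "blob" && $4 > 1048576 {print b, $4, $5}'
done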

alistra commented 10 years ago

The ideal solution would deduplicate the same files and tell me:

would remove file dir/dir2/foo [56 commits]
would remove file dir/bar [22 commits, 3 branches]

Then, in order of usefulness, I would like checking the tips of branches, then the whole big dump of data.
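
Something close to that report can be cobbled together with plain git today - a rough sketch, assuming the oversized blob-ids have already been collected into big-blobs.txt (e.g. by a packfile scan like the one mentioned above):

# resolve blob-ids to paths, de-duplicate, then count the commits touching each path
git rev-list --objects --all \
  | grep -Ff big-blobs.txt \
  | cut -d' ' -f2- | sort -u \
  | while read -r path; do
      n=$(git rev-list --count --all -- "$path")
      echo "would remove file $path [$n commits]"
    done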

arturhoo commented 10 years ago

I don't necessarily need a --dry-run option, but it would be handy to have a list of the file names and paths that were removed after running bfg.

I am also afraid of accidentally deleting a big file that could be useful in the future (although not present in my most recent commit).

danijar commented 10 years ago

Dry run would be very useful in my opinion, too.

dandv commented 10 years ago

:+1: for dry runs. I'm an intermediate git user, and what I'd like is to:

  1. Simulate the bfg operation
  2. Check out the repo as it would be after pushing it and cloning.

That way I can compare the bfg'ed repo with one of known quality and ensure files are fine.
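
That workflow is possible today with a throwaway clone - a sketch, with placeholder URL, threshold, and directory names:

$ git clone --mirror git@example.com:my/repo.git test.git    # grab a bare copy
$ java -jar bfg.jar -b 1M test.git                           # 1. simulate: run bfg on the copy only
$ git clone test.git test-checkout                           # 2. check it out as if freshly pushed and cloned
$ diff -r --exclude=.git known-good-checkout test-checkout   # compare against a checkout of known quality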

winny- commented 9 years ago

:+1: I literally downloaded bfg and looked for a --dry-run flag before actually deleting the blobs, only to find it doesn't exist.

lwcolton commented 9 years ago

+1

javabrett commented 9 years ago

As alluded to in comments above, I think this request decomposes into a few separate pieces.

Also as commented above, I'm trying to understand the advantage of a --dry-run over simply making a super-cheap local clone of the target repository, and running bfg on that as the dry-run. It's better than a dry-run, since you get a risk-free look at what bfg will actually output, with a detailed report, rather than a simple report of intent.

Since (on Linux, anyway) git clone will by default create hard-links when you clone the repo locally, that step is super-fast even for massive repos, and you can then run bfg on that clone as if it were independent of the original. Note, however, that git clone will not give you an independent on-disk backup unless you specify the --no-hardlinks option to prevent hard-links from being created between your two local repos' object stores.

Say I pick a decent-sized repo, the Linux kernel, and run an academic -b 1M clean on it, timing each step. First I'll create a hard-linked local clone:

$ time git clone linux linux-bfg-test-run
Cloning into 'linux-bfg-test-run'...
done.
Checking out files: 100% (51567/51567), done.

real    0m4.562s
user    0m3.438s
sys 0m1.120s

... then run bfg:

~/git/linux-bfg-test-run(master) $ time java -jar ~/Downloads/bfg-1.12.5.jar -b 1M
...
Cleaning
--------

Found 547745 commits
Cleaning commits:       100% (547745/547745)
Cleaning commits completed in 268,733 ms.
...
Updating 288 Refs
-----------------
real    5m13.075s
user    10m37.739s
sys 1m47.214s

5 seconds for the clone, 5m13s for the bfg-run, total 5m18s.

Check the original repo and it is untouched. Check the object-store in linux-bfg-test-run and note that lots of objects have been unpacked due to the bfg rewrite. Run the recommended git reflog expire --expire=now --all && git gc --prune=now --aggressive followed by a git repack -a -d and note that new packs have been created, and the hard-link count on the original repo's packs has dropped from 2 to 1.
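
The hard-link counts are easy to watch, e.g. with GNU coreutils:

# the link count is the first column: 2 while the test clone still shares the pack, 1 once it repacks
$ stat -c '%h %n' linux/.git/objects/pack/*.pack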

Compare this to a plain run on the repo:

~/git/linux $ time java -jar ~/Downloads/bfg-1.12.5.jar -b 1M
...
Cleaning
--------

Found 547745 commits
Cleaning commits:       100% (547745/547745)
Cleaning commits completed in 258,862 ms.

Updating 288 Refs
-----------------
...
real    5m1.818s
user    10m23.116s
sys 1m41.897s

About the same, 5m and change. So for the low-cost (5 seconds) of a local clone, you can do a real-test-run rather than a dry-run/report. Of course Windows users, not having hardlinks available, would have to wear the extra time and space cost of the initial clone. Also of course, you will have to pay for the disk-space usage as bfg writes to the test-repo.

So it feels like a native --dry-run option would only be attractive if, by virtue of being a dry run, it could save a lot of bfg execution time - and that saving would have to be weighed against the benefit of getting a real look at the output. I know that I would much prefer to see/test/inspect real output before I run this for real, rather than rely on a dry-run report.

Zitrax commented 8 years ago

I was also looking for a dry run. When I didn't find one, I hoped the real run would print out some info, but I only saw a list of updated refs. It would be very helpful if it also printed exactly which files/folders were deleted. (I was using --delete-folders.)

Tails commented 7 years ago

My repo is huge and it would be nice to not have to make a copy of it. Very scared to run without a dry run!

javabrett commented 7 years ago

See also my comment above. To summarize: test local clones are incredibly cheap; it is necessary to actually run BFG to really see what it will achieve; ergo it is better to actually run it, and there is dubious value in a dry-run mechanism.

ghost commented 7 years ago

> test local clones are incredibly cheap...

I like this feature idea because sometimes ^^^ is not as true as we'd like -- I'm trying to remove junk data from a repo that is 3.9 GB when checked out. (Full disclosure: I didn't do it! I'm trying to fix it : )

> ...dubious value in a dry-run mechanism

I admit that even though this feature would be nice, it definitely falls under the "nice to have" category.

lovesegfault commented 5 years ago

I'd like to reinforce the need for a dry-run mode; I'm cleaning up a massive repo, and it's painful to have to clone it twice to use bfg.

javabrett commented 5 years ago

@lovesegfault so we can put numbers to this ... what are the timings if you a) first clone remote to local, then b) second-clone local to local, perhaps allowing hard-links?

lovesegfault commented 5 years ago

@javabrett First clone takes a good 30mins

fabb commented 5 years ago

Maybe try git clone --reference for the second clone, and run bfg on that second one as the test? (I.e. don't run bfg on the first one, so you can make another reference clone from it again later.)
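
Something like the following - a sketch, with a placeholder URL; note the usual git clone --reference caveat that the borrowing repo breaks if the reference repo is pruned:

$ git clone --mirror git@example.com:big/repo.git pristine.git   # the slow clone, done once, kept untouched
$ git clone --reference "$PWD/pristine.git" \
      git@example.com:big/repo.git bfg-test                      # cheap: borrows objects from pristine.git
# run bfg on bfg-test only; don't gc/prune pristine.git while bfg-test exists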

builder-main commented 3 years ago

Well, when working on repos of tens of gigs (like Unity/Unreal projects), you'd be happy to have a --dry-run that saves lots of time. Meanwhile, we'll try the --reference option.