googleapis / google-cloud-java

Google Cloud Client Library for Java
https://cloud.google.com/java/docs/reference
Apache License 2.0

Remove binary blobs from history? #11

Closed mbrukman closed 9 years ago

mbrukman commented 9 years ago

The repo is currently 274MB when cloned, and just the src tree (i.e., excluding past versions in .git) is 85MB — the actual code takes up negligible space, and the vast majority of that space is taken up by src/test/resources/gcd-head.zip.

This makes it take an unusually large amount of time to clone the repo, and it will likely grow every time this blob is updated to a new version.

Would it be possible to not version these large files (and ideally remove them from history while there are few forks), and get them from Maven or other sources, pinned to a specific version?

aozarov commented 9 years ago

I am not sure how to exclude past versions of a file (and was unaware that they are actually pulled when the repo is cloned). I don't think we are likely to have many changes in gcd-head, but the plan is to depend on the gcloud SDK, which in the future will include the gcd tool and will hopefully be available via Maven.

ludoch commented 9 years ago

I think it's about time to do this cleanup before we are too successful :-) Never done that, but I guess http://git-scm.com/book/en/v2/Git-Internals-Maintenance-and-Data-Recovery#Removing-Objects will help?

aozarov commented 9 years ago

This sounds dangerous (and I wonder whether such a change, which is not a changelist, could then be applied back to the GitHub repository...).

In any case, I followed the directions there on my local branch.

$ git verify-pack -v .git/objects/pack/pack-0c7478248833c2e7e29bb2bb0de8670669d2482c.
pack-0c7478248833c2e7e29bb2bb0de8670669d2482c.idx  pack-0c7478248833c2e7e29bb2bb0de8670669d2482c.pack

$ git verify-pack -v .git/objects/pack/pack-0c7478248833c2e7e29bb2bb0de8670669d2482c.idx | sort -k 3 -n | tail -3
997af117c72074594a71ed6b141f9675709ec948 blob 931373 52784 90364920
484475ed6fc310dc3eaf115309719344fcda6791 blob 89098873 89111817 273395
ac05237a254e2892d0eb3b497d12f54515df775c blob 99807981 99811510 127031733

$ git rev-list --objects --all | grep ac05237a254e2892d0eb3b497d12f54515df775c
ac05237a254e2892d0eb3b497d12f54515df775c src/test/resources/gcd-v1beta2-rev1-2.1.1.zip

(gcd-v1beta2-rev1-2.1.1.zip is what we want to prune).

$ git log --oneline --branches -- src/test/resources/gcd-v1beta2-rev1-2.1.1.zip
e68feca replace local gcd hack with a fixed version of gcd
c79887b Automatically invoke gcd for testing

$ git filter-branch --index-filter 'git rm --ignore-unmatch --cached src/test/resources/gcd-v1beta2-rev1-2.1.1.zip' -- c79887b^..
$ rm -Rf .git/refs/original
$ rm -Rf .git/logs
$ git prune --expire now
$ git count-objects -v

But I could not see any difference (I verified and repeated the steps, but still no change).

I then found this: https://help.github.com/articles/remove-sensitive-data/ and used it to clean up the old gcd-v1beta2-rev1-2.1.1.zip:

$ git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch src/test/resources/gcd-v1beta2-rev1-2.1.1.zip' \
  --prune-empty --tag-name-filter cat -- --all

But this created a scary/nasty change-list per forked branch (I tried it in my fork) touching all files (this may be safe, but it was too scary for me, so I dropped it). Not to mention that for this to work, all forks have to be rebased (rather than merged) to avoid re-introducing the object again...

aozarov commented 9 years ago

Any more suggestions on how to remove src/test/resources/gcd-v1beta2-rev1-2.1.1.zip which was introduced by this changelist and later got replaced with src/test/resources/gcd-head.zip by this changelist?

jgeewax commented 9 years ago

@ryanseys : Any input here? Is there a "best practice way" to remove large binaries from history? The link @ludoch provided seems pretty straightforward.

We'll be totally re-writing history (which sucks, I know) and therefore fucking over the 11 forks that exist today, but I think making 11 people unhappy is better than having giant binaries hanging around in the git repository...

jgeewax commented 9 years ago

@aozarov

But this created a scary/nasty change-list per forked branch (I tried it in my fork) touching all files (this may be safe but was too scary for me, so I dropped it).

I think that's supposed to happen. You're effectively re-writing history, no?

ryanseys commented 9 years ago

Yep, that GitHub link about removing sensitive data looks on point. You are rewriting all history when deleting the references to the files, so the SHAs will change and screw over any forks or clones (potentially way more than 11 people). But this is the only way to permanently remove large blobs from your repo.

jgeewax commented 9 years ago

Yep, sounds like a necessary evil :(

aozarov commented 9 years ago

Is it really a necessary evil? When I clone my fork, where I applied all this scary git magic, I get:

remote: Counting objects: 26043, done.
remote: Compressing objects: 100% (204/204), done.
remote: Total 26043 (delta 17), reused 0 (delta 0), pack-reused 25829
Receiving objects: 100% (26043/26043), 195.57 MiB | 42.28 MiB/s, done.
Resolving deltas: 100% (22834/22834), done.

When I clone the non-fork (GoogleCloudPlatform/gcloud-java) I get:

Cloning into 'gcloud-java'...
remote: Counting objects: 70531, done.
remote: Compressing objects: 100% (1818/1818), done.
remote: Total 70531 (delta 1666), reused 0 (delta 0), pack-reused 68705
Receiving objects: 100% (70531/70531), 224.72 MiB | 42.19 MiB/s, done.
Resolving deltas: 100% (65692/65692), done.

So the percentage difference is not that big, and honestly I am not comfortable doing it blindly (as my knowledge of git is very basic; I just follow the steps without really understanding them or their implications).

I still haven't managed to recover my fork, which keeps saying "This branch is 171 commits ahead of GoogleCloudPlatform:master" (without any file changes). Yes, I understand that it is re-writing history (though I would expect the rewritten history to leave me behind, not ahead).

Any advice on how to reset my fork to a fresh start would be much appreciated. After that (and after I know how to recover from it) I don't mind giving it another try (and letting someone who understands this review that pull request, or better yet do it themselves).

jgeewax commented 9 years ago

OK I think the bigger problem is that this library is 200 MB...

That's an order of magnitude larger than gcloud-python, which has been around for quite a bit longer:

Cloning into 'gcloud-python'...
remote: Counting objects: 16892, done.
remote: Compressing objects: 100% (139/139), done.
remote: Total 16892 (delta 91), reused 0 (delta 0), pack-reused 16753
Receiving objects: 100% (16892/16892), 29.53 MiB | 5.69 MiB/s, done.
Resolving deltas: 100% (12957/12957), done.

It also seems weird that the number of objects is so large. From the logs above, it looks like your fork had 26,043, whereas GCP/gcloud-java had 70,531? How did we get a 3x multiplier there? Are we sure there aren't more blobs floating around?

The following stats just don't look quite right to me...

Library        Services  Commits  Start date  Size
gcloud-java    2         259      Nov 2014    224.72 MiB
gcloud-python  3         1408     Jan 2014    29.53 MiB

ryanseys commented 9 years ago

Yeah there's a definite difference.

Before:

(screenshot: clone results before the cleanup)

After:

$ git clone https://github.com/ryanseys/gcloud-java.git ryanseys-gcoud-java

Results:

Cloning into 'ryanseys-gcoud-java'...
remote: Counting objects: 4017, done.
remote: Compressing objects: 100% (795/795), done.
remote: Total 4017 (delta 1994), reused 4017 (delta 1994), pack-reused 0
Receiving objects: 100% (4017/4017), 788.63 KiB | 0 bytes/s, done.
Resolving deltas: 100% (1994/1994), done.
Checking connectivity... done.

jgeewax commented 9 years ago

@ryanseys : Which commits / blobs did you chop out?

ryanseys commented 9 years ago

src/test/resources/gcd-v1beta2-rev1-2.1.1.zip
src/test/resources/gcd-head.zip

ryanseys commented 9 years ago

You don't really chop out commits; you chop out files (listed above) from the index, and as a result you affect every commit that follows the earliest commit that added those files to the index.

How to reproduce:

Step 0. Fork and clone repo.

git clone git@github.com:{{username}}/gcloud-java.git

Step 1. Remove src/test/resources/gcd-v1beta2-rev1-2.1.1.zip from index.

git filter-branch --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch src/test/resources/gcd-v1beta2-rev1-2.1.1.zip' --tag-name-filter cat -- --all

Step 2. Remove src/test/resources/gcd-head.zip from index.

git filter-branch --force --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch src/test/resources/gcd-head.zip' --tag-name-filter cat -- --all

Step 3. Ignore those files from now on, make a new commit.

echo 'src/test/resources/gcd-head.zip' >> .gitignore
echo 'src/test/resources/gcd-v1beta2-rev1-2.1.1.zip' >> .gitignore
git commit -am 'Remove files'

Step 4. Garbage collect outdated blobs and force push back to your fork.

(WARNING THIS WILL REWRITE HISTORY IN YOUR FORK)

git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now
git push origin --force --all

Step 5. Clone it under a new name, enjoy the speed up!

git clone git@github.com:{{username}}/gcloud-java.git gcloud-java-new

Let me know if this doesn't work for you!

jgeewax commented 9 years ago

Rather than Step 4 (delete fork, new repo), could you just do git push origin -f? Or no?

ryanseys commented 9 years ago

I tried that (and you may try it too) but it didn't seem to work for me (i.e. when I cloned again, it was still huge). I'm not entirely sure why.

jgeewax commented 9 years ago

Bummer, so we're going to have to delete the gcloud-java repo and re-create it? Maybe there's a "vacuum" thing for GitHub...?

ryanseys commented 9 years ago

Sorry, I believe you can accomplish this by garbage collecting the blobs:

git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now

I've updated the steps above to reflect this. Let me know if that works for you :)

aozarov commented 9 years ago

The issue is that we do need/use "src/test/resources/gcd-head.zip". It is used for unit testing and for running the Datastore example. We are in discussions with the Datastore people to get it into Maven, and then we can just add it as a dependency rather than keep it as part of the repository (though when you think about it, users will still need to download it, if not by cloning the repository then via Maven the first time they run the tests... but it is still better for many other reasons, including version history).

Until then, the only thing we can remove is "src/test/resources/gcd-v1beta2-rev1-2.1.1.zip" (which is the older version of gcd.sh). If you think that is still worth it, I will follow the steps above (0-5) to remove just "src/test/resources/gcd-v1beta2-rev1-2.1.1.zip" now, and we can deal with gcd-head.zip later.

Also (per @jgeewax's comment), I have no idea why the fork has many fewer objects than GCP/gcloud-java (will these steps help in that regard too?)

ryanseys commented 9 years ago

Well, if you're re-writing history, you don't want to do it more than once (ideally never), so you shouldn't do it again later when you decide gcd-head.zip belongs in Maven; you either remove it now or never.

Yes, the number of objects dropped to a total of 4017 once I removed both of those .zip files. You can see that here.

I would suggest you focus on moving them into a better place (e.g. maven) and rewriting this repo to remove them from here.

jgeewax commented 9 years ago

What is in gcd-head.zip? You mentioned it's needed for unit tests, etc. -- however, none of the other libraries need this.

Is it the local development server for Datastore? If so, we shouldn't be running unit tests against a process like that, right? (Our other datastore implementations don't do this AFAIK.)

Assuming we absolutely need this file, would it be wrong to put it into GCS and download it as part of the build and test process? I.e.:

$ wget http://ourbucket.googleapis.com/gcd-head.zip
$ # Now run the unit tests...?

aozarov commented 9 years ago

Yes, gcd-head.zip is the local development server for Datastore (which we are trying to get into Maven). I think unit-testing against mocks (at all levels) is a bigger compromise (which I may need to accept for Storage, as there is no local development server for it yet), especially for a service with such rich functionality. When I added it to GitHub I didn't think it was a good idea, but I considered it a reasonable temporary solution (as I didn't consider the size of the repository a real problem). I guess I was wrong. Considering that, I think your idea of consuming it directly from a public place is a good one. I will change my test-helper to do that (and install it locally for subsequent runs). I will do that right after my attempt to solve issue #70 (to avoid conflicts).

jgeewax commented 9 years ago

It seems weird to me that unit tests would require a separately running local server. For higher-level tests (say, integration tests?) I'd expect that... but for unit tests, what we're checking is that "we're calling the API the right way", right?

aozarov commented 9 years ago

If you consider calling the API "the right way" to mean verifying that the signature is correct, then nothing needs to be done for Java (or other statically typed languages). If you want to validate that the content is correct and that you are handling the (correct) response correctly, then you have to mimic the service behavior, and mocking it is a compromise.

jgeewax commented 9 years ago

I'm not saying check the method signatures. I'm saying check that a) we are sending a request, and b) the request we're sending (a protobuf?) is what we meant to send.

So a good test would be:
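(A minimal sketch of that kind of test, for illustration only, using JUnit 4 and Mockito; DatastoreRpc, LookupRequest, and the Datastore wrapper below are hypothetical stand-ins for this sketch, not gcloud-java's actual types.)

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.Test;
import org.mockito.ArgumentCaptor;

public class DatastoreRequestTest {

  // Hypothetical RPC boundary and request type, just for the sketch.
  interface LookupRequest { String key(); }
  interface DatastoreRpc { void lookup(LookupRequest request); }

  // Hypothetical client code under test: builds a request and hands it to the RPC layer.
  static class Datastore {
    private final DatastoreRpc rpc;
    Datastore(DatastoreRpc rpc) { this.rpc = rpc; }
    void get(final String key) {
      rpc.lookup(new LookupRequest() {
        @Override public String key() { return key; }
      });
    }
  }

  @Test
  public void getSendsLookupForRequestedKey() {
    DatastoreRpc rpc = mock(DatastoreRpc.class);
    new Datastore(rpc).get("user:123");

    // (a) a request was sent, and (b) it carries the key we meant to send.
    ArgumentCaptor<LookupRequest> captor = ArgumentCaptor.forClass(LookupRequest.class);
    verify(rpc).lookup(captor.capture());
    assertEquals("user:123", captor.getValue().key());
  }
}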

aozarov commented 9 years ago

See also comments in pull request #76

If you provide the code that validates that the proto your method created is "what we meant to send", then your code and your test can both pass while the code still fails in prod, because the code and the tests were wrong in their assumptions (e.g. they can both miss a required value/condition).

jgeewax commented 9 years ago

Absolutely. That's what integration tests are for. To test our integration assumptions... right?

aozarov commented 9 years ago

Do you have integration tests that cover all functionality and the various inputs?

Also, as we want the community to be part of this effort, do we have a way for contributors to invoke these integration tests?

There are many tests in the code that do not interact with the RPC layer and they clearly are not using the local service.

I disagree with the statement that "That's what integration tests are for", and for that reason many services provide a local (mostly in-memory/in-process) implementation (see my AE example). Integration tests should normally be done against the real service (and they typically tend to be at a higher level and are not configured to try all the various in/out conditions).

I think that if you have a local implementation of a service, you should use it in your unit tests and save the time spent mimicking it, with the benefit of being more confident that your code's assumptions are correct.

jgeewax commented 9 years ago

Do you have integration tests that cover all functionality and the various inputs?

That's a code coverage question... I don't have the answer off the top of my head. I suppose it continues to improve.

Also, as we want the community to be part of this effort do we have a way for contributors to invoke these integration tests?

Run against an existing project? If they're contributing, chances are they have things set up already. This raises another question that @jskeet brought up, which was "Can't we make it free for them to run these tests?", which I don't have an answer for yet. The free way is to pass unit tests, then get Travis to run the integration tests.

I disagree with the statement that "That's what integration tests are for"

OK, so what is the difference in unit versus integration tests?

for that reason many services provide a local (mostly in-memory/in-process) implementation.

Service            Local-impl for admin API   Local-impl for dev API
App Engine         No                         Yes
Compute Engine     No                         No
Container Engine   Yes?                       Yes (Kubernetes)?
Cloud Storage      No                         No
Cloud Bigtable     No                         Yes (HBase)
Cloud Datastore    No                         Yes (gcd.sh)
Cloud SQL          No                         Yes (Any SQL)
Cloud DNS          No                         No
BigQuery           No                         No
Cloud Pub/Sub      No                         No
Totals             9 No, 1 Yes                5 No, 5 Yes

Looking at the table differently:

Service            Local-impl for API client we're building?
App Engine         n/a
Compute Engine     No
Container Engine   n/a
Cloud Storage      No
Cloud Bigtable     No
Cloud Datastore    Yes
Cloud SQL          No
Cloud DNS          No
BigQuery           No
Cloud Pub/Sub      No
Totals             1 Yes, 7 No, 2 n/a

I think that if you have a local implementation of a service, you should use it in your unit tests and save the time spent mimicking it, with the benefit of being more confident that your code's assumptions are correct.

I'd agree with this if the local implementation was in-process. See https://cloud.google.com/appengine/docs/python/tools/localunittesting

If users have to spin up a separate process that services requests and handles state, I think that's an integration test, not a unit test.
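(For reference, the Java flavor of that in-process App Engine testing support looks roughly like this; a sketch based on the appengine-testing utilities, not on anything in this repo.)

import com.google.appengine.tools.development.testing.LocalDatastoreServiceTestConfig;
import com.google.appengine.tools.development.testing.LocalServiceTestHelper;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class InProcessDatastoreTest {

  // In-process, in-memory Datastore; no separate server process to manage.
  private final LocalServiceTestHelper helper =
      new LocalServiceTestHelper(new LocalDatastoreServiceTestConfig());

  @Before
  public void setUp() {
    helper.setUp();
  }

  @After
  public void tearDown() {
    helper.tearDown();
  }

  @Test
  public void datastoreIsAvailableWithoutAnExternalProcess() {
    // DatastoreServiceFactory.getDatastoreService() can be used here directly.
  }
}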

aozarov commented 9 years ago

OK, so what is the difference in unit versus integration tests?

Integration tests should preferably run against production services, use more realistic data, and normally cover a higher-level view (e.g. correlating data from various input sources).

I'd agree with this if the local implementation was in-process.

I don't think that should matter (as long as things are done transparently to the user, which is what the LocalGcdHelper is trying to do). In the same way, it should not matter where and how the local implementation stores the data. The users do not "spin up a separate process" directly; that is done for them by the test helper, and I think that is fine (even though it may change in the future). Please try running the tests to get a feel for it yourself.
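(A minimal sketch of that pattern, assuming a hypothetical helper class; the class name, method names, and gcd command line below are illustrative, not the actual LocalGcdHelper API.)

import java.io.IOException;

public class LocalServerHelper {

  private Process process;

  // Started once from the test setup (e.g. a @BeforeClass method), so individual
  // tests never manage the server process themselves.
  public void start() throws IOException {
    process = new ProcessBuilder("sh", "gcd.sh", "start", "test-dataset")
        .inheritIO()
        .start();
  }

  // Stopped from the test teardown (e.g. an @AfterClass method).
  public void stop() {
    if (process != null) {
      process.destroy();
    }
  }
}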

Most services I spoke with agree that they should have a local implementation, and the reason they don't have one yet is technical, not philosophical.

jgeewax commented 9 years ago

The users do not "spin up a separate process" directly that is done for them by the test helper and I think that is fine (even though it may change in the future). Please try running the tests to get the feel yourself.

OK, so let's say we're going to run this separate process transparently for unit testing (though I still don't entirely agree). Should the work to download the local server and kick off the unit tests be done in Java? Or should we have a test-running script that does this for us? (.cmd for Windows, .sh for Mac/Linux)

Most services I spoke with agree that they should have a local implementation, and the reason they don't have one yet is technical, not philosophical.

I agree.

aozarov commented 9 years ago

Currently the download is done transparently by the LocalGcdHelper (which also keeps a cached copy in a temporary directory; the cache serves all the tests that need it within one "mvn test" run and can be reused by future "mvn test" runs).

I could have done the fetch with Maven before invoking the tests, but I am not sure why that would be better. Currently it is self-contained. The download is a little less than 40MB, and its impact on any reasonable network (even at 1 Mbps) is insignificant (especially compared to Maven's own network usage).
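(A minimal sketch of the fetch-if-not-cached behavior described above; the download URL and cache location are placeholders for this sketch, not the helper's real ones.)

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class GcdDownloader {

  // Placeholder URL; the real helper would point at wherever gcd-head.zip is published.
  private static final String GCD_URL =
      "https://storage.googleapis.com/example-bucket/gcd-head.zip";

  // Returns the cached zip, downloading it into the temp directory on first use
  // so later "mvn test" runs can skip the download.
  public static Path fetchIfNotCached() throws IOException {
    Path cached = Paths.get(System.getProperty("java.io.tmpdir"), "gcd-head.zip");
    if (!Files.exists(cached)) {
      try (InputStream in = new URL(GCD_URL).openStream()) {
        Files.copy(in, cached, StandardCopyOption.REPLACE_EXISTING);
      }
    }
    return cached;
  }
}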

BTW, one clear advantage of having the local impl as a separate process is that you only need to build one and can use it for all languages (creating only thin stubs per language).

jgeewax commented 9 years ago

I could have done the fetch with Maven before invoking the tests, but I am not sure why that would be better.

Seems like we'd have to write less code, right? Sort of like how when I run unit tests in Python and it fetches dependencies... I wouldn't fetch a binary dependency using Python code, I'd put it in a shell script or declare it as a dependency or something like that.

It seems odd that we'd be writing Java code to do a bunch of pre-test downloading of a binary dependency...

aozarov commented 9 years ago

I doubt very much that writing the fetch-if-not-cached/caching part and making it platform independent would take fewer lines of code/configuration in Maven (which I assume would need to invoke Ant and use a combination of its tasks for this logic), but I am not a Maven expert and could be wrong about it.

We can create an issue for it if you want, though I hope that eventually the local gcd will be a Maven artifact, in which case we would just add it as a dependency and remove the fetch logic entirely.

aozarov commented 9 years ago

Also, another point (that was mentioned before) which is a plus of having it as part of the test util (LocalGcdHelper) is that people can use it as-is and would not need to reproduce the steps from our Maven pom in their own tests.

aozarov commented 9 years ago

I followed the pruning steps described by @ryanseys; see pull request #87.

aozarov commented 9 years ago

@jboynes any updates?

aozarov commented 9 years ago

I have applied the following steps (provided above by @ryanseys) on the repository:

git filter-branch --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch src/test/resources/gcd-v1beta2-rev1-2.1.1.zip' --tag-name-filter cat -- --all
git filter-branch --force --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch src/test/resources/gcd-head.zip' --tag-name-filter cat -- --all
echo 'src/test/resources/gcd-head.zip' >> .gitignore
echo 'src/test/resources/gcd-v1beta2-rev1-2.1.1.zip' >> .gitignore
git commit -am 'Remove files'
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now
git push origin --force --all

I have not seen any change in size yet, but as @ryanseys suggested in #87, GitHub's gc is not immediate, and it may take a few days (and a few commits to trigger it) before we see the effect.

aozarov commented 9 years ago

4 days have passed (and several commits), and I still don't see a meaningful change (~200MB instead of ~240MB).

see:

$ git clone https://github.com/GoogleCloudPlatform/gcloud-java
Cloning into 'gcloud-java'...
remote: Counting objects: 71973, done.
remote: Compressing objects: 100% (536/536), done.
remote: Total 71973 (delta 452), reused 0 (delta 0), pack-reused 71433
Receiving objects: 100% (71973/71973), 225.91 MiB | 1.93 MiB/s, done.
Resolving deltas: 100% (66605/66605), done.
Checking connectivity... done.

jgeewax commented 9 years ago

@ryanseys : Any thoughts here?

@aozarov : Are we sure those big files were removed from history? I.e., can you go back to the commit and see that they're actually missing?

ryanseys commented 9 years ago

The sure-fire way to get rid of it would be to delete this repo on GitHub's side and create a new one with the same name, then push the local version that has been proven to be small back up to GitHub. I'm not sure why their cache hasn't cleared. Someone could also contact GitHub support.

aozarov commented 9 years ago

@jgeewax - I checked several commits that involved these big files and could not find the files there (neither in the list of committed files nor in the state of the repository at that commit). Can we consult the GitHub people?

jgeewax commented 9 years ago

Just pinged GH, will let you know what I hear.

aozarov commented 9 years ago

We identified that the cleanup was not complete: the directions were given before the module refactoring, which also saved a reference to the deleted blob under gcloud-java-datastore/src/test/resources/gcd-head.zip.

I applied the following commands to delete it as well:

git filter-branch --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch gcloud-java-datastore/src/test/resources/gcd-head.zip' --tag-name-filter cat -- --all
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now
git push origin --force --all

I verified the pruning by pushing the content of the pruned local repo to a new remote repository, and indeed the size was reduced significantly (to less than 1MB), so I then force-pushed the changes.

Unfortunately, after the push and after asking GitHub support to gc the repo, the size still didn't change much, and it seems that there is something specific to GitHub that causes this pruning not to work for us as expected.

Next, I tried using the BFG tool as suggested here.

java -jar bfg-1.12.3.jar --strip-blobs-bigger-than 1M gcloud-java.git
cd gcloud-java.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

After that, a cloned repo was ~45MB in size (which, though still big, is significantly smaller than the original size).

aozarov commented 9 years ago

The last clone was only 12MB, so it looks like it keeps shrinking now (and maybe asking GH support one more time to gc would not be a bad idea). I think we can close/resolve this issue now, right?

jgeewax commented 9 years ago

I'd say yes :)

Thanks, Arie!