jonluca / Anubis

Subdomain enumeration and information gathering tool
https://jonlu.ca/anubis/
MIT License
1.18k stars, 151 forks

reduce repo size for quick clones #31

Closed wbob closed 5 years ago

wbob commented 5 years ago

When cloning the repo over a low-bandwidth connection, I wondered about the transfer size relative to the working tree size. When preparing the snap in #22, you must have accidentally added the package blob to the repository and then later removed it. I think there is an argument for a quick clone, even without --depth 1. Removing the snap and a Python binary from the history reduces the size from 19M to 3M, and there are other artifacts too.
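To put numbers on the transfer size versus working tree size comparison, something like the following could be used. It is only a sketch: `history_kib` and `worktree_kib` are hypothetical helper names, and it assumes a regular (non-bare) clone with a `.git` directory.

```shell
# Sketch: compare Git object storage against the checked-out files.
# history_kib and worktree_kib are hypothetical helper names.

# Sum loose ("size") and packed ("size-pack") object storage, in KiB.
history_kib() {
  git -C "${1:-.}" count-objects -v \
    | awk '/^size(-pack)?:/ { total += $2 } END { print total + 0 }'
}

# Disk usage of the working tree in KiB, excluding the .git directory.
# Assumes a non-bare repository.
worktree_kib() {
  (
    cd "${1:-.}" || exit 1
    total=$(du -sk . | cut -f1)
    gitdir=$(du -sk .git | cut -f1)
    echo $((total - gitdir))
  )
}

# Usage, from inside a clone:
#   echo "history: $(history_kib) KiB, tree: $(worktree_kib) KiB"
```

A history figure many times larger than the tree figure suggests that large files were committed at some point and later deleted.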

I can recommend the one-liner at stackoverflow.com#42544963 to list the biggest blobs, and then the Java utility bfg to remove them by size or path. It's a simpler alternative to git filter-branch.

The disadvantage of the removal is that every commit hash from the introduction of each blob onward gets rewritten; bfg will output a graph of this. People who have forked from one of these commits would need to hard reset their forks and rebase. In my opinion this is not an issue for a smaller codebase, where most people will prepare a PR from the latest origin/HEAD.
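For anyone with an existing clone or fork, the hard reset could look roughly like this. It is a sketch, not a tested recipe for this repo: `resync_branch` is a hypothetical helper, and the `origin` remote and `master` branch defaults are assumptions.

```shell
# Sketch: move the checked-out branch onto the rewritten upstream history.
# resync_branch is a hypothetical helper; "origin" and "master" are
# assumed defaults, pass other names as arguments if needed.
# WARNING: discards uncommitted changes and local-only commits on the
# current branch.
resync_branch() {
  remote="${1:-origin}"
  branch="${2:-master}"
  # The default fetch refspec force-updates remote-tracking branches,
  # so the rewritten history is picked up here.
  git fetch --quiet "$remote" || return 1
  git reset --quiet --hard "$remote/$branch"
}

# Usage, from inside the clone with the target branch checked out:
#   resync_branch origin master
```

Local work that sits on top of the old history would still need to be rebased onto the new commits afterwards, for example with `git rebase --onto`.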

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest \
| tac | head -n20
827f720a141a   15MiB anubis_0.8.3+git1.cbabacf_amd64.snap
5d2ef13a6db4  4,3MiB prime/usr/bin/python3.5
cbbc03be80b6  1,4MiB anubis/common_subdomains.txt
6fc2f1710815  117KiB prime/usr/share/doc/python3.5/NEWS.gz
java -jar bfg.jar --delete-files 'anubis_0.8.3+git1.cbabacf_amd64.snap'
java -jar bfg.jar --delete-files 'python3.5'
Earliest                                              Latest
|                                                          |
...................................................Dmmmmmmmm

D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)

Down to 3344K.

optional:

java -jar bfg.jar --delete-files 'common_subdomains.txt'
java -jar bfg.jar --delete-folders 'prime'
java -jar bfg.jar --delete-folders 'stage'
java -jar bfg.jar --delete-files 'temp.txt'

after those deletes:

git reflog expire --expire=now --all && git gc --prune=now --aggressive
du -cs .

With all deletes, the repo is down to 540K.
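To double-check that a deleted file is really gone from every ref, a small check along these lines might help. It is a sketch: `history_has_path` is a hypothetical helper name, and it only inspects objects reachable from refs, not the reflog.

```shell
# Sketch: succeed if any object reachable from any ref has a path
# containing the given fixed string. history_has_path is a hypothetical
# helper name. Run it from inside the repository.
history_has_path() {
  git rev-list --objects --all \
    | cut -s -d' ' -f2- \
    | grep -F -- "$1" > /dev/null
}

# Usage:
#   history_has_path '.snap' && echo "still present in history"
```

The `cut -s` step drops the object hashes, so the search string is matched against paths only.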

jonluca commented 5 years ago

This is great. I was actually wondering why the repo size had grown so much. I guess I mistakenly committed those a while back and they remained in the git history.

Thanks for the comprehensive issue. I'll run this right now.

jonluca commented 5 years ago

Repo size should now be down to a more manageable size. Thanks again!