Closed: jackjackk closed this 3 years ago
I have been wondering about this for a while, so thanks a lot for the suggestion. Do you know how this would affect any branches? Or can you simply merge master into, say, your new development branch to benefit from the smaller size?
Regarding binaries etc., in principle I agree, but having the data files for the examples in the tree is much more convenient for novice users. Otherwise they would have to download a separate set of data files and copy them to the right location (or do something like Altair, where the data files live in a separate repo; see https://altair-viz.github.io/getting_started/installation.html).
Finally gotten around to doing this. Worked like a charm. All the old large files are now removed and the repo size is down tremendously.
The current repository size is ~1.2GB. This seems to be mostly due to old binary/data files in the history.
For example, by using BFG (see https://help.github.com/en/articles/removing-sensitive-data-from-a-repository for a somewhat similar use case) to delete cPickle, tar.gz, bz2, and csv files (you might want to add more extensions; I have also seen some jars!), you can get down to ~227MB.
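For reference, the shape of that rewrite can be sketched with stock git alone. BFG itself is a separate Java download, so this self-contained demo uses the slower built-in `git filter-branch` instead (BFG's `--delete-files` does the same job much faster); the throwaway repo, file names, and sizes below are all made up for illustration:

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
# Throwaway repo: commit a large csv, then delete it, so the blob
# survives only in the history -- the situation described above.
rm -rf /tmp/bfg-demo && git init -q /tmp/bfg-demo && cd /tmp/bfg-demo
git config user.email demo@example.com && git config user.name demo
head -c 1000000 /dev/urandom > data.csv
git add data.csv && git commit -qm "add dataset"
git rm -q data.csv && git commit -qm "drop dataset"
# Rewrite every commit, dropping *.csv from each tree:
git filter-branch --index-filter \
  'git rm --cached -q --ignore-unmatch "*.csv"' -- --all
# Drop filter-branch's backup refs, expire reflogs, and repack,
# so the old blobs are actually freed:
git for-each-ref --format='%(refname)' refs/original/ |
  xargs -n1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git rev-list HEAD -- data.csv | wc -l   # prints 0: no commit touches it now
```

With BFG the equivalent step is a single pass over a `--mirror` clone, and BFG protects the files in HEAD by default, which matches the protected-files behaviour described below.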
In particular, the deleted files would be:
while the protected files (protected because they belong to the HEAD commit) would be:
Use caution when applying these changes to the GitHub repo with a push! Just in case, make sure you have a backup first. If you decide to proceed with the push, also ask anyone who might have push rights to clone the repository again (otherwise a push from one of their stale clones could reintroduce the old history).
You might also want to check all the binary files currently present in the repository (binaries are not well suited to being tracked for changes unless something like https://git-lfs.github.com/ is used) and do some cleanup before running the commands above.
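One way to do that audit is to list the largest blobs anywhere in the history with `git rev-list` piped into `git cat-file`. Shown here on a small throwaway repo so the snippet is self-contained; run the same pipeline inside the real repository:

```shell
set -e
# Throwaway repo with one big binary and one small text file:
rm -rf /tmp/audit && git init -q /tmp/audit && cd /tmp/audit
git config user.email demo@example.com && git config user.name demo
head -c 300000 /dev/urandom > model.cPickle
echo readme > README.md
git add . && git commit -qm snapshot
# Largest blobs anywhere in history, biggest first
# (columns: type, size in bytes, path):
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob"' | sort -k2 -nr | head -5
```

The top entries of that listing are exactly the candidates for deletion with BFG or for moving to LFS/external hosting.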
In general, you might want to consider a separate repository for larger datasets/binary files, or some other web/database hosting to refer to, as this will keep the code history more maintainable.
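One concrete shape for the separate-repository option is a git submodule: the code repo records only a pointer to a data repo, so large files never enter the code history. A minimal local sketch (local paths stand in for real remote URLs; all names are illustrative):

```shell
set -e
rm -rf /tmp/data /tmp/code
# The data repo holds the large example files:
git init -q /tmp/data && cd /tmp/data
git config user.email demo@example.com && git config user.name demo
head -c 100000 /dev/urandom > examples.tar.gz
git add . && git commit -qm "datasets"
# The code repo references it as a submodule:
git init -q /tmp/code && cd /tmp/code
git config user.email demo@example.com && git config user.name demo
echo "print('hi')" > analysis.py && git add . && git commit -qm "code"
# protocol.file.allow is only needed because the "remote" is a local path:
git -c protocol.file.allow=always submodule add /tmp/data data
git commit -qm "reference datasets as a submodule"
ls data/   # examples.tar.gz is available, but lives in the other repo's history
```

Users who do not need the examples can clone the code repo alone; novice users get everything with `git clone --recurse-submodules`, which addresses the convenience concern raised above.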