Reduce size of openforcefield git checkouts by clearing out large deleted files from git history

jchodera commented 7 years ago

Checking out openforcefield from GitHub now pulls down over 200 MB in over 5000 files. What happened here?

davidlmobley commented 7 years ago

I don't see anything "sudden" which is big; from what I can tell from a cursory look it is a bunch of "modest" sized molecule sets and other test sets (MiniDrugBank, a ZINC subset, etc.) we use to examine coverage, plus the AlkEthOH set and other related data. Most of these are single compressed files containing multiple molecules; I'm not seeing anything obviously bizarre (like 4000 mol2 files for individual molecules or some such). I don't have time to dig more at the moment.

jchodera commented 7 years ago

It looks like there's currently ~20M of examples/ and ~100M of utilities/:

lski1962:openforcefield choderaj$ du -sh *
4.0K    LICENSE
8.0K    README.md
 28K    The-SMIRNOFF-force-field-format.md
 28K    devtools
 20M    examples
8.0K    oe_license.txt.enc
 39M    openforcefield
 44K    openforcefield.egg-info
4.0K    rdkit
4.0K    setup.py
 96M    utilities

Lots of these are in utilities/filter_molecule_sets:

lski1962:filter_molecule_sets choderaj$ du -sh *
 28M    DrugBank.sdf
3.1M    DrugBank_CHO_atyped.mol2
 19M    DrugBank_atyped.oeb
2.7M    DrugBank_updated_ff.mol2.gz
2.6M    DrugBank_updated_tripos.mol2.gz
1.5M    MiniDrugBank_ff.mol2
1.1M    MiniDrugBank_ff_withGenerics.mol2
1.5M    MiniDrugBank_tripos.mol2
1.1M    MiniDrugBank_tripos_withGenerics.mol2
8.0K    README.md
4.0K    elements_exclude.txt
2.7M    ff_test.mol2.gz
 12K    filter_molecule_sets.py
 16K    pickMolecules.ipynb
4.0K    remove_smirks_CHO.smarts
4.0K    remove_smirks_simple.smarts
2.6M    tripos_test.mol2.gz
3.9M    updated_DrugBank.mol2.gz

What if we moved some of these larger groups of molecule sets to external repos that we can import for testing?

I'm just concerned that someone who wants to grab the openforcefield toolkit may not need 200M of files just to assign SMIRNOFF parameters.
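
For anyone who wants to repeat the du survey above without a Unix shell, here is a minimal Python sketch. The function name `largest_files` and its parameters are made up for illustration and are not part of the toolkit. It only looks at the working tree and skips `.git`, so it says nothing about what is stored in history.

```python
import os


def largest_files(root, top_n=10, skip_dirs=(".git",)):
    """Return the top_n largest files under root as (size_bytes, relative_path).

    Hypothetical helper for illustration; not part of openforcefield.
    """
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks; they may dangle or point outside root
            sizes.append((os.path.getsize(path), os.path.relpath(path, root)))
    # Largest first.
    sizes.sort(reverse=True)
    return sizes[:top_n]
```

Calling `largest_files("utilities")` from the repository root should surface the same kind of offenders as the listing above, with sizes in bytes rather than du's rounded block sizes.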

jchodera commented 7 years ago

Alternatively, I suppose we could just be parsimonious about what we actually install/package inside of conda packages.

hjuinj commented 7 years ago

It might be me: I think one of the notebooks on my branch in /examples/forcefield_modification/ was quite big. I have now shrunk it.

davidlmobley commented 7 years ago

Seems like the right way to deal with this is to not package/install things which are not needed. Someone who doesn't want to do more than assign parameters doesn't need all the molecule sets which are for testing/development/etc., for example.

bannanc commented 7 years ago

My most recent pull request removed most of the files in filter_molecule_sets if that helps.

jchodera commented 7 years ago

I think we might have to clear out those files from the git history: https://stackoverflow.com/questions/2100907/how-to-remove-delete-a-large-file-from-commit-history-in-git-repository
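
Before rewriting anything, it may help to see which blobs in the history are actually large. The sketch below wraps the usual `git rev-list --objects --all` piped into `git cat-file --batch-check` combination. The function names and the 1 MB default threshold are assumptions chosen for illustration. It lists every large blob reachable from any ref, including files that still exist in the current checkout, so the output would still need to be compared against the working tree to find the deleted ones.

```python
import subprocess


def parse_batch_check(output, min_bytes=1_000_000):
    """Parse 'type sha size path' lines and keep blobs of at least min_bytes.

    Returns (size_bytes, sha, path) tuples, largest first.
    """
    blobs = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        # Skip commits, trees, tags, and '<sha> missing' lines.
        if len(parts) < 3 or parts[0] != "blob":
            continue
        size = int(parts[2])
        path = parts[3] if len(parts) > 3 else ""
        if size >= min_bytes:
            blobs.append((size, parts[1], path))
    blobs.sort(reverse=True)
    return blobs


def large_blobs_in_history(repo=".", min_bytes=1_000_000):
    """List large blobs reachable from any ref in the repository at `repo`."""
    # Every object reachable from any ref, one '<sha> <path>' per line.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Ask git for type and size of each object; %(rest) echoes the path back.
    details = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(details, min_bytes)
```

Running `large_blobs_in_history(".")` inside a full clone gives a candidate list to hand to whichever history-rewriting tool is chosen. Rewriting history changes commit hashes, so everyone with an existing clone or fork would need to re-clone afterwards.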

jaimergp commented 5 years ago

@t-kimber and I recently used BFG Repo Cleaner (mentioned in the SO question) with great success. It's a Java program, but easy to use.

davidlmobley commented 5 years ago

I've used that before too.

mattwthompson commented 3 years ago

This remains a minor nuisance when needing to pip install git+git://github.com/openforcefield/openff-toolkit.git@param-iter since it clones the repo with all of its history and pip does not support shallow cloning. It takes about 2 minutes on my mediocre residential internet, but still about 20-30 seconds on CI machines that probably have 1-10 gigabit connections. Not the slowest step in CI builds but it adds up when all workflows need to run it.

jaimergp commented 3 years ago

You can do pip install https://github.com/openforcefield/openff-toolkit/archive/param-iter.tar.gz I think!

mattwthompson commented 3 years ago

Oh, hey, that works great! I had assumed that GitHub didn't make/have archives for all branches (i.e. feature branches that are not tagged) but I was wrong. That drops the install time to just a few seconds.

jaimergp commented 3 years ago

AFAIK the "filename" can be any git ref, so even commit hashes will work.
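
For CI scripts that need this for arbitrary branches, the URL pattern from the comments above can be wrapped in a small helper. This is only a sketch: `github_archive_url` is a made-up name, and the assumption that any ref can be substituted into the path comes from this thread rather than from a documented guarantee.

```python
from urllib.parse import quote


def github_archive_url(owner, repo, ref):
    """Build a GitHub tarball URL for a branch, tag, or commit hash.

    Hypothetical helper; the URL layout follows the example in this thread.
    """
    # Percent-encode unusual characters but keep '/' so refs like 'feature/x' survive.
    return f"https://github.com/{owner}/{repo}/archive/{quote(ref, safe='/')}.tar.gz"


url = github_archive_url("openforcefield", "openff-toolkit", "param-iter")
print(f"pip install {url}")
```

Installing from the tarball skips the git clone entirely, so the size of the repository history no longer affects install time.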

openforcefield/openff-toolkit, issue #43