jchodera opened this issue 7 years ago
I don't see anything "sudden" which is big; from what I can tell from a cursory look it is a bunch of "modest" sized molecule sets and other test sets (MiniDrugBank, a ZINC subset, etc.) we use to examine coverage, plus the AlkEthOH set and other related data. Most of these are single compressed files containing multiple molecules; I'm not seeing anything obviously bizarre (like 4000 mol2 files for individual molecules or some such). I don't have time to dig more at the moment.
It looks like there's currently ~20M of `examples/` and ~100M of `utilities/`:
```
lski1962:openforcefield choderaj$ du -sh *
4.0K    LICENSE
8.0K    README.md
 28K    The-SMIRNOFF-force-field-format.md
 28K    devtools
 20M    examples
8.0K    oe_license.txt.enc
 39M    openforcefield
 44K    openforcefield.egg-info
4.0K    rdkit
4.0K    setup.py
 96M    utilities
```
Lots of these are in `utilities/filter_molecule_sets`:
```
lski1962:filter_molecule_sets choderaj$ du -sh *
 28M    DrugBank.sdf
3.1M    DrugBank_CHO_atyped.mol2
 19M    DrugBank_atyped.oeb
2.7M    DrugBank_updated_ff.mol2.gz
2.6M    DrugBank_updated_tripos.mol2.gz
1.5M    MiniDrugBank_ff.mol2
1.1M    MiniDrugBank_ff_withGenerics.mol2
1.5M    MiniDrugBank_tripos.mol2
1.1M    MiniDrugBank_tripos_withGenerics.mol2
8.0K    README.md
4.0K    elements_exclude.txt
2.7M    ff_test.mol2.gz
 12K    filter_molecule_sets.py
 16K    pickMolecules.ipynb
4.0K    remove_smirks_CHO.smarts
4.0K    remove_smirks_simple.smarts
2.6M    tripos_test.mol2.gz
3.9M    updated_DrugBank.mol2.gz
```
What if we moved some of these larger groups of molecule sets to external repos that we can import for testing?
I'm just concerned that someone who wants to grab the `openforcefield` toolkit may not need 200M of files just to assign SMIRNOFF parameters.
Alternatively, I suppose we could just be parsimonious about what we actually install/package inside of conda packages.
It might be me: I think one of the notebooks on my branch in `examples/forcefield_modification/` was quite big. I've now shrunk it.
Seems like the right way to deal with this is not to package/install things that aren't needed. Someone who only wants to assign parameters doesn't need all the molecule sets used for testing/development, for example.
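For the pip/sdist side, a minimal sketch of what that could look like (the exclude patterns are illustrative; note `MANIFEST.in` only controls the source tarball, so the conda recipe and any `package_data` would need the same treatment):

```bash
# Illustrative sketch: keep the big molecule sets out of the source distribution,
# then verify what actually lands in the tarball.
cat >> MANIFEST.in <<'EOF'
prune utilities/filter_molecule_sets
recursive-exclude examples *.sdf *.mol2 *.oeb *.gz
EOF
python setup.py sdist
tar -tzf dist/openforcefield-*.tar.gz | sort
```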
My most recent pull request removed most of the files in `filter_molecule_sets`, if that helps.
I think we might have to clear out those files from the git history: https://stackoverflow.com/questions/2100907/how-to-remove-delete-a-large-file-from-commit-history-in-git-repository
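To figure out which blobs are worth clearing, this standard one-liner (run inside a clone) ranks every blob in the history by size:

```bash
# List every object in the history, then sort the blobs by size (largest first).
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $3, $4}' |
  sort -rn | head -20
```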
@t-kimber and I recently used BFG Repo Cleaner (mentioned in the SO question) with great success. It's a Java program (a single jar), but easy to use.
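Roughly the workflow, for reference (a sketch: the jar comes from https://rtyley.github.io/bfg-repo-cleaner/, the 1M threshold is just an example, and the final force-push rewrites published history, so everyone with clones needs a heads-up):

```bash
# Run BFG against a fresh mirror clone so no working tree is involved.
git clone --mirror https://github.com/openforcefield/openff-toolkit.git
java -jar bfg.jar --strip-blobs-bigger-than 1M openff-toolkit.git
cd openff-toolkit.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
# git push   # only after coordinating with everyone who has clones/forks
```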
I've used that before too.
This remains a minor nuisance when needing to `pip install git+git://github.com/openforcefield/openff-toolkit.git@param-iter`, since it clones the repo with all of its history and pip does not support shallow cloning. It takes about 2 minutes on my mediocre residential connection, but still about 20-30 seconds on CI machines that probably have 1-10 gigabit connections. Not the slowest step in CI builds, but it adds up when every workflow needs to run it.
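One workaround (sketched here) would be to shallow-clone the branch manually and install from the local path:

```bash
# Shallow-clone just the tip of the branch, then install from the checkout.
git clone --depth 1 --branch param-iter \
    https://github.com/openforcefield/openff-toolkit.git
pip install ./openff-toolkit
```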
You can do `pip install https://github.com/openforcefield/openff-toolkit/archive/param-iter.tar.gz`, I think!
Oh, hey, that works great! I had assumed that GitHub didn't make archives available for all branches (i.e. untagged feature branches), but I was wrong. That drops the install time to just a few seconds.
AFAIK the "filename" can be any git ref, so even commit hashes will work.
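E.g. (the trailing SHA is a placeholder):

```bash
# A branch name, tag, or full commit SHA should all work as the archive ref.
pip install https://github.com/openforcefield/openff-toolkit/archive/param-iter.tar.gz
pip install https://github.com/openforcefield/openff-toolkit/archive/<full-commit-sha>.tar.gz
```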
Checking out `openforcefield` from GitHub now pulls down over 200 MB in over 5000 files. What happened here?