openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
313 stars 92 forks source link

reduce use of test_toolkits.get_mini_drug_bank() #958

Open adalke opened 3 years ago

adalke commented 3 years ago

I will be working on the toolkit wrapper API. I want to reduce the time needed to run the tests to make it easier to run the test suite after changes. Currently (with RDKit but not OEChem available) it takes about 11 seconds to import openff.toolkit.tests.toolkits. Of that, about 8 seconds is in test_toolkits.py:get_mini_drug_bank(). I think this can be reduced significantly.

That function is used only in a pytest decorator, like:

    @pytest.mark.parametrize("molecule", get_mini_drug_bank(OpenEyeToolkitWrapper))
    def test_to_inchi(self, molecule):

This parses molecules/MiniDrugBank.sdf at each call. There are three for each toolkit.

This could be reduced by caching the molecule loading, and/or moving the load into the test function rather than use pytest.mark.parameterize. The former would speed up all tests. The latter would help someone who wants to run only some of the tests in the module.

A further analysis suggests that most of those test cases are not needed. The test for toolkit_wrapper's to_inchi() is mostly forwarding the call to the underlying toolkit. There seems to be no benefit for OpenFF to augment the QA process for the toolkit developers.

If there is a benefit, then the test cases should be chosen to be more diverse, for example, auranofin with a gold atom, which gets disconnected by InChI.

Along these lines, InChI is not meant as an input structure format. How much does OpenFF need to support this option? (Perhaps best left for another issue.)

I will be working on this topic and adding notes/comments to this issue.

j-wags commented 3 years ago

A further analysis suggests that most of those test cases are not needed. The test for toolkit_wrapper's to_inchi() is mostly forwarding the call to the underlying toolkit. There seems to be no benefit for OpenFF to augment the QA process for the toolkit developers.

This is largely due to our ignorance -- We want this test to fail if a code change in openff.toolkit.topology.Molecule.to_rdkit would cause it to output some RDMols that are unable to make InChIs. But we don't know of a more "first principles" way to guard for this this without just throwing a bunch of molecules through the process.

Do you know if there's a more general-purpose way to ensure that Molecule.to_rdkit is doing the right thing?

If there is a benefit, then the test cases should be chosen to be more diverse, for example, auranofin with a gold atom, which gets disconnected by InChI.

I agree that a more carefully-selected test set would be a big improvement. Ideally one which contains a few "boring" molecules, and also molecules that exercise tricky edge cases or complexities (like the metal you mentioned, and molecules that vary by which aromaticity model is used)