Create test dataset

This script generates a test set for evaluating models trained on SPICE. It tries to measure how well models generalize to new molecules that weren't in the training set, and more specifically how well they generalize to larger molecules than they were trained on.

It includes the following.

- 200 LigandExpo molecules with between 40 and 50 atoms. The amino-acid/ligand subset of the training set also drew on LigandExpo, but its largest molecules have only 36 atoms, so none of the molecules selected here appear in it. The training set does contain many PubChem molecules of this size, so this group measures generalization to new molecules of the same size as the training data.
- 200 LigandExpo molecules with between 70 and 80 atoms. These are larger than any single molecule in the training set (though some clusters are this large), so this group measures generalization to larger molecules.
- 200 random pentapeptides. The training set contains all possible dipeptides, so this group measures generalization to longer peptides (and hopefully to proteins, although running QM on full proteins would be very expensive).

There are 10 conformations for each molecule, giving a total of 6000 conformations.
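
As a rough illustration of the selection described above, here is a minimal Python sketch. It is not the script from this pull request: the helper names `random_pentapeptides` and `select_by_size`, the fixed seeds, the inclusive atom-count bounds, and the choice to draw distinct peptide sequences are all assumptions made for the example.

```python
import random

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_pentapeptides(count, seed=0):
    """Return `count` distinct random five-residue sequences (hypothetical helper)."""
    rng = random.Random(seed)
    sequences = set()
    while len(sequences) < count:
        sequences.add("".join(rng.choice(AMINO_ACIDS) for _ in range(5)))
    return sorted(sequences)


def select_by_size(atom_counts, min_atoms, max_atoms, count, exclude=(), seed=0):
    """Randomly pick `count` molecule IDs with min_atoms <= atoms <= max_atoms.

    `atom_counts` maps a molecule ID to its atom count. IDs listed in
    `exclude` (for example, molecules already in the training set) are skipped.
    Treating the bounds as inclusive is an assumption.
    """
    excluded = set(exclude)
    candidates = sorted(
        mol_id
        for mol_id, n_atoms in atom_counts.items()
        if min_atoms <= n_atoms <= max_atoms and mol_id not in excluded
    )
    # random.sample raises ValueError if there are fewer candidates than `count`.
    return random.Random(seed).sample(candidates, count)


# Group sizes and conformations per molecule as stated in the description.
CONFORMATIONS_PER_MOLECULE = 10
molecules_per_group = {
    "ligands_40_50": 200,
    "ligands_70_80": 200,
    "pentapeptides": 200,
}
total_conformations = sum(molecules_per_group.values()) * CONFORMATIONS_PER_MOLECULE
```

With the three groups of 200 molecules and 10 conformations each, `total_conformations` matches the total of 6000 given above.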

openmm / spice-dataset

Create test dataset #98