Open jchodera opened 5 years ago
We would also want to filter this set first, or at least filter it more aggressively than typical QCA datasets, since much of it contains rather unusual chemistry relative to drugs.
I view the Gobbi set as higher priority, though that one has additionally been filtered by criteria relating to electron density quality and pKa.
The PDB Ligand Expo contains ~30K small molecules that appear in the PDB. This would be a good set to ensure we have adequate coverage of chemical space, since these molecules are of high interest in research.
A minimal subset of 648 has already been prepared by @bgobbi in https://github.com/openforcefield/open-forcefield-data/pull/30
We will likely need to fragment this dataset before processing it, since some of these molecules are large (up to 79 atoms).
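As a rough illustration of the size triage described above, here is a minimal sketch that separates entries small enough to process directly from those that would need fragmenting first. Everything in it is an assumption for illustration: the `Ligand` record, the `split_by_size` helper, the placeholder IDs, and especially the `MAX_ATOMS` cutoff, which this issue does not specify. Atom counts are assumed to be precomputed by whatever toolkit loads the set, and the fragmentation step itself is not shown.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the issue gives no threshold, only that the
# largest entries reach 79 atoms.
MAX_ATOMS = 40


@dataclass(frozen=True)
class Ligand:
    """Hypothetical record: an identifier plus a precomputed atom count."""
    ligand_id: str
    n_atoms: int


def split_by_size(ligands, max_atoms=MAX_ATOMS):
    """Return (direct, to_fragment): entries at or below the cutoff,
    and entries above it that would need fragmenting first."""
    direct, to_fragment = [], []
    for lig in ligands:
        if lig.n_atoms > max_atoms:
            to_fragment.append(lig)
        else:
            direct.append(lig)
    return direct, to_fragment


# Placeholder entries, not real Ligand Expo data.
ligands = [Ligand("LIG_A", 12), Ligand("LIG_B", 79), Ligand("LIG_C", 40)]
direct, to_fragment = split_by_size(ligands)
print("process directly:", [lig.ligand_id for lig in direct])
print("fragment first:", [lig.ligand_id for lig in to_fragment])
```

The cutoff is inclusive, so an entry with exactly `MAX_ATOMS` atoms is processed directly; the right value would depend on what the downstream calculations can handle.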