openforcefield / qca-dataset-submission

Data generation and submission scripts for the QCArchive ecosystem.

Potential dataset: PDB Ligand Expo (30K molecules) #18

Open jchodera opened 5 years ago

jchodera commented 5 years ago

The PDB Ligand Expo contains ~30K small molecules that appear in the PDB. This would be a good set to ensure we have adequate coverage of chemical space, since these molecules are of high interest in research.

A minimal subset of 648 has already been prepared by @bgobbi in https://github.com/openforcefield/open-forcefield-data/pull/30

We will likely need to fragment this dataset before processing it, because some of these molecules are large (up to 79 atoms).
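As a rough illustration of the size-based triage this implies, here is a minimal sketch that separates ligands small enough to submit directly from those that would be candidates for fragmentation. The cutoff `MAX_ATOMS`, the function name `split_by_size`, and the ligand IDs and atom counts are all hypothetical; the thread only states the 79-atom maximum, and the real atom counts would come from a cheminformatics toolkit.

```python
from typing import Dict, List, Tuple

# Hypothetical cutoff: the thread gives no threshold, only the 79-atom maximum.
MAX_ATOMS = 40


def split_by_size(
    atom_counts: Dict[str, int], max_atoms: int = MAX_ATOMS
) -> Tuple[List[str], List[str]]:
    """Return (IDs small enough to submit directly, IDs to fragment first)."""
    direct: List[str] = []
    to_fragment: List[str] = []
    # Iterate in sorted order so the output is deterministic.
    for ligand_id, n_atoms in sorted(atom_counts.items()):
        if n_atoms <= max_atoms:
            direct.append(ligand_id)
        else:
            to_fragment.append(ligand_id)
    return direct, to_fragment


# Made-up ligand IDs and atom counts, for illustration only.
example = {"LIG_A": 12, "LIG_B": 79, "LIG_C": 40}
direct, to_fragment = split_by_size(example)
```

The actual fragmentation step is not shown; this only identifies which entries would need it under the assumed cutoff.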

davidlmobley commented 5 years ago

We would also want to filter first, or at least filter more aggressively than we do for typical QCA datasets, since much of this set will contain rather unusual chemistry relative to drugs.

I view the Gobbi set as higher priority, though that one has additionally been filtered by criteria relating to electron density quality and pKa.