openforcefield / qca-dataset-submission

Data generation and submission scripts for the QCArchive ecosystem.

Add several high priority datasets for benchmarking #51

Open davidlmobley opened 5 years ago

davidlmobley commented 5 years ago

We need several additional datasets for benchmarking and testing. @jchodera has volunteered to prep these this weekend, so this issue collects everything in one place, listed in the order of priority I would assign:

  1. Pfizer set. 100 challenging fragments from Pfizer for torsion drives. #50
  2. Genentech set. Optimization dataset as provided, with the largest molecules filtered out first. Then an optimization dataset and a torsion drive dataset after fragmentation. #48
  3. DrugBank FDA drugs. The DrugBank set discussed here would be a good choice; I'd focus on the FDA-approved small-molecule drugs, throw out everything big and everything very small, then fragment for optimization and torsion drives. Probably also remove anything with pentavalent carbon for good measure. Problem: I don't have a DrugBank account yet, and it seems to take two business days for one to be approved.
  4. Informative set. Optimization dataset of 1117 informative fragments. Discussed in issue #46. (The larger set includes 9000 compounds, which could be fragmented and then used for torsion drives.)

I'm checking into some options on (3) so I might have updates. Or not.

jchodera commented 5 years ago