openforcefield / openff-qcsubmit

Automated tools for submitting molecules to QCFractal
https://openff-qcsubmit.readthedocs.io/en/latest/index.html
MIT License

Missing molecules when submitted. #61

Open jthorton opened 3 years ago

jthorton commented 3 years ago

Sometimes when we submit new datasets, fewer tasks are created in QCArchive than expected. In one case, for example, QCSubmit expects 1043 unique optimizations, but running the submission locally creates only 1041 tasks. The cause is the fast deduplication check that QCFractal performs when we add a molecule to a dataset: it checks whether index.lower() is already in the dataset. Because we use the SMILES to index the molecules, our index labels are case sensitive, so different molecules can be considered the same. For example:

c1ccc(cc1)Oc2ccccc2 and c1ccc(cc1)OC2CCCCC2

Cc1ccccc1Oc2ccccc2 and Cc1ccccc1OC2CCCCC2
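The collision can be reproduced without QCFractal. The following is a minimal sketch in plain Python that only mimics the `index.lower()` comparison described above; the helper name `find_lowercase_collisions` is hypothetical and is not part of QCSubmit or QCFractal.

```python
from collections import defaultdict


def find_lowercase_collisions(indices):
    """Group index labels that become identical after str.lower().

    Hypothetical helper: it imitates a case-insensitive duplicate check
    and does not call any QCFractal or QCSubmit code.
    """
    groups = defaultdict(list)
    for index in indices:
        groups[index.lower()].append(index)
    # Keep only the lowercased keys shared by more than one label.
    return {key: labels for key, labels in groups.items() if len(labels) > 1}


# The four SMILES from this issue. In each pair the second ring is
# aromatic (lowercase c) in one molecule and saturated (uppercase C)
# in the other.
smiles = [
    "c1ccc(cc1)Oc2ccccc2",
    "c1ccc(cc1)OC2CCCCC2",
    "Cc1ccccc1Oc2ccccc2",
    "Cc1ccccc1OC2CCCCC2",
]

collisions = find_lowercase_collisions(smiles)
for key, labels in collisions.items():
    print(f"{key}: {labels}")
```

Each of the two pairs collapses to a single lowercased key, which is consistent with two tasks going missing (1043 expected, 1041 created).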

So to solve this we would have to make sure that the index for the molecule is either not case sensitive (like an InChIKey) or still unique after lowercasing, for example by adding explicit hydrogens to the SMILES.

jthorton commented 3 years ago

QCArchive plans to stop lowercasing the index under which molecules are stored, which will allow us to use case-sensitive indexing. I will leave this open until the problem is fixed in QCArchive.

Until then, users can work around this issue by changing the index in the dataset object, adding any tag they wish to make the index unique.
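One way to choose such tags is sketched below in plain Python. It assumes the index labels are available as a list of strings that is already free of exact duplicates; the function name and the `-1`, `-2` tag format are hypothetical. How the new labels are written back to the dataset object depends on the QCSubmit version and is not shown.

```python
def make_unique_when_lowercased(indices):
    """Return labels that stay distinct after str.lower().

    Hypothetical helper, not part of QCSubmit. A numeric tag is appended
    only to labels whose lowercased form has already been used.
    Assumes the input contains no exact duplicates.
    """
    used = set()  # lowercased labels handed out so far
    unique_labels = []
    for index in indices:
        candidate = index
        counter = 0
        # Increase the counter until the lowercased candidate is unused.
        while candidate.lower() in used:
            counter += 1
            candidate = f"{index}-{counter}"
        used.add(candidate.lower())
        unique_labels.append(candidate)
    return unique_labels


smiles = [
    "c1ccc(cc1)Oc2ccccc2",
    "c1ccc(cc1)OC2CCCCC2",
    "Cc1ccccc1Oc2ccccc2",
    "Cc1ccccc1OC2CCCCC2",
]

new_labels = make_unique_when_lowercased(smiles)
print(new_labels)
```

Only the second member of each colliding pair receives a tag, so labels that never collided are left unchanged.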