openforcefield / openff-qcsubmit

Automated tools for submitting molecules to QCFractal
https://openff-qcsubmit.readthedocs.io/en/latest/index.html
MIT License
26 stars 4 forks source link

Make loading datasets skip over corrupt entries with invalid CMILES #228

Closed j-wags closed 1 year ago

j-wags commented 1 year ago

Todos

Status

jthorton commented 1 year ago

@j-wags Thanks for sorting this one, I have added a mocked dataset with an invalid entry which should test this case. Thinking about the other dataset types would we want to extend this fix to those as well?

codecov[bot] commented 1 year ago

Codecov Report

Merging #228 (19f3b1b) into main (1538216) will decrease coverage by 0.02%. The diff coverage is 92.30%.

Additional details and impacted files
j-wags commented 1 year ago

Oh my gosh, I tried so hard to make a mocked dataset like that, and now you made it look so easy.

Ok, merging in the changes from main, this looks pretty good. In its current form this keeps the OE backend from crashing on bad entries, but loading the dataset using OE vs. RDKit will yield a different number of molecules. I might make a push to standardize that, but in the longer run I'd kinda rather switch QCSubmit over to an RDKit-only backend to avoid this whole category of problems.

This may be ready to merge as-is. I'll sleep on it. Thanks again for the help @jthorton!

jthorton commented 1 year ago

In its current form this keeps the OE backend from crashing on bad entries, but loading the dataset using OE vs. RDKit will yield a different number of molecules.

Good point should this be an issue on the toolkit that it should raise an error when trying to create a molecule from an incomplete mapped smiles?

I might make a push to standardize that, but in the longer run I'd kinda rather switch QCSubmit over to an RDKit-only backend to avoid this whole category of problems.

That was my original idea however protonation state enumeration is only available via openeye currently and removing that would be an issue for reproducibility :(

j-wags commented 1 year ago

Yeah, after some thought the only pathway to consistently handling the corrupt molecules with both backends is to make changes in RDKitToolkitWrapper in the OFF Toolkit.

I've added some to-dos above - Most of them are bookkeeping and followups, but one thing I'm going to do here is add a warning when a dataset entry is skipped for having an invalid CMILES.

j-wags commented 1 year ago

Thanks for the huge assist @jthorton!