raver8 / ML_chemical

23 inconsistent SMILES <--> InChI <--> InChIKey representations #10

Closed jschrier closed 1 year ago

jschrier commented 1 year ago

I've added the next sanity check: For each record that is a pure substance (has molecular identifiers), is the given SMILES (when converted to a Molecule) consistent with the given InChI (when converted)? And is the InChI consistent with the InChIKey?

There are 23 cases where this sanity check fails. The check code is in the updated merge_chemical_dictionaries.nb.

Here's an example of the first error:

TestObject["consistentSMILESInChIQ[<|names -> {AMMONIUM NITRATE, \
6484-52-2, Norway Saltpeter, Ammonium Nitricum, Ammonium Saltpeter, \
Nitrate Of Ammonia, Nitric Acid Ammonium Salt, Azanium Nitrate}, pure \
substance -> True, CAS -> 6484-52-2, SMILES -> \
O=[N+]([O-])[O-].[N+]([H])([H])([H])[H], InChIKey -> \
DVARTQFDIMZBAA-UHFFFAOYSA-O, InChI -> \
InChI=1S/NO3.H3N/c2-1(3)4;/h;1H3/q-1;/p+1, PubchemCID -> 22985, URL -> \
https://pubchem.ncbi.nlm.nih.gov/compound/22985|>]"],

What's going on here? Looks like a proton tautomer difference in the representation: one is NH4+ . NO3-, the other is NH3 . HNO3. So it should be easy to make these consistent. As before, you should be able to run the whole merge_chemical_dictionaries.nb notebook to confirm that you've solved all errors.
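As a quick way to narrow down protonation mismatches like this one, the sketch below compares the InChI `/p` layer with the last character of the InChIKey, which encodes the same proton count (N = 0, O = +1, P = +2, M = -1, L = -2, and so on, per the InChI technical documentation). It is a Python illustration only and is not the check used in merge_chemical_dictionaries.nb; the function names are hypothetical.

```python
import re


def inchi_proton_count(inchi: str) -> int:
    """Net protons added (+) or removed (-) according to the InChI /p layer."""
    match = re.search(r"/p([+-]\d+)", inchi)
    return int(match.group(1)) if match else 0


def inchikey_proton_count(inchikey: str) -> int:
    """Decode the final InChIKey character: N = 0, O = +1, P = +2, M = -1, L = -2, ..."""
    flag = inchikey.strip().rsplit("-", 1)[-1]
    if len(flag) != 1 or not ("B" <= flag <= "Z"):
        # 'A' marks a proton count outside the encodable range; this sketch
        # treats it (and anything malformed) as undecodable.
        raise ValueError(f"cannot decode protonation flag {flag!r}")
    return ord(flag) - ord("N")


def protonation_consistent(inchi: str, inchikey: str) -> bool:
    """True when the InChI /p layer and the InChIKey flag give the same count."""
    return inchi_proton_count(inchi) == inchikey_proton_count(inchikey)


# Identifiers copied from the ammonium nitrate record above.
inchi = "InChI=1S/NO3.H3N/c2-1(3)4;/h;1H3/q-1;/p+1"
inchikey = "DVARTQFDIMZBAA-UHFFFAOYSA-O"
print(inchi_proton_count(inchi), inchikey_proton_count(inchikey))
print(protonation_consistent(inchi, inchikey))
```

For the ammonium nitrate record both values come out as +1, so this particular check passes. It only looks at the protonation layer and does not replace the full SMILES <--> InChI <--> InChIKey structural comparison in the notebook.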

oliviavanden commented 1 year ago

Sounds good! I'll run the merge_chemical_dictionaries.nb and fix the errors.

jschrier commented 1 year ago

@oliviavanden: to save you a step, the current version of merge_chemical_dictionaries.nb now lists the active errors (no need to run it again until you want to check whether they are fixed).

oliviavanden commented 1 year ago

Awesome thanks!

oliviavanden commented 1 year ago

Most of the errors have been fixed, with only 4 remaining. The issues identified so far are with Ammonium Nitrate, N1,N3-di(hexa-1,3,5-triyn-1-yl)-N1,N3-dimethylmalonamide--dihydrogen (1/10), and N-(3,4,5-trimethylphenyl)-1,10-phenanthroline-2-carboxamide. These molecules either have bugs or need further checking against the literature.

oliviavanden commented 1 year ago

Most of these errors were bad SMILES, InChIs, or InChIKeys. I used the Mathematica script provided, along with other Mathematica checks, to see which information was consistent and which wasn't.