SMILES outputs from LOTUS and WikiData

alrichardbollans commented 1 year ago

I've downloaded some metabolite data from LOTUS and am trying to cross-reference it with data from ChEMBL. One of the more reliable ways to do this seems to be matching on the SMILES string.

Looking at some examples in LOTUS, e.g. https://lotus.naturalproducts.net/compound/lotus_id/LTS0095286, the SMILES given by Wikidata are (canonical) "COC1=CC2=C(C=CN=C2C=C1)C(C3CC4CCN3CC4C=C)O" and (isomeric) "COC1=CC2=C(C=CN=C2C=C1)C@HO", neither of which appears to have a direct match in ChEMBL. In contrast, the 2D SMILES given by LOTUS for this metabolite, "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12", matches the ChEMBL compound (https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL15088/).

This 2D SMILES given by LOTUS appears in general to match ChEMBL, and seems to be the result of applying the RDKit method Chem.CanonSmiles(x) to the 'canonical' SMILES given in Wikidata. My question: is it possible to download this 2D SMILES directly and, if not, is my guess as to how it is generated correct?

Note, I'm downloading the data using the query:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_cas ?structure_inchikey ?organism ?organism_name WHERE {
VALUES ?taxon {
  wd:Q21754 # Gentianales
}
?organism (wdt:P171*) ?taxon;
  wdt:P225 ?organism_name.
?structure (p:P703/ps:P703) ?organism.
OPTIONAL { ?structure wdt:P235 ?structure_inchikey. }
OPTIONAL { ?structure wdt:P233 ?structure_smiles. }
OPTIONAL { ?structure wdt:P231 ?structure_cas. }
OPTIONAL { ?organism wdt:P961 ?ipniID. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000
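For anyone wanting to run the query above from a script, here is a minimal sketch that parameterises the taxon and builds the request URL. The endpoint URL and the `format=json` parameter are those of the public Wikidata Query Service and are not stated in this thread; the names `build_query` and `build_request_url` are hypothetical. The HTTP request itself is left out.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint (assumption: not given in the thread).
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above, with placeholders for the taxon and the row limit.
QUERY_TEMPLATE = """
SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_cas
                ?structure_inchikey ?organism ?organism_name WHERE {
  VALUES ?taxon { wd:__TAXON__ }
  ?organism (wdt:P171*) ?taxon;
    wdt:P225 ?organism_name.
  ?structure (p:P703/ps:P703) ?organism.
  OPTIONAL { ?structure wdt:P235 ?structure_inchikey. }
  OPTIONAL { ?structure wdt:P233 ?structure_smiles. }
  OPTIONAL { ?structure wdt:P231 ?structure_cas. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT __LIMIT__
"""


def build_query(taxon_qid: str, limit: int = 100000) -> str:
    """Fill in the taxon QID (e.g. 'Q21754' for Gentianales) and the limit."""
    # Reject anything that is not a plain QID so arbitrary text cannot
    # end up inside the query.
    if not (taxon_qid.startswith("Q") and taxon_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata QID: {taxon_qid!r}")
    return (
        QUERY_TEMPLATE
        .replace("__TAXON__", taxon_qid)
        .replace("__LIMIT__", str(int(limit)))
    )


def build_request_url(query: str) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
```

Calling `build_request_url(build_query("Q21754"))` gives a URL that can be fetched with any HTTP client.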

Adafede commented 1 year ago

Hi,

Matching stereochemically defined structures to their 2D equivalents is risky, especially if you later link them to bioactivity data. Regarding your question, I would rather use InChIKeys to cross-reference them. They were designed for that purpose, unlike SMILES, which are not unique. If you still want to match at the 2D level, you can do so by matching only the first 14 characters of the InChIKey.
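A minimal sketch of that 14-character match, assuming both datasets are available as plain lists of InChIKeys. The function names are hypothetical, and the keys in the usage example are made-up placeholders, not real compounds.

```python
import re
from collections import defaultdict

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def first_block(inchikey: str) -> str:
    """Return the first 14 characters of a well-formed InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_PATTERN.fullmatch(key):
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return key[:14]


def match_2d(lotus_keys, chembl_keys):
    """Map each LOTUS InChIKey to the ChEMBL InChIKeys sharing its first block."""
    by_block = defaultdict(list)
    for key in chembl_keys:
        by_block[first_block(key)].append(key)
    # .get() avoids creating empty entries for blocks that have no match.
    return {key: by_block.get(first_block(key), []) for key in lotus_keys}


# Placeholder keys for illustration only.
lotus = ["AAAAAAAAAAAAAA-BBBBBBBBSA-N"]
chembl = ["AAAAAAAAAAAAAA-CCCCCCCCSA-N", "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N"]
matches = match_2d(lotus, chembl)
```

A first-block match can pair different stereoisomers, so the caveat about bioactivity data still applies to anything matched this way.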

Side note 1:

Wikidata offers ChEMBL IDs, even if they might not be comprehensive:

SELECT * WHERE { 
?item wdt:P592 ?chembl_id. 
}
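To use these IDs for cross-referencing, one option is to also request the InChIKey (P235, as in the first query) and index the results by it. This is a rough sketch that assumes the query was extended with `?item wdt:P235 ?inchikey.` and that the response is in the standard SPARQL JSON results format. The function name and the sample values are hypothetical.

```python
def chembl_ids_by_inchikey(sparql_json: dict) -> dict:
    """Index ChEMBL IDs by InChIKey from a SPARQL JSON results document.

    Expects bindings with 'inchikey' and 'chembl_id' variables (assumed
    variable names); rows missing either one are skipped.
    """
    mapping = {}
    for row in sparql_json["results"]["bindings"]:
        inchikey = row.get("inchikey", {}).get("value")
        chembl_id = row.get("chembl_id", {}).get("value")
        if inchikey and chembl_id:
            # One structure may carry more than one ChEMBL ID, so keep a set.
            mapping.setdefault(inchikey, set()).add(chembl_id)
    return mapping


# Made-up response for illustration; the values are placeholders.
sample = {
    "head": {"vars": ["item", "chembl_id", "inchikey"]},
    "results": {
        "bindings": [
            {
                "chembl_id": {"type": "literal", "value": "CHEMBL0000001"},
                "inchikey": {"type": "literal", "value": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
            },
            {"chembl_id": {"type": "literal", "value": "CHEMBL0000002"}},
        ]
    },
}
index = chembl_ids_by_inchikey(sample)
```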

Side note 2: All the pre-computed "2D SMILES" are also available at: https://doi.org/10.5281/zenodo.5794106

Hope this answers your question

alrichardbollans commented 1 year ago

This is really helpful, thank you!

lotusnprod / lotus-web

SMILES outputs from LOTUS and WikiData #65