Closed alrichardbollans closed 1 year ago
Hi,
Matching stereochemically defined structures to their 2D equivalents is risky, especially if you later link them to bioactivity data. Regarding your question, I would rather use InChIKeys to cross-reference them. They were designed for that purpose, unlike SMILES, which are not unique. If you still want to match at the 2D level, you can do it by comparing only the first 14 characters of the InChIKey.
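A minimal sketch of the 14-character matching described above, assuming you already have an InChIKey for each record on both sides. The function names, the record IDs, and the InChIKeys in the usage example are placeholders made up for illustration, not real LOTUS or ChEMBL entries.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first 14 characters of a standard InChIKey.

    This first block hashes the connectivity (2D) layer only, so two
    stereoisomers share it.
    """
    key = inchikey.strip().upper()
    # A standard InChIKey is 27 characters: 14 + "-" + 10 + "-" + 1.
    if len(key) != 27 or key[14] != "-" or key[25] != "-":
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return key[:14]


def match_2d(lotus: dict, chembl: dict) -> dict:
    """Map each LOTUS ID to the ChEMBL IDs that share its first block.

    Both arguments are {record_id: inchikey} dictionaries (hypothetical
    shapes; adapt to however your tables are loaded).
    """
    by_block = defaultdict(list)
    for chembl_id, key in chembl.items():
        by_block[connectivity_block(key)].append(chembl_id)
    return {
        lotus_id: by_block.get(connectivity_block(key), [])
        for lotus_id, key in lotus.items()
    }


# Placeholder keys, for illustration only.
lotus = {"LTS_EXAMPLE": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"}
chembl = {
    "CHEMBL_X": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "CHEMBL_Y": "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
}
print(match_2d(lotus, chembl))
```

Note that a 2D match can return several ChEMBL records for one LOTUS entry (one per stereoisomer), which is exactly why linking the result to bioactivity data needs care.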
Side note 1:
Wikidata offers ChEMBL IDs, even if they might not be comprehensive:
SELECT * WHERE {
?item wdt:P592 ?chembl_id.
}
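If you want to run the query above from a script, here is a rough sketch that only builds the request URL, assuming the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and its `format=json` parameter. The helper name `build_request_url` is hypothetical. The actual HTTP call is left out, and an unrestricted query like this one may be slow or time out.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint (assumed here).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query from the side note above: every item with a ChEMBL ID (P592).
QUERY = """SELECT * WHERE {
  ?item wdt:P592 ?chembl_id.
}"""


def build_request_url(query: str, endpoint: str = WDQS_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(QUERY)
print(url)
```

You can then fetch `url` with any HTTP client, ideally sending a descriptive User-Agent header.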
Side note 2: All the pre-computed "2D SMILES" are also available at: https://doi.org/10.5281/zenodo.5794106
Hope this answers your question
This is really helpful, thank you!
I've downloaded some metabolite data from LOTUS and am trying to cross-reference it with data from ChEMBL. It seems that one of the more reliable ways to do this would be to use the SMILES as a key.
Looking at some examples in LOTUS, e.g. https://lotus.naturalproducts.net/compound/lotus_id/LTS0095286, the SMILES given by Wikidata are (canonical) "COC1=CC2=C(C=CN=C2C=C1)C(C3CC4CCN3CC4C=C)O" and (isomeric) "COC1=CC2=C(C=CN=C2C=C1)C@HO", neither of which appears to give a direct match in ChEMBL. In contrast, the 2D SMILES given by LOTUS for this metabolite, "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12", matches the ChEMBL compound (https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL15088/).
The 2D SMILES given by LOTUS appears in general to match ChEMBL, and seems to be the result of applying the RDKit method:
Chem.CanonSmiles(x)
to the 'canonical' SMILES given in Wikidata. My question: is it possible to download this 2D SMILES directly and, if not, is my guess as to how it is generated correct? Note, I'm downloading the data using the query: