lotusnprod / lotus-web

Code for LOTUS web
https://lotus.naturalproducts.net/
MIT License

Inconsistency in taxon:compound, name:compound mapping? Example (Q105216729) #59

Open tatyanalivshultz opened 1 year ago

tatyanalivshultz commented 1 year ago

Hello: I want to use Lotus to reconstruct the evolution of natural products on phylogenetic trees. I began entering some references and taxon:compound pairs that are not in Lotus (or Wikidata) and found some unexpected complications and/or inconsistencies in how the taxon:compound and name:compound mappings shown in Lotus are encoded in Wikidata, and in how these associations can be retrieved by querying Wikidata. Can you help me understand what is happening so that I can add and retrieve data effectively? Thank you! Tanya

I give as example the compound blepharin (Q105216729 on Wikidata).

1. If you query "blepharin" in Lotus, you get no results.
2. If you query the InChIKey of blepharin, "PYQSUTLVBSTCSK-UHFFFAOYSA-N", in Lotus, you get "Q105216729 2-[(3-hydroxy-2h-1,4-benzoxazin-2-yl)oxy]-6-(hydroxymethyl)oxane-3,4,5-triol".
3. That name, "2-[(3-hydroxy-2h-1,4-benzoxazin-2-yl)oxy]-6-(hydroxymethyl)oxane-3,4,5-triol", does not appear in the Wikidata (or PubChem) record for "blepharin", and "blepharin" does not occur in the Lotus record for "2-[(3-hydroxy-2h-1,4-benzoxazin-2-yl)oxy]-6-(hydroxymethyl)oxane-3,4,5-triol". Why not? Where is this name coming from, and why doesn't the name in the Wikidata record appear on Lotus (and vice versa)?
4. The chemical structure shown on Lotus for Q105216729 is not the same as the structure shown on PubChem for PubChem CID 14605136 (blepharin). PubChem depicts the structure with N-C(=O); Lotus depicts it with N=C(-OH). Are these the same structure?
5. When you look at the Wikidata record for blepharin (Q105216729), you do not see the taxa that Lotus lists as containing this compound (record Q105216729 on Lotus). The Wikidata record has no "found in taxon P703" statement for the species listed on Lotus: Acanthus montanus, Blepharis edulis, Acanthus ebracteatus.
6. When you click on the Wikidata symbol for the reference for each of these taxon records in Lotus, you will see that the species name and the compound name "blepharin" both occur in the "main subject P921" statement for the reference (Q42783412) on Wikidata. This seems a dangerous way to make the taxon:compound link, since a publication may have multiple species and multiple compounds as its "main subject" without every combination of them actually occurring. How many of the taxon:compound links in Lotus are made via this "main subject" statement in the reference? Is there a plan to transfer these links to "found in taxon P703" statements on the compound Wikidata record?
7. When I query Wikidata for all compounds in, e.g., Acanthus montanus (Q4672080), it does not return blepharin (Q105216729). The query yields 10 compounds. See the query text below.

SELECT DISTINCT ?taxon ?children ?childrenLabel ?structure ?structureLabel ?structure_inchi WHERE {
  VALUES ?taxon {
    wd:Q4672080 # You can remove the Qxxxxxx and hit Ctrl+space, type the first letters and it should autocomplete
  }
  ?children (wdt:P171*) ?taxon. # Include children taxa
  ?structure wdt:P234 ?structure_inchi ; # Get the InChI
             (p:P703/ps:P703) ?children. # Found in given taxon/taxa

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

8. When I query Lotus for the text string "Acanthus montanus", the query returns 16 compounds, including Q105216729. How can I retrieve the Lotus result from Wikidata?
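
For anyone who wants to run the query above from a script rather than the web interface, here is a minimal Python sketch. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql; the function names and the User-Agent string are illustrative and not part of LOTUS. It only returns what Wikidata holds at the moment the query runs, so it cannot reproduce an older Lotus snapshot.

```python
import json
import re
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint (assumed; check the service documentation).
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def build_taxon_query(taxon_qid: str) -> str:
    """Build a query for structures with a 'found in taxon' (P703) statement
    pointing at the taxon or any of its descendants (via parent taxon, P171)."""
    if not re.fullmatch(r"Q[1-9][0-9]*", taxon_qid):
        raise ValueError(f"not a Wikidata item ID: {taxon_qid!r}")
    return f"""
SELECT DISTINCT ?structure ?structureLabel ?structure_inchi WHERE {{
  VALUES ?taxon {{ wd:{taxon_qid} }}
  ?children (wdt:P171*) ?taxon.
  ?structure wdt:P234 ?structure_inchi ;
             (p:P703/ps:P703) ?children.
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def run_query(query: str) -> list:
    """Send the query to the endpoint and return the JSON result bindings.
    Requires network access; the User-Agent value is a placeholder."""
    url = WIKIDATA_SPARQL + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "lotus-issue-59-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(build_taxon_query("Q4672080"))` would fetch the current list for Acanthus montanus; the number of rows depends on the state of Wikidata at that time.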

Adafede commented 1 year ago

Hi @tatyanalivshultz!

Thank you very much for getting in touch! Your project looks fantastic. It is something we have also wanted to do for a long time. Really thrilled to see where this will go! (I quickly went through your bibliography...very cool!)

For the specific questions: First of all, the data on lotus.naturalproducts.net is not up to date. We currently have some limitations in updating it, so it is better to take the data from Wikidata, Zenodo (https://zenodo.org/communities/the-lotus-initiative), or PubChem (which is also fed from Zenodo).

  1. Searching for chemical compounds by name is strongly discouraged. Names are not "identifiers" at all and can lead to huge discrepancies. Checking the names of hundreds of thousands of compounds is not trivial, so many of them may well be incorrect in many sources.
  2. This is exactly what I wanted to suggest. InChIKeys are the way to go.
  3. The names present on the website were generated using proprietary software (molconvert by ChemAxon). This is no longer the case, which is why names can change. There are additional limitations in Wikidata: labels cannot be more than 250 characters long, so sometimes you might not find the name on Wikidata. Moreover, there is currently no "chemical name" property on Wikidata, so we rely only on the label, which anyone can change and possibly adapt to their language. Searching for "limonene" looks very intuitive, but if you want to do so for the whole tree of life, you will have to forget it...
  4. What you mention here are tautomers. We have some of them in the LOTUS corpus, and not all of them can be perfectly standardized. (The chemical "truth" is rather an equilibrium between the different species, which shifts depending on solvent, pH, etc.) This "problem" has been known in cheminformatics for many years, but I think there is still no real solution to it.
  5. The data on Wikidata changes every second. If someone considered a "found in taxon" statement incorrect and removed it, it won't appear anymore. If someone adds new statements (like you did, thank you 😊), they won't appear on the other LOTUS endpoints instantly. We usually try to release quarterly versions of LOTUS that include all the new changes made on Wikidata; these are then stored on Zenodo.
  6. Wow, you dug deep, beautiful! Those statements (on the references) were actually made by one of our collaborators and were based on the "found in taxon" statements we had at the time. They will probably drift out of sync over time, as most people (probably 99%) will only update the data on one side. The "main subject" tagging on articles was done to identify literature matching given subjects, mainly for Scholia. See https://scholia.toolforge.org/taxon/Q135389 for example. This might change in the future following some of our recent discussions (https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry/Natural_products#Mapping_near_to_ubiquituous_compounds).
  7. True, because there is no longer a "found in taxon" statement for Acanthus. This statement was removed from Wikidata (correctly or not; as with any community-based curation, 99% of it is good, and although we cannot avoid human errors, it tends to improve over time).
  8. I think you already found https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/Natural_products#Queries to guide some of your queries. Your query is correct, no issue: 6 of the 16 compounds present in the outdated data were removed.
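
To illustrate points 1 and 2, here is a rough Python sketch of looking a compound up by InChIKey instead of by name. It assumes the Wikidata property P235 (InChIKey) and the same `p:P703/ps:P703` path used in the query earlier in this thread; the helper names are made up for this example. It only builds the query text, which can then be pasted into the Wikidata Query Service.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text: str) -> bool:
    """Return True if the text has the shape of an InChIKey.
    This checks the format only, not whether the key exists anywhere."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None


def build_inchikey_query(inchikey: str) -> str:
    """Build a SPARQL query that finds the Wikidata item carrying this
    InChIKey (P235) and, if present, its 'found in taxon' (P703) statements."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return f"""
SELECT ?structure ?taxon ?taxonLabel WHERE {{
  ?structure wdt:P235 "{inchikey}" .
  OPTIONAL {{ ?structure (p:P703/ps:P703) ?taxon . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


# Hypothetical usage with the key quoted in the question above.
print(build_inchikey_query("PYQSUTLVBSTCSK-UHFFFAOYSA-N"))
```

The `OPTIONAL` clause keeps the compound in the result even when it currently has no "found in taxon" statement, which is the situation described in point 7.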

We are really happy to discuss anything in more depth! Please reach out so we can see how to best help you achieve what you want (also including chemical similarity in the speciation gradient, for example, etc.)

tatyanalivshultz commented 1 year ago

Dear Adriano:

Thank you for your detailed reply. I apologize for my delayed reply. I've been working on other projects and am just now getting back to this one.

I would like to contribute to Lotus, via Wikidata, the taxon:compound information for Apocynaceae that I have among my references, and then export all the taxon:compound records from Wikidata to reconstruct evolution of chemical classes on a phylogeny of Apocynaceae.

Thank you for your detailed replies to my questions about chemical names (1-3). I am not a chemist (I'm a plant taxonomist), so the redundancy and ambiguity of chemical names are unfamiliar to me. Given what you told me, I think the following would be the best way for me to proceed:

  1. If the article provides a SMILES or InChIKey formula for a chemical, use that to find it in PubChem.
  2. If there is only a chemical name, locate an image of the chemical structure. Either the structure is illustrated in the article or it is illustrated in one of the references.
  3. Capture each image as a png file and use OSRA (https://cactus.nci.nih.gov/cgi-bin/osra/index.cgi) to convert it to a SMILES formula.
  4. Check that the SMILES formula is correct by using SMILES to Image (http://hulab.rxnfinder.org/smi2img/) to check against the original image.
  5. Search PubChem for the SMILES formula.
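
The PubChem search in the last step could also be scripted. The sketch below assumes PubChem's PUG REST service and its `compound/smiles/.../cids/TXT` URL form; please verify this against the PUG REST documentation before relying on it. The function names are illustrative, and the SMILES is percent-encoded because characters such as `(`, `=`, `#` and `/` are not safe in a URL path.

```python
import urllib.parse
import urllib.request

# Base URL of PubChem PUG REST (assumed; see the PubChem documentation).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_cid_url(smiles: str) -> str:
    """Build the lookup URL that asks PubChem for the CID(s) of a SMILES."""
    encoded = urllib.parse.quote(smiles, safe="")
    return f"{PUG_REST}/compound/smiles/{encoded}/cids/TXT"


def fetch_cids(smiles: str) -> str:
    """Fetch the raw text response for a SMILES lookup.
    Requires network access. How the service answers for a structure it
    does not hold is not handled here and should be checked separately."""
    with urllib.request.urlopen(pubchem_cid_url(smiles)) as response:
        return response.read().decode("utf-8").strip()


# Example with a small, well-known structure (acetic acid).
print(pubchem_cid_url("CC(=O)O"))
```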

One of the first issues that I encountered with this approach is a compound that doesn't appear to be in PubChem. I'm pretty sure that the SMILES formula is correct. How would I proceed in this situation?

I'm attaching the reference to this email. The compound is compound 1 in the article (see also the attached png). Below is the SMILES formula that I got from OSRA and checked on SMILES to Image:

Nc1ccccc1C(=O)OC3OC(COC2OCC(O)(CO)C2O)C(O)C(O)C3O
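
Because image-to-structure tools can drop or garble characters, a quick automated sanity check of the SMILES text may be worth running before searching with it. The Python sketch below is only a rough plausibility test written for this thread, not a SMILES parser: it checks that parentheses balance and that single-digit ring-closure labels occur in pairs, and it skips anything inside square brackets. It does not validate chemistry and does not handle two-digit `%nn` ring labels properly. A cheminformatics toolkit such as RDKit would be the more reliable choice.

```python
def smiles_sanity_problems(smiles: str) -> list:
    """Return a list of obvious textual problems in a SMILES string.
    An empty list means none of these simple checks failed; it does not
    prove the SMILES is valid or matches the intended structure."""
    problems = []
    depth = 0             # current nesting depth of branch parentheses
    open_rings = set()    # ring-closure digits seen an odd number of times
    in_bracket = False    # True while inside a [...] atom description

    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue      # digits in brackets are isotopes, H counts, charges
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without an opening one")
                depth = 0
        elif ch.isdigit():
            open_rings ^= {ch}   # toggle: first sighting opens, second closes

    if depth > 0:
        problems.append("unclosed parenthesis")
    if open_rings:
        problems.append("unclosed ring bond(s): " + ", ".join(sorted(open_rings)))
    return problems


# The SMILES quoted above passes these simple checks.
print(smiles_sanity_problems("Nc1ccccc1C(=O)OC3OC(COC2OCC(O)(CO)C2O)C(O)C(O)C3O"))
```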

If you have any advice on best practices for doing this, I'd love to hear it!

Best regards, Tanya

Tatyana Livshultz

On Thu, Mar 16, 2023 at 3:17 AM Adriano Rutz @.***> wrote:


Adafede commented 1 year ago

Your work looks amazing (and we clearly need plant taxonomists!).

If there is anything we can do to help, very happy to!

What you suggest looks good. For structure recognition from images, I would recommend using https://decimer.ai/, developed by some of our collaborators.

Do not hesitate to contact us for more details if needed.

PS: As a starting point: https://w.wiki/6bt5

Adafede commented 1 year ago

@tatyanalivshultz By the way, thanks to some amazing collaborators, a huge list of novel alkaloid occurrences was added to Wikidata, see: https://www.wikidata.org/w/index.php?title=Special:Contributions/NPImporterBot&target=NPImporterBot&offset=&limit=500