lotusnprod / lotus-web

Code for LOTUS web
https://lotus.naturalproducts.net/
MIT License

How to download all the chemical compounds and their related data for an organism from LOTUS? #27

Closed ap1438 closed 2 years ago

ap1438 commented 2 years ago

I have an organism, and I want to download all the chemical compounds related to that organism, together with their SMILES and the species that produce them.

I searched on the web page and found all the chemical compound entries related to that organism. I then downloaded the SDF file, which was the only download option available, and converted it to Excel format.

But I then realized that the file was missing the compound names.

What I want is: compound name, SMILES, and the species in which the compound is present.

Is it possible to get this from the LOTUS database by any means?

Adafede commented 2 years ago

Hi!

Thank you for your issue.

Actually, we support a lot of custom searches (see https://lotus.naturalproducts.net/documentation) but not the specific one you requested.

We might provide a SPARQL endpoint in the future to handle such requests but in the meantime, querying Wikidata directly seems a good option.

I prepared a query you can easily adapt to your needs: https://w.wiki/5GSw. You can directly download the results as a tabular file there.

Another option could be to use https://pubchem.ncbi.nlm.nih.gov/classification/#hid=115 and search there directly; they also offer CSV download.

More generally, the compounds' names are automatically generated so we would advise being very cautious with them.

Best

ap1438 commented 2 years ago

Thank you for your quick response and valuable suggestion. Looking at the code and the downloaded data, I saw that the molecular formula field was missing. So I tried to modify the code to download the molecular formula as well, but I don't know why it reports that the query time limit was reached. This is the code I tried:

https://w.wiki/5GgJ

Can you check it and tell me where I went wrong?

Adafede commented 2 years ago

You were almost there!

I think the query you want is: https://w.wiki/5Ggd

Yours was querying the whole of Wikidata for molecules again.

ap1438 commented 2 years ago

Thanks for the correction and insights.

ap1438 commented 2 years ago

A search for "Gentiana" on the LOTUS web page returned 483 natural products, but the Wikidata query returns 768. Why is there such a large difference?

Can you please let me know the reason behind the difference?

Adafede commented 2 years ago

Hi,

Not exactly: the query I wrote for you returns structure-organism pairs, so the same structure can appear multiple times. If you want to reduce it to distinct structures, use this: https://w.wiki/5J73.
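
To make the pairs-versus-structures distinction concrete, here is a minimal sketch that reports both counts side by side. It is not the content of the short link above. It assumes the same properties as the queries posted in this thread (P171 parent taxon, P233 SMILES, P703 found in taxon), and it uses the taxon item from a later post in this thread purely as a placeholder to replace with your own. It has not been run here, so no numbers are given.

```sparql
SELECT (COUNT(*) AS ?structure_organism_pairs)
       (COUNT(DISTINCT ?structure) AS ?distinct_structures) WHERE {
  {
    # Inner query: one row per distinct structure-organism pair
    SELECT DISTINCT ?structure ?organism WHERE {
      VALUES ?taxon {
        wd:Q21754                               # Placeholder taxon: replace with your own item
      }
      ?organism (wdt:P171*) ?taxon.             # Include children taxa
      ?structure wdt:P233 ?structure_smiles;    # Keep only structures that have a SMILES
                 (p:P703/ps:P703) ?organism.    # Found in given taxon/taxa
    }
  }
}
```

The first count should be at least as large as the second, because a structure reported in several species contributes one pair per species.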

Hope this clarifies

ap1438 commented 2 years ago

Thank you

alrichardbollans commented 1 year ago

I'm trying to do something similar. Following your examples, when I run:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structureCAS ?structureINCHIKEY ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # You can remove the Qxxxxxx and hit Ctrl+space, type the first letters and it should autocomplete
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
                        wdt:P225 ?organism_name.  # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES
             (p:P703/ps:P703) ?organism.          # Found in given taxon/taxa

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

I get 20968 results. However, when I try to include CAS and InChIKey information with the following:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structureCAS ?structureINCHIKEY ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # You can remove the Qxxxxxx and hit Ctrl+space, type the first letters and it should autocomplete
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
                        wdt:P225 ?organism_name.  # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES
             (p:P703/ps:P703) ?organism;          # Found in given taxon/taxa
             wdt:P231 ?structureCAS;              # Get the CAS
             wdt:P235 ?structureINCHIKEY.         # Get the INCHIKEY

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

I only get 7967 results. I imagine this might be because the latter query doesn't return instances without a CAS ID or INCHIKEY. Is it possible to return all metabolites found in taxa and leave missing values for the properties as NaN?
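
One way to keep structures that lack a CAS number or an InChIKey is SPARQL's `OPTIONAL` clause: a pattern inside `OPTIONAL` binds its variable when the statement exists and leaves it unbound otherwise, instead of dropping the row. The following is a minimal sketch under that assumption. It reuses the taxon and properties from the two queries above, keeps the SMILES as a required field, and has not been run against the Wikidata Query Service, so no result count is claimed.

```sparql
SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structureCAS ?structureINCHIKEY ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                     # Same taxon as in the queries above
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
            wdt:P225 ?organism_name.              # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES (still required)
             (p:P703/ps:P703) ?organism.          # Found in given taxon/taxa
  OPTIONAL { ?structure wdt:P231 ?structureCAS. }       # CAS number, left unbound if absent
  OPTIONAL { ?structure wdt:P235 ?structureINCHIKEY. }  # InChIKey, left unbound if absent

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000
```

Unbound variables come back as empty cells in the tabular download rather than as a literal NaN; a tool such as pandas would typically read those empty cells as NaN. The row count may also not match the first query exactly, since a structure with more than one CAS number yields one row per value.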