lotusnprod / lotus-web

Code for LOTUS web
https://lotus.naturalproducts.net/
MIT License

Returning all metabolites in a given clade, including possibly missing properties #61

Closed alrichardbollans closed 1 year ago

alrichardbollans commented 1 year ago

I'm trying to extract all metabolites in a plant order and include the CAS ID, InChIKey and SMILES information where given. When I run:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structureCAS ?structureINCHIKEY ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # You can remove the Qxxxxxx and hit Ctrl+space, type the first letters and it should autocomplete
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
                        wdt:P225 ?organism_name.  # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES
             (p:P703/ps:P703) ?organism.          # Found in given taxon/taxa

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

I get 20968 results. However, when I try to include CAS ID and InChIKey information with the following:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structureCAS ?structureINCHIKEY ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754                                    # You can remove the Qxxxxxx and hit Ctrl+space, type the first letters and it should autocomplete
  }
  ?organism (wdt:P171*) ?taxon;                   # Include children taxa
                        wdt:P225 ?organism_name.  # Get organism name
  ?structure wdt:P233 ?structure_smiles;          # Get the SMILES
             (p:P703/ps:P703) ?organism;          # Found in given taxon/taxa
             wdt:P231 ?structureCAS;          # Get the CAS
             wdt:P235 ?structureINCHIKEY.          # Get the INCHIKEY

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

I only get 7967 results. I imagine this might be because the latter query doesn't return instances without a CAS ID or INCHIKEY. Is it possible to return all metabolites found in taxa and leave missing values for the properties as NaN?

Originally posted by @alrichardbollans in https://github.com/lotusnprod/lotus-web/issues/27#issuecomment-1619999166

Adafede commented 1 year ago

Hi @alrichardbollans,

You are perfectly right. What you were missing is OPTIONAL, which allows a property to be absent without dropping the result.

Here is probably what you were looking for:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_cas ?structure_inchikey ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q21754
  }
  ?organism (wdt:P171*) ?taxon;
    wdt:P225 ?organism_name.
  ?structure wdt:P233 ?structure_smiles;
    (p:P703/ps:P703) ?organism.
  OPTIONAL {
    ?structure wdt:P231 ?structure_cas;
      wdt:P235 ?structure_inchikey.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

Hope this answers your question, happy to elaborate if not. 👍🏼
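On the "leave missing values as NaN" part of the question: in the standard SPARQL JSON results format, a variable left unbound by OPTIONAL is simply absent from that result's binding. A minimal Python sketch of filling those gaps explicitly follows. The function name `bindings_to_rows` and the sample response are hypothetical, and `None` stands in for NaN here.

```python
from typing import Any, Optional


def bindings_to_rows(result: dict[str, Any]) -> list[dict[str, Optional[str]]]:
    """Flatten SPARQL JSON results; unbound variables become None."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # A variable that OPTIONAL could not bind is missing from the binding.
        rows.append(
            {var: binding[var]["value"] if var in binding else None for var in variables}
        )
    return rows


# Hypothetical response trimmed to three variables; all values are placeholders.
sample = {
    "head": {"vars": ["structure", "structure_cas", "structure_inchikey"]},
    "results": {
        "bindings": [
            {
                "structure": {"type": "uri", "value": "http://example.org/structure/A"},
                "structure_cas": {"type": "literal", "value": "CAS-PLACEHOLDER"},
                "structure_inchikey": {"type": "literal", "value": "INCHIKEY-PLACEHOLDER"},
            },
            # Second structure has neither optional property.
            {"structure": {"type": "uri", "value": "http://example.org/structure/B"}},
        ]
    },
}

rows = bindings_to_rows(sample)
```

Every row then has the same keys, so the list can be handed to whatever tabular tool is in use without columns going missing.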

alrichardbollans commented 1 year ago

Aha this is great, thanks! Still getting my head around SPARQL so this is really handy. How would I also make the SMILES key optional?

Adafede commented 1 year ago

The issue you might face by making it optional is that you would end up with things that are not necessarily small molecules. You would then have to restrict the query to given instances at the beginning (like in https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry_Natural_products#What_was_already_there?), and I am not sure you would get many more results. You could also swap the roles of the current InChIKey and SMILES patterns if you want to try.

alrichardbollans commented 1 year ago

OK, thanks for this!

alrichardbollans commented 1 year ago

I've just noticed that the InChIKey isn't being returned for metabolites in some taxa, even though the InChIKey is given in LOTUS/Wikidata. For example, with the query:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_cas ?structure_inchikey ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q55925442
  }
  ?organism (wdt:P171*) ?taxon; # Include children taxa
    wdt:P225 ?organism_name. # Get organism name
  ?structure wdt:P233 ?structure_smiles; # Get the SMILES
    (p:P703/ps:P703) ?organism.   # Found in given taxon/taxa
  OPTIONAL {
    ?structure wdt:P231 ?structure_cas;
      wdt:P235 ?structure_inchikey.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

The structure wd:Q104888293 is returned but no value is provided for its structure_inchikey. Why is this?

Adafede commented 1 year ago

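Good catch! With both properties inside a single OPTIONAL block, the block only matches when the structure has both a CAS number and an InChIKey, so a missing CAS also hides the InChIKey.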

Something like

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_cas ?structure_inchikey ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q55925442
  }
  ?organism (wdt:P171*) ?taxon;
    wdt:P225 ?organism_name.
  ?structure wdt:P233 ?structure_smiles;
    (p:P703/ps:P703) ?organism.
  OPTIONAL { ?structure wdt:P235 ?structure_inchikey. }
  OPTIONAL { ?structure wdt:P231 ?structure_cas. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100000

should solve this. Do not hesitate to reopen if it does not.
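A toy Python model of how an OPTIONAL block matches may help here: the block binds all of its variables or none of them. This is only a sketch of the semantics, not how the query service is implemented. The helper `optional_group` and the property values are hypothetical, and the model ignores multi-valued properties. It assumes a structure that has an InChIKey (P235) but no CAS number (P231), which is presumably the situation for the structure mentioned above.

```python
from typing import Optional

# Hypothetical structure: InChIKey (P235) present, CAS number (P231) absent.
properties = {"P235": "INCHIKEY-PLACEHOLDER"}


def optional_group(props: dict[str, str], wanted: list[str]) -> dict[str, Optional[str]]:
    """Mimic one OPTIONAL { ... } block: bind every wanted property or none."""
    if all(p in props for p in wanted):
        return {p: props[p] for p in wanted}
    return {p: None for p in wanted}


# One OPTIONAL holding both patterns: the missing CAS blocks the InChIKey too.
grouped = optional_group(properties, ["P231", "P235"])

# Two independent OPTIONAL blocks: each property binds on its own.
separate = {
    **optional_group(properties, ["P231"]),
    **optional_group(properties, ["P235"]),
}
```

In this model `grouped` leaves both values empty, while `separate` keeps the InChIKey and leaves only the CAS empty.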

alrichardbollans commented 1 year ago

This is great! Is it possible to make the SMILES optional as well, or is this redundant? My attempt is:

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_cas ?structure_inchikey ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q55925442
  }
  ?organism (wdt:P171*) ?taxon;
    wdt:P225 ?organism_name.
  ?structure (p:P703/ps:P703) ?organism.
  OPTIONAL { ?structure wdt:P235 ?structure_inchikey. }
  OPTIONAL { ?structure wdt:P233 ?structure_smiles. }
  OPTIONAL { ?structure wdt:P231 ?structure_cas. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100

Adafede commented 1 year ago

I would not recommend it, but it is feasible. The problem is not the redundancy but rather having something you can trust. I would hardly trust an (almost) empty entry with no SMILES, no CAS, and no InChIKey.

alrichardbollans commented 1 year ago

OK thanks, this is good to know. My intention is to incorporate this into my data by matching on CAS, SMILES or InChIKey, so effectively those instances with none of these would be ignored. I guess ideally the query would return all those metabolites with at least one of CAS, SMILES or InChIKey.

Adafede commented 1 year ago

Something like

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_cas ?structure_inchikey ?organism ?organism_name WHERE {
  VALUES ?taxon {
    wd:Q55925442
  }
  ?organism (wdt:P171*) ?taxon;
    wdt:P225 ?organism_name.
  ?structure (p:P703/ps:P703) ?organism.
  OPTIONAL { ?structure wdt:P235 ?structure_inchikey. }
  OPTIONAL { ?structure wdt:P233 ?structure_smiles. }
  OPTIONAL { ?structure wdt:P231 ?structure_cas. }
  BIND (CONCAT(COALESCE(?structure_inchikey,""), COALESCE(?structure_smiles,""), COALESCE(?structure_cas,"")) AS ?key)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  FILTER (STRLEN(?key) > 0)  # Keep structures with at least one identifier (a one-character SMILES such as "C" still counts)
}
LIMIT 100000

should do the trick. I do not think there are any "fully empty" entries to test but anyway...
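For the matching step described earlier in the thread, the same "at least one identifier" rule can also be applied on the client side once the results are downloaded. The following is a minimal Python sketch under stated assumptions: the rows are dictionaries keyed by the query's variable names with `None` for missing values, and `has_identifier`, `matches` and the sample data are hypothetical.

```python
from typing import Optional

Row = dict[str, Optional[str]]
ID_FIELDS = ("structure_inchikey", "structure_smiles", "structure_cas")


def has_identifier(row: Row) -> bool:
    """True if the row carries at least one non-empty identifier."""
    return any(row.get(field) for field in ID_FIELDS)


def matches(row: Row, known: dict[str, set[str]]) -> bool:
    """True if any identifier of the row appears in the local data set."""
    return any(
        row.get(field) is not None and row[field] in known.get(field, set())
        for field in ID_FIELDS
    )


# Illustrative rows: A has only a SMILES, B has no identifier at all.
rows: list[Row] = [
    {"structure": "A", "structure_inchikey": None, "structure_smiles": "C", "structure_cas": None},
    {"structure": "B", "structure_inchikey": None, "structure_smiles": None, "structure_cas": None},
]
known = {"structure_smiles": {"C"}}  # identifiers already present in the local data

usable = [row for row in rows if has_identifier(row)]
matched = [row["structure"] for row in usable if matches(row, known)]
```

Plain string equality on SMILES only works when both sides use the same canonical form, so the InChIKey is usually the safer key to match on when it is available.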