example Disease Annotation in Uniprot in README.md not working

adeslatt commented 4 days ago

Hello, I am just learning and may not know the syntax, but the example

up:Disease_Annotation {
  a [ up:Disease_Annotation ] ;
  up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
  rdfs:comment xsd:string ;
  up:disease IRI
}

results in a malformed query when you try it on the SPARQL endpoint for UniProt.

I set up a JupyterLab notebook, and this worked very nicely:

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

# Set up the UniProt SPARQL endpoint
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# Define a query to fetch available Disease Annotation data
query_disease_annotations_simple = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?disease_annotation ?comment ?disease
WHERE {
  ?disease_annotation a up:Disease_Annotation ;
                      rdfs:comment ?comment ;
                      up:disease ?disease .
}
LIMIT 10
"""

# Execute the query and format the output in a DataFrame
sparql.setQuery(query_disease_annotations_simple)
sparql.setReturnFormat(JSON)

try:
    # Execute query and retrieve results
    results_disease_simple = sparql.query().convert()

    # Parse the results
    disease_data_simple = [
        {
            "Disease Annotation": result["disease_annotation"]["value"],
            "Comment": result["comment"]["value"],
            "Disease": result["disease"]["value"]
        }
        for result in results_disease_simple["results"]["bindings"]
    ]

    # Create a DataFrame to display the results
    df_disease_simple = pd.DataFrame(disease_data_simple)

    # Wrap text for 'Comment' column in Jupyter display
    df_disease_simple_styled = df_disease_simple.style.set_properties(
        **{'white-space': 'pre-wrap', 'text-align': 'left'}
    )

    display(df_disease_simple_styled)
except Exception as e:
    print(f"Error occurred: {e}")

This returns the following (organized with pandas):

Disease Annotation  Comment Disease
0   http://purl.uniprot.org/uniprot/Q9UDR5#SIP17D85FE178BE13B6  The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations.    http://purl.uniprot.org/diseases/1773
1   http://purl.uniprot.org/uniprot/Q9UDR5#SIP77BA87EDDA8559D2  The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.    http://purl.uniprot.org/diseases/4240
2   http://purl.uniprot.org/uniprot/Q9UGJ0#SIP473418E25D4D3A3B  The disease is caused by variants affecting the gene represented in this entry. http://purl.uniprot.org/diseases/1676
3   http://purl.uniprot.org/uniprot/Q9UGJ0#SIPBA4A3C214C09B2B7  The disease is caused by variants affecting the gene represented in this entry. http://purl.uniprot.org/diseases/245
4   http://purl.uniprot.org/uniprot/Q9UGJ0#SIPF5992DDE995A022F  The disease is caused by variants affecting the gene represented in this entry. http://purl.uniprot.org/diseases/1150
5   http://purl.uniprot.org/uniprot/P00519#SIP961ECAA35D2F0134  The disease is caused by variants affecting the gene represented in this entry. http://purl.uniprot.org/diseases/5064
6   http://purl.uniprot.org/uniprot/P00519#SIPDFB66D0B5174D549  The gene represented in this entry is involved in disease pathogenesis. http://purl.uniprot.org/diseases/3735
7   http://purl.uniprot.org/uniprot/Q13085#SIPE73D1EB0068562AA  The disease is caused by variants affecting the gene represented in this entry. http://purl.uniprot.org/diseases/1164
8   http://purl.uniprot.org/uniprot/Q6UWZ7#SIP86B515DA1B7AD8CF  Disease susceptibility is associated with variants affecting the gene represented in this entry.    http://purl.uniprot.org/diseases/2602
9   http://purl.uniprot.org/uniprot/A8K2U0#SIPCE73AF232236B8B1  Disease susceptibility is associated with variants affecting the gene represented in this entry.    http://purl.uniprot.org/diseases/5294

I then went a bit further and post-processed the comments with the regular expression package (re), using this:


import pandas as pd
import re

# Assumes df_disease_simple from the previous cell already exists

# Define regex patterns for variants and genes
variant_pattern = r"\bvariant\s\w+\b|\bmutation\b|\bpolymorphism\b"  # Adjust patterns as needed
gene_pattern = r"\b[A-Z0-9]{2,}\b"  # Basic pattern for gene identifiers, e.g., BRCA1, TP53

# Extract details for each disease annotation
extracted_info = []
for _, row in df_disease_simple.iterrows():
    disease_id = row["Disease"]
    comment = row["Comment"]

    # Find all variants and gene mentions
    variants = re.findall(variant_pattern, comment, flags=re.IGNORECASE)
    genes = re.findall(gene_pattern, comment)

    # Store results in a structured format
    extracted_info.append({
        "Disease": disease_id,
        "Comment": comment,
        "Variants": variants,
        "Genes": genes
    })

# Convert to DataFrame
df_extracted_info = pd.DataFrame(extracted_info)

# Apply wrapping style to comment for readability
df_extracted_info_styled = df_extracted_info.style.set_properties(
    **{'white-space': 'pre-wrap', 'text-align': 'left'}
)

# Display the wrapped DataFrame in Jupyter
display(df_extracted_info_styled)

Disease	Comment	Variants	Genes
http://purl.uniprot.org/diseases/1773	The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations.	[]	['AASS']
http://purl.uniprot.org/diseases/4240	The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.	[]	['NADP', 'NADK2', 'NADPH', 'DECR1', 'AASS']
http://purl.uniprot.org/diseases/1676	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/245	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/1150	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/5064	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/3735	The gene represented in this entry is involved in disease pathogenesis.	[]	[]
http://purl.uniprot.org/diseases/1164	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/2602	Disease susceptibility is associated with variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/5294	Disease susceptibility is associated with variants affecting the gene represented in this entry.	[]	[]
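Note that the Variants column above is empty for every row, even though the comments contain "variants" and "mutations". The pattern `\bvariant\s\w+\b` needs whitespace directly after "variant", and `\bmutation\b` needs a word boundary directly after "mutation", so the plural forms never match. A minimal sketch of a plural-tolerant pattern follows; the names `VARIANT_TERMS` and `find_variant_terms` are made up for this illustration and are not part of the notebook above. The gene pattern has a separate limitation that this sketch does not address: it matches any run of capitals and digits, so cofactors such as NADP and NADPH are picked up as if they were genes.

```python
import re

# Hypothetical replacement for variant_pattern: the optional "s" lets the
# plural forms used in the UniProt comments match as well.
VARIANT_TERMS = r"\b(?:variants?|mutations?|polymorphisms?)\b"


def find_variant_terms(comment):
    """Return every variant-related keyword found in a comment string."""
    return re.findall(VARIANT_TERMS, comment, flags=re.IGNORECASE)


comment_a = (
    "The disease is caused by variants affecting the gene "
    "represented in this entry."
)
comment_b = (
    "A selective decrease in mitochondrial NADP(H) levels due to "
    "NADK2 mutations causes a deficiency of NADPH-dependent "
    "mitochondrial enzymes, such as DECR1 and AASS."
)

print(find_variant_terms(comment_a))
print(find_variant_terms(comment_b))
```

This only detects that a keyword is present; it does not extract which variant is meant, since these comments do not name specific variants.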



Hope this helps.

vemonet commented 2 days ago

Hi @adeslatt, sorry, I am not sure I understood your issue :)

Are you trying to run this as a SPARQL query?

up:Disease_Annotation {
  a [ up:Disease_Annotation ] ;
  up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
  rdfs:comment xsd:string ;
  up:disease IRI
}

If so, it is expected that it does not work as is: this is not a SPARQL query but a ShEx "Shape Expression", which is basically a schema for RDF data. See https://shex.io for more details.

We use it to pass the endpoint schema to the LLM.

adeslatt commented 2 days ago

Hi @vemonet, thank you so much! Yes, I thought it was a SPARQL query. I was not familiar with Shape Expressions, so thank you for the reference. Can this work on the RDF as a file itself? I just exported from a database and made Turtle files; we are working with a non-SPARQL graph database (an ArangoDB instance).

vemonet commented 2 days ago

ShEx expressions are usually used to describe the schema of RDF data and to validate RDF data against it (here we just use them to communicate the schema of the different classes in our knowledge graph to the LLM, so it knows which predicates can be used with which classes). In my opinion ShEx is a bit harder to use than SPARQL (because the libraries are less mature), so I would just load the RDF you have into a store and run SPARQL queries.

If you just want to run queries on RDF data, you could load it with RDFLib and then run SPARQL queries on it: https://rdflib.readthedocs.io/en/stable/

sib-swiss / sparql-llm
