ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal

AG: SPARQL::Client::MalformedQuery: MALFORMED QUERY: Line 1, Found '<'. #251

Open alexskr opened 1 year ago

alexskr commented 1 year ago

Parsing fails with the AllegroGraph backend for HOOM, ELD, MCCL, ANC and other private ontologies with the following error:

E, [2022-07-26T14:00:29.114959 #20327] ERROR -- : Failed, exception: SPARQL::Client::MalformedQuery: MALFORMED QUERY: Line 1, Found '<'. Was expecting one of: ABS, AVG, BNODE, BOUND, CEIL, COALESCE, CONCAT, CONTAINS, COUNT, DATATYPE, DAY, DECIMAL, DOUBLE, ENCODE_FOR_URI, EXISTS, FALSE, FLOOR, GROUP_CONCAT, HOURS, IF, INTEGER, IRI, ISBLANK, ISIRI, ISLITERAL, ISNUMERIC, ISTRIPLE, ISURI, LANG, LANGMATCHES, LCASE, MAX, MD5, MIN, MINUTES, MONTH, NOT, NOW, NUMERIC-PLUS, Q_IRI_REF, QNAME, QNAME_NS, RAND, REGEX, REPLACE, ROUND, SAMETERM, SAMPLE, SECONDS, SHA1, SHA256, SHA384, SHA512, STR, STRAFTER, STRBEFORE, STRDT, STRENDS, STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1, STRING_LITERAL_LONG2, STRLANG, STRLEN, STRSTARTS, STRUUID, SUBSTR, SUM, TIMEZONE, TRUE, TZ, UCASE, URI, UUID, VARNAME, YEAR or punctuation '!', '(', '+', '-', '<<'.
alexskr commented 1 year ago

We are seeing a similar error for the following API calls:

  1. /ontologies/M4M-21-VARIABLES/classes/http%3A%2F%2Fpurl.org%2Fm4m21%2Fvariables%2F1006/mappings

Stack Trace:

SPARQL::Client::MalformedQuery: MALFORMED QUERY: Line 5, Found '<'. Was expecting one of: BIND, BLANK_NODE_LABEL, DECIMAL, DOUBLE, FALSE, FILTER, GRAPH, INTEGER, MINUS, NIL-SYMBOL, OPTIONAL, Q_IRI_REF, QNAME, QNAME_NS, SELECT, SERVICE, STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1, STRING_LITERAL_LONG2, TEXTINDEX, TRUE, VALUES, VARNAME or punctuation '(', '+', '-', '<<', '[', '[]', '{', '}'.
…ases/20220811020542/controllers/mappings_controller.rb:   13:in `block in <class
<truncated 73 additional frames>
/srv/ncbo/ontologies_api/shared/bundle/ruby/2.7.0/bin/unicorn:23:in `load'
/srv/ncbo/ontologies_api/shared/bundle/ruby/2.7.0/bin/unicorn:23:in `<top (required)>'
/usr/local/rbenv/versions/2.7.6/bin/bundle:23:in `load'
/usr/local/rbenv/versions/2.7.6/bin/bundle:23:in `<main>'

  2. /ontologies/DDIEM/classes/http%3A%2F%2Fgroups.google.com%2Fgroup%2Fogms-discuss%2Fbrowse_thread%2Fthread%2Fca0ad373f27774c5%0A%0AOGMS%20call%20adoption-%2016%20SEPT%202015%0Ahttps%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1iiV1-fTS7BUUSzDw3N_Afx42698YWf54-

Stack Trace:

SPARQL::Client::MalformedQuery: MALFORMED QUERY: Line 1, Found '<'. Was expecting one of: ABS, AVG, BNODE, BOUND, CEIL, COALESCE, CONCAT, CONTAINS, COUNT, DATATYPE, DAY, DECIMAL, DOUBLE, ENCODE_FOR_URI, EXISTS, FALSE, FLOOR, GROUP_CONCAT, HOURS, IF, INTEGER, IRI, ISBLANK, ISIRI, ISLITERAL, ISNUMERIC, ISTRIPLE, ISURI, LANG, LANGMATCHES, LCASE, MAX, MD5, MIN, MINUTES, MONTH, NOT, NOW, NUMERIC-PLUS, Q_IRI_REF, QNAME, QNAME_NS, RAND, REGEX, REPLACE, ROUND, SAMETERM, SAMPLE, SECONDS, SHA1, SHA256, SHA384, SHA512, STR, STRAFTER, STRBEFORE, STRDT, STRENDS, STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1, STRING_LITERAL_LONG2, STRLANG, STRLEN, STRSTARTS, STRUUID, SUBSTR, SUM, TIMEZONE, TRUE, TZ, UCASE, URI, UUID, VARNAME, YEAR or punctuation '!', '(', '+', '-', '<<'.
…_api/releases/20220811020542/helpers/classes_helper.rb:   59:in `get_class'
…eases/20220811020542/controllers/classes_controller.rb:   78:in `block(2 levels) in <class
<truncated 80 additional frames>
/srv/ncbo/ontologies_api/shared/bundle/ruby/2.7.0/bin/unicorn:23:in `load'
/srv/ncbo/ontologies_api/shared/bundle/ruby/2.7.0/bin/unicorn:23:in `<top (required)>'
/usr/local/rbenv/versions/2.7.6/bin/bundle:23:in `load'
/usr/local/rbenv/versions/2.7.6/bin/bundle:23:in `<main>'

  3. /ontologies/NCOD/properties

Stack Trace:

SPARQL::Client::MalformedQuery: MALFORMED QUERY: Line 3, Found '<'. Was expecting one of: BLANK_NODE_LABEL, DECIMAL, DOUBLE, FALSE, INTEGER, NIL-SYMBOL, PATH-PLUS, Q_IRI_REF, QNAME, QNAME_NS, STRING_LITERAL1, STRING_LITERAL2, STRING_LITERAL_LONG1, STRING_LITERAL_LONG2, TRUE, VARNAME or punctuation '(', ')', '*', '+', '-', '/', '<<', '?', '[', '[]', '{', '|'.
…es/20220811020542/controllers/properties_controller.rb:   10:in `block(2 levels) in <class
<truncated 77 additional frames>
/srv/ncbo/ontologies_api/shared/bundle/ruby/2.7.0/bin/unicorn:23:in `load'
/srv/ncbo/ontologies_api/shared/bundle/ruby/2.7.0/bin/unicorn:23:in `<top (required)>'
/usr/local/rbenv/versions/2.7.6/bin/bundle:23:in `load'
/usr/local/rbenv/versions/2.7.6/bin/bundle:23:in `<main>'

mdorf commented 1 year ago

I am investigating this issue. The following query, indeed, appears to be incorrectly composed:

SELECT DISTINCT ?s2 ?g ?source ?o
WHERE {
    {
      GRAPH <http://data.bioontology.org/ontologies/M4M-21-VARIABLES/submissions/5> {
          <http://purl.org/m4m21/?s2 ?g ?source ?o/1006> <http://bioportal.bioontology.org/ontologies/umls/cui> ?o .
      }
      GRAPH ?g {
          ?s2 <http://bioportal.bioontology.org/ontologies/umls/cui> ?o .
      }
      BIND ('CUI' AS ?source)
    }
    UNION
    {
      GRAPH <http://data.bioontology.org/ontologies/M4M-21-VARIABLES/submissions/5> {
          <http://purl.org/m4m21/?s2 ?g ?source ?o/1006> <http://data.bioontology.org/metadata/def/mappingSameURI> ?o .
      }
      GRAPH ?g {
          ?s2 <http://data.bioontology.org/metadata/def/mappingSameURI> ?o .
      }
      BIND ('SAME_URI' AS ?source)
    }
    UNION
    {
      GRAPH <http://data.bioontology.org/ontologies/M4M-21-VARIABLES/submissions/5> {
          <http://purl.org/m4m21/?s2 ?g ?source ?o/1006> <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
      }
      GRAPH ?g {
          ?s2 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
      }
      BIND ('LOOM' AS ?source)
    }
    UNION
    {
      GRAPH <http://data.bioontology.org/ontologies/M4M-21-VARIABLES/submissions/5> {
          <http://purl.org/m4m21/?s2 ?g ?source ?o/1006> <http://data.bioontology.org/metadata/def/mappingRest> ?o .
      }
      GRAPH ?g {
          ?s2 <http://data.bioontology.org/metadata/def/mappingRest> ?o .
      }
      BIND ('REST' AS ?source)
    }
    FILTER (!STRSTARTS(str(?g),'http://data.bioontology.org/ontologies/M4M-21-VARIABLES'))
} 

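You can see that lines such as <http://purl.org/m4m21/?s2 ?g ?source ?o/1006> <http://data.bioontology.org/metadata/def/mappingRest> ?o . contain the query variables (?s2 ?g ?source ?o) inside the <> brackets, in place of the "variables" segment of the class ID http://purl.org/m4m21/variables/1006, which should not be the case.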
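
One way a query like this could arise (an assumption, not something this thread confirms) is a text template whose placeholders are filled by successive global string replacements. The requested class ID, http://purl.org/m4m21/variables/1006, contains the word "variables"; if that word also serves as a placeholder, a later replacement rewrites the IRI as well. A minimal Ruby sketch, with a hypothetical template and hypothetical placeholder names:

```ruby
# Hypothetical templates and placeholder names, for illustration only;
# they are not taken from the BioPortal code base.
NAIVE_TEMPLATE = "SELECT DISTINCT variables WHERE { <class_id> <predicate> ?o . }"
SAFE_TEMPLATE  = "SELECT DISTINCT %{variables} WHERE { <%{class_id}> <%{predicate}> ?o . }"

VARIABLES = "?s2 ?g ?source ?o"
PREDICATE = "http://data.bioontology.org/metadata/def/mappingRest"

# Successive replacements: the final gsub also rewrites the word
# "variables" inside the class IRI that the first sub inserted.
def naive_fill(class_id)
  NAIVE_TEMPLATE
    .sub("class_id", class_id)
    .sub("predicate", PREDICATE)
    .gsub("variables", VARIABLES)
end

# Single pass over distinct %{...} tokens: substituted text is never rescanned.
def safe_fill(class_id)
  values = { "variables" => VARIABLES, "class_id" => class_id, "predicate" => PREDICATE }
  SAFE_TEMPLATE.gsub(/%\{(\w+)\}/) { values.fetch(Regexp.last_match(1)) }
end

class_id = "http://purl.org/m4m21/variables/1006"
puts naive_fill(class_id) # IRI is corrupted by the "variables" replacement
puts safe_fill(class_id)  # IRI is left intact
```

If the real cause is of this kind, placeholders that cannot occur in user data, or a proper query builder, would avoid the collision.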

jonquet commented 1 year ago

cc @syphax-bouazzouni as we were looking at this query recently to "improve" mappings gathering in AgroPortal.

mdorf commented 1 year ago

The failing query for /ontologies/NCOD/properties:

SELECT ?c WHERE {
    GRAPH <http://data.bioontology.org/ontologies/NCOD/submissions/1> {
      ?c <http://www.w3.org/2000/01/rdf-schema#subPropertyOf> <http://www.geneontology.org/formats/oboInOwl\#@prefix dcat> . 
    }
}
LIMIT 1
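
The line number in a MalformedQuery message can be matched against the generated query to find the offending term; here "Line 3" points at the triple pattern containing the IRI with a space. A small sketch, assuming the "MALFORMED QUERY: Line N" message format shown in the stack traces above. `offending_line` is a hypothetical helper, not part of sparql-client:

```ruby
# Hypothetical helper: returns the query line that a
# "MALFORMED QUERY: Line N, ..." message points at, or nil.
def offending_line(error_message, query)
  match = error_message.match(/MALFORMED QUERY: Line (\d+)/)
  return nil unless match

  line_number = match[1].to_i
  return nil if line_number < 1

  line = query.lines[line_number - 1]
  line && line.strip
end

# Single-quoted heredoc so that "#@prefix" is not treated as interpolation.
query = <<~'SPARQL'
  SELECT ?c WHERE {
      GRAPH <http://data.bioontology.org/ontologies/NCOD/submissions/1> {
        ?c <http://www.w3.org/2000/01/rdf-schema#subPropertyOf> <http://www.geneontology.org/formats/oboInOwl#@prefix dcat> .
      }
  }
  LIMIT 1
SPARQL

message = "SPARQL::Client::MalformedQuery: MALFORMED QUERY: Line 3, Found '<'."
puts offending_line(message, query)
```
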
mdorf commented 1 year ago

This error appears to be thrown by AllegroGraph much more frequently than by 4store. The main cause is the presence of illegal characters in the ClassID.

I was able to identify a number of places in our code where replacing characters such as " ", "<" or ">" in the ClassID with their URL-encoded counterparts addresses the issue. However, there are multiple other cases where the constructed query is malformed due to special characters present in the ClassID. For example, ELD parsing fails due to this query:

SELECT DISTINCT ?id ?prefLabel ?synonym ?label
FROM <http://data.bioontology.org/ontologies/ELD/submissions/5>
WHERE { 
    ?id a <http://www.w3.org/2004/02/skos/core#Concept> . 
    OPTIONAL { 
        ?id ?rewrite0 ?prefLabel . 
        FILTER(?rewrite0 = <http://data.bioontology.org/metadata/def/prefLabel> || ?rewrite0 = <http://www.w3.org/2004/02/skos/core#prefLabel>)
    }
    OPTIONAL { 
        ?id <http://www.w3.org/2004/02/skos/core#altLabel> ?synonym . 
    } 
    OPTIONAL { 
        ?id <http://www.w3.org/2000/01/rdf-schema#label> ?label .  
    } 
    FILTER(?id = <https://github.com/VODANA/Controlled-vocabularyseverepneumonia(Otherpneumonia,organismunspecified)> || 
           ?id = <https://github.com/VODANA/Controlled-vocabularycordprolapse(Labouranddeliverycomplicatedbyprolapseofcord)> || 
           ?id = <https://github.com/VODANA/Controlled-vocabularycommoncold(Acutenasopharyngitis[commoncold])>)
}

This query fails because of the "[" and "]" characters present in the last ID, which are reserved SPARQL characters.
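
The URL-encoding approach described above can be sketched as follows. The character set is an assumption: it covers the characters that the SPARQL 1.1 grammar excludes from <...> IRI references, plus "[" and "]", which are included only because this thread reports them as failing. `escape_iri` is a hypothetical helper, not an existing BioPortal function:

```ruby
# Characters excluded from <...> IRI references by the SPARQL 1.1 grammar
# (control characters and space, < > " { } | ^ ` \), plus "[" and "]",
# which are added here only because this thread reports them as failing.
UNSAFE_IRI_CHARS = /[\x00-\x20<>"{}|^`\\\[\]]/

# Hypothetical helper: percent-encodes each unsafe character.
def escape_iri(iri)
  iri.gsub(UNSAFE_IRI_CHARS) { |char| format("%%%02X", char.ord) }
end

puts escape_iri('http://www.geneontology.org/formats/oboInOwl#@prefix dcat')
puts escape_iri('https://github.com/VODANA/Controlled-vocabularycommoncold(Acutenasopharyngitis[commoncold])')
```

An escaped IRI is a different string from the original, so a query built this way would only match stored triples if the IDs were encoded the same way when the ontology was loaded.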

mdorf commented 1 year ago

This endpoint call fails due to a space inside the last ClassID:

/ontologies/NCOD/properties

Here is the resulting SPARQL query:

SELECT ?c WHERE {
    GRAPH <http://data.bioontology.org/ontologies/NCOD/submissions/1> {
        ?c <http://www.w3.org/2000/01/rdf-schema#subPropertyOf> 
                  <http://www.geneontology.org/formats/oboInOwl#@prefix dcat> .
    }
}
LIMIT 1

mdorf commented 1 year ago

This is the query that fails during MCCL ontology parsing due to the "[" and "]" characters present in an ID inside the last FILTER clause:

SELECT DISTINCT ?id ?prefLabel ?synonym ?label 
FROM <http://data.bioontology.org/ontologies/MCCL/submissions/2> 
WHERE { 
    ?id a <http://www.w3.org/2002/07/owl#Class> . 
    OPTIONAL { 
        ?id ?rewrite0 ?prefLabel . 
        FILTER(?rewrite0 = <http://data.bioontology.org/metadata/def/prefLabel> || ?rewrite0 = <http://www.w3.org/2004/02/skos/core#prefLabel>) 
    } OPTIONAL { 
        ?id ?rewrite1 ?synonym . 
        FILTER(?rewrite1 = <http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym> || ?rewrite1 = <http://www.geneontology.org/formats/oboInOwl#hasNarrowSynonym> || ?rewrite1 = <http://www.geneontology.org/formats/oboInOwl#hasBroadSynonym> || ?rewrite1 = <http://purl.obolibrary.org/obo/synonym> || ?rewrite1 = <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> || ?rewrite1 = <http://www.w3.org/2004/02/skos/core#altLabel>) 
    } OPTIONAL { 
        ?id <http://www.w3.org/2000/01/rdf-schema#label> ?label .  
    } 
    FILTER(?id = <http://www.w3.org/2002/07/owl#Thing> || ?id = <http://www.semanticweb.org/pallabi.d/ontologies/2014/2/untitled-ontology-11#ZNRF4-Arg149*>
           || ?id = <http://www.semanticweb.org/pallabi.d/ontologies/2014/2/untitled-ontology-11#KCNH8-Glu143*> 
           || ?id = <http://www.semanticweb.org/pallabi.d/ontologies/2014/2/untitled-ontology-11#KB-CH[R]-8-5Cell>) 
}

mdorf commented 1 year ago

The exact list of affected ontologies: OPTUM, AUTISM, ELD, MCCL, ETH_ANC, ANC, HOOM.

GAZ and DRON also report MalformedQuery errors, but with a different message:

SPARQL::Client::MalformedQuery: QUERY FAILED: Not CaaT state: nil within set #<db.agraph.sbqe::bindings-set 1[3] ?id 0(0) solutions @ #x100ce92bcf2>

mdorf commented 1 year ago

I was able to identify the SPARQL query that causes the "QUERY FAILED: Not CaaT state: nil..." error:

SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/DRON/submissions/14>
WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 500000 LIMIT 2500

I experimented with the offsets and found the following pattern:

Executes fine:

SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/DRON/submissions/14>
WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 499000 LIMIT 2500

Fails:

SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/DRON/submissions/14>
WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 499001 LIMIT 2500

Executes fine:

SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/DRON/submissions/14>
WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 499001 LIMIT 999

Fails:

SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/DRON/submissions/14>
WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 499001 LIMIT 1000

Running a COUNT query on the graph yields these results:

SELECT (COUNT(*) as ?Triples) 
WHERE { GRAPH <http://data.bioontology.org/ontologies/DRON/submissions/14> { ?s ?p ?o } }
Triples
"4870128"

and distinct:

SELECT (COUNT(DISTINCT *) as ?Triples) 
WHERE { GRAPH <http://data.bioontology.org/ontologies/DRON/submissions/14> { ?s ?p ?o } }
Triples
"4848151"

It looks like the graph contains close to 5 million triples, so the offset of 500K technically should work fine, but it doesn’t.
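
The triple count is only an upper bound on what the failing query pages over, which is the set of distinct owl:Class subjects. A sketch of building a direct class-count query and the page offsets to compare against it; the helper names are hypothetical, and the page size of 2500 is taken from the failing query:

```ruby
GRAPH     = 'http://data.bioontology.org/ontologies/DRON/submissions/14'
OWL_CLASS = 'http://www.w3.org/2002/07/owl#Class'

# Counts the distinct class IDs that the paged query iterates over.
def class_count_query(graph)
  "SELECT (COUNT(DISTINCT ?id) AS ?classes) FROM <#{graph}> " \
    "WHERE { ?id a <#{OWL_CLASS}> . }"
end

# Same shape as the paged query quoted in this comment.
def paged_class_query(graph, offset, limit)
  "SELECT DISTINCT ?id FROM <#{graph}> " \
    "WHERE { ?id a <#{OWL_CLASS}> . } OFFSET #{offset} LIMIT #{limit}"
end

# Offsets needed to cover `total` rows in pages of `page_size`.
def page_offsets(total, page_size)
  (0...total).step(page_size).to_a
end

puts class_count_query(GRAPH)
puts paged_class_query(GRAPH, 500_000, 2500)
```

If the class count turns out to be below 500000, an offset of 500000 is past the end of the result set, where a well-behaved endpoint would return an empty page rather than an error.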

mdorf commented 1 year ago

The issue below has been resolved by deploying a patch from AllegroGraph (bug26872-v7.3.0.fasl.patch). The fix is included in future AllegroGraph releases, so the patch won't need to be maintained beyond this version.

SPARQL::Client::MalformedQuery: QUERY FAILED: Not CaaT state: nil within set #<db.agraph.sbqe::bindings-set 1[3] ?id 0(0) solutions @ #x100ce92bcf2>