sparna-git / Sparnatural

Sparnatural : visual SPARQL query builder for knowledge graphs in the browser, configurable with SHACL
http://sparnatural.eu
GNU Lesser General Public License v3.0

Please support Jena Full Text Search in AutoCompleteWidget #447

Open elasticjava opened 1 year ago

elasticjava commented 1 year ago

Please support Jena Full Text Search in AutoCompleteWidget.

There are a lot of sparql resources implemented with Apache Jena out there.

Apache Jena has an extension to ARQ called Jena Full Text Search, which combines SPARQL and full-text search via Lucene. It lets applications perform indexed full-text searches within SPARQL queries, which is much faster than matching text with regular SPARQL.

See: https://jena.apache.org/documentation/query/text-query.html

I could not work out how to express the following queries in the TypeScript syntax of sparqljs, so I was unable to implement it myself.

How do I map parentheses within the subject?

Example query for Autocomplete-Search:

PREFIX owl:    <http://www.w3.org/2002/07/owl#> 
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX xml:    <http://www.w3.org/XML/1998/namespace> 
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#> 
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos:   <http://www.w3.org/2004/02/skos/core#> 
PREFIX text:   <http://jena.apache.org/text#> 
PREFIX config-datasources: <http://data.sparna.fr/ontologies/sparnatural-config-datasources#> 
PREFIX config-core: <http://data.sparna.fr/ontologies/sparnatural-config-core#> 
SELECT DISTINCT ?uri ?label ?blankType
        WHERE {
            {
                SELECT DISTINCT ?uri ?score WHERE {
                 (?uri ?score) text:query (rdfs:label "Bayer") .
                    ?uri a/rdfs:subClassOf* <https://schema.coypu.org/global#Company>  .
                }
                ORDER BY DESC(?score)
                LIMIT 100 OFFSET 0
            }
            OPTIONAL {?uri rdf:type ?foundClass}
                BIND (coalesce(?foundClass, owl:Thing) as ?class)
                OPTIONAL {?uri rdfs:label|skos:prefLabel ?label}
        } ORDER BY DESC(?score)

Once I have found and selected an entry, I'd like to filter by its URI rather than by the selected property (such as the label), like this:

PREFIX owl:    <http://www.w3.org/2002/07/owl#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos:   <http://www.w3.org/2004/02/skos/core#>
 SELECT ?uri ?label
 WHERE {
     ?uri rdf:type <https://schema.coypu.org/global#Company>;
     FILTER (?uri = <https://data.coypu.org/company/gleif/52990063JAR3WYTRS232>)
      OPTIONAL {?uri rdf:type ?foundClass}
      BIND (coalesce(?foundClass, owl:Thing) as ?class)
      OPTIONAL {?uri rdfs:label|skos:prefLabel ?label}
 }

tfrancart commented 1 year ago

We acknowledge the use case, but we don't have that need at the moment, so we will not spend effort on this.

We do have a specific SPARQL generation code for GraphDB full-text syntax : https://github.com/sparna-git/Sparnatural/blob/ba13bd91181cb1dd97c3bac76330dae35413f82a/src/sparnatural/components/widgets/SearchRegexWidget.ts#L101

I assume that a Jena-specific search would be implemented in a similar way, if you want to give it a try and submit a PR. What you can do is: first parse your SPARQL query using sparqljs and locate the section you need (the one with text:query, etc.). Then look at how the JSON is constructed. Finally, build that same JSON structure in the SPARQL generation code.
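On the question of how to map the parentheses: in SPARQL, a parenthesised list such as (?uri ?score) is a collection, which stands for a chain of rdf:first / rdf:rest triples ending in rdf:nil. The generated query later in this thread shows exactly that chain. The following is a minimal sketch of the expansion, not Sparnatural or sparqljs code: the Term and Triple types, the expandList helper and the blank node labels are hypothetical stand-ins for the rdf-js terms used in the real code base.

```typescript
// Hypothetical minimal term and triple types, standing in for rdf-js terms.
type Term = {
  termType: "NamedNode" | "BlankNode" | "Variable" | "Literal";
  value: string;
};
interface Triple { subject: Term; predicate: Term; object: Term; }
interface RdfList { head: Term; triples: Triple[]; }

const RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
const named = (value: string): Term => ({ termType: "NamedNode", value });
const blank = (value: string): Term => ({ termType: "BlankNode", value });
const variable = (value: string): Term => ({ termType: "Variable", value });
const literal = (value: string): Term => ({ termType: "Literal", value });

// Expands a SPARQL collection "(a b c)" into its rdf:first / rdf:rest chain.
// Returns the head node (usable as subject or object) plus the list triples.
function expandList(items: Term[], prefix: string): RdfList {
  // Walk from the last item so each node can point at the already-built rest.
  return items.reduceRight<RdfList>(
    (acc, item, i) => {
      const node = blank(`${prefix}_${i}`);
      return {
        head: node,
        triples: [
          { subject: node, predicate: named(RDF + "first"), object: item },
          { subject: node, predicate: named(RDF + "rest"), object: acc.head },
          ...acc.triples,
        ],
      };
    },
    { head: named(RDF + "nil"), triples: [] }
  );
}

// (?uri ?score) text:query (rdfs:label "Bayer")
const subjectList = expandList([variable("uri"), variable("score")], "s");
const objectList = expandList(
  [named("http://www.w3.org/2000/01/rdf-schema#label"), literal("Bayer")],
  "o"
);
const textQueryTriples: Triple[] = [
  {
    subject: subjectList.head,
    predicate: named("http://jena.apache.org/text#query"),
    object: objectList.head,
  },
  ...subjectList.triples,
  ...objectList.triples,
];
```

The heads of the two lists become the subject and object of the text:query triple, and the remaining triples describe the list members. In the real widget the same structure would be built with DataFactory and SparqlFactory.buildBgpPattern.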

SteinerPascal commented 1 year ago

@elasticjava did it work? Otherwise I could try to build an example.

elasticjava commented 1 year ago

Thank you very much for the quick feedback! I am currently working my way into the code. I modified the Config.GRAPHDB_SEARCH_PROPERTY case in place to see the impact of my changes. How to create a completely new JenaTextSearchProperty, extend the ontology and get this ListHandler connected is something I am leaving aside for now.

tfrancart commented 1 year ago

How to create a completely new JenaTextSearchProperty, extend the ontology and get this ListHandler connected is something I am leaving aside for now.

Good approach! If you get something working, then I can certainly do that part.

Regarding the connection of handlers, I assume you mean getting the AutocompleteHandler connected, since that is where full-text search can happen (and not in the list). For this you can already provide your own SPARQL queries to populate the autocomplete handlers. This is called the "datasource" mechanism and is documented at http://docs.sparnatural.eu/OWL-based-configuration-datasources; see in particular http://docs.sparnatural.eu/OWL-based-configuration-datasources#your-own-sparql-query-lists--autocomplete

elasticjava commented 1 year ago

I got it working, BUT since Apache Jena Text is a plugin, the arrangement of the triples matters. The issue is that Jena's optimizer doesn't know about the text index, precisely because it is just a plugin. So we need to decide manually whether it's better to query the text index first or to select resources first, e.g. by type.

At present it generates the following SPARQL:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?Company_1 ?Company_1_label WHERE {
  ?Company_1 rdf:type <https://schema.coypu.org/global#Company>;
    rdfs:label ?Company_1_label, ?Text_2.
  {
    SELECT DISTINCT ?Company_1 ?score WHERE {
      _:e_g_0 <http://jena.apache.org/text#query> _:e_g_2.
      _:e_g_2 rdf:first rdfs:label;
        rdf:rest _:e_g_3.
      _:e_g_3 rdf:first "Bayer*";
        rdf:rest rdf:nil.
      _:e_g_0 rdf:first ?Company_1;
        rdf:rest _:e_g_1.
      _:e_g_1 rdf:first ?score;
        rdf:rest rdf:nil.
      ?Company_1 (rdf:type/(rdfs:subClassOf*)) <https://schema.coypu.org/global#Company>.
    }
    ORDER BY DESC (?score)
    LIMIT 100
  }
}
LIMIT 100

BUT that is way too slow, because all Companies are fetched first.

It should generate:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?Company_1 ?Company_1_label WHERE {
  {
    SELECT DISTINCT ?Company_1 ?score WHERE {
      _:e_g_0 <http://jena.apache.org/text#query> _:e_g_2.
      _:e_g_2 rdf:first rdfs:label;
        rdf:rest _:e_g_3.
      _:e_g_3 rdf:first "Bayer*";
        rdf:rest rdf:nil.
      _:e_g_0 rdf:first ?Company_1;
        rdf:rest _:e_g_1.
      _:e_g_1 rdf:first ?score;
        rdf:rest rdf:nil.
      ?Company_1 (rdf:type/(rdfs:subClassOf*)) <https://schema.coypu.org/global#Company>.
    }
    ORDER BY DESC (?score)
    LIMIT 100
  }
  ?Company_1 rdf:type <https://schema.coypu.org/global#Company>;
    rdfs:label ?Company_1_label, ?Text_2.
}
LIMIT 100

Can I rearrange the query after generation?
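To illustrate what such a rearrangement would amount to, here is a minimal sketch that moves subquery groups to the front of a WHERE clause. It is not part of Sparnatural: the Pattern interface is a hypothetical, simplified shape, and the assumption that a subquery appears as a "group" pattern wrapping a "query" pattern is taken from the modified getRdfJsPattern code posted in this thread.

```typescript
// Hypothetical, simplified pattern shape; real sparqljs patterns carry more fields.
interface Pattern {
  type: string;
  patterns?: Pattern[];
}

// True when a pattern is a group wrapping a subquery, i.e. "{ SELECT ... }".
function wrapsSubquery(p: Pattern): boolean {
  return (
    p.type === "group" &&
    (p.patterns ?? []).some((inner) => inner.type === "query")
  );
}

// Stable reorder: subquery groups first, all other patterns after them,
// each set keeping its original relative order.
function subqueriesFirst(where: Pattern[]): Pattern[] {
  return [
    ...where.filter(wrapsSubquery),
    ...where.filter((p) => !wrapsSubquery(p)),
  ];
}

// The currently generated shape: class/label triples, then the text subquery.
const generatedWhere: Pattern[] = [
  { type: "bgp" },
  { type: "group", patterns: [{ type: "query" }] },
];
const reorderedWhere = subqueriesFirst(generatedWhere);
```

This is only meant for a WHERE clause made of plain joins (basic graph patterns and subqueries). OPTIONAL, BIND and MINUS are position-sensitive in SPARQL, so a real implementation would have to leave those where they are.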

elasticjava commented 1 year ago

Bad practice, but for now here is the edited source of the modified SearchRegexWidget.getRdfJsPattern, inlined to get you started:

case Config.GRAPHDB_SEARCH_PROPERTY: {
                // builds an Apache Jena-specific search pattern
                let bgpPatternForLuceneQuery: BgpPattern = SparqlFactory.buildBgpPattern([
                    {
                        subject: DataFactory.blankNode("g_0"),
                        predicate: DataFactory.namedNode("http://jena.apache.org/text#query"),
                        object: DataFactory.blankNode("g_2"),
                    },
                    {
                        subject: DataFactory.blankNode("g_2"),
                        predicate: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#first"),
                        object: DataFactory.namedNode("http://www.w3.org/2000/01/rdf-schema#label"),
                    },
                    {
                        subject: DataFactory.blankNode("g_2"),
                        predicate: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#rest"),
                        object: DataFactory.blankNode("g_3"),
                    },
                    {
                        subject: DataFactory.blankNode("g_3"),
                        predicate: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#first"),
                        object: DataFactory.literal(
                            `${this.widgetValues[0].value.regex}*`
                        ),
                    },
                    {
                        subject: DataFactory.blankNode("g_3"),
                        predicate: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#rest"),
                        object: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#nil"),
                    },
                    {
                        subject: DataFactory.blankNode("g_0"),
                        predicate: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#first"),
                        object: DataFactory.variable(
                            this.getVariableValue(this.startClassVal)
                        ),
                    },
                    {
                        subject: DataFactory.blankNode("g_0"),
                        predicate: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#rest"),
                        object: DataFactory.blankNode("g_1"),
                    },
                    {
                        subject: DataFactory.blankNode("g_1"),
                        predicate: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#first"),
                        object: DataFactory.variable("score")
                    },
                    {
                        subject: DataFactory.blankNode("g_1"),
                        predicate: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#rest"),
                        object: DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#nil"),
                    },
                    {
                        subject: DataFactory.variable(
                            this.getVariableValue(this.startClassVal)
                        ),
                        predicate: {
                            type: "path",
                            pathType: "/",
                            items: [
                                DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
                                {
                                    type: "path",
                                    pathType: "*",
                                    items: [
                                        DataFactory.namedNode("http://www.w3.org/2000/01/rdf-schema#subClassOf")
                                    ]
                                }
                            ]
                        },
                        object: DataFactory.namedNode(this.startClassVal.type)
                    }
                ])

                // builds an Apache Jena-specific subquery with respect to ordering by lucene score
                let subqueryTopHitsForLuceneQuery: GroupPattern = SparqlFactory.buildGroupPattern([
                    {
                        type: "query",
                        limit: 100,
                        offset: 0,
                        queryType: "SELECT",
                        distinct: true,
                        prefixes: {},
                        variables: [
                            DataFactory.variable(
                                this.getVariableValue(this.startClassVal)
                            ),
                            DataFactory.variable("score")
                        ],
                        where: [bgpPatternForLuceneQuery],
                        order: [
                            {
                                expression: DataFactory.variable("score"),
                                descending: true
                            } as Ordering
                        ]
                    }
                ])

                return [
                    subqueryTopHitsForLuceneQuery
                ];
            }

the "http://www.w3.org/2000/01/rdf-schema#label" is hardcoded - how do I get it from the widget values?

tfrancart commented 1 year ago

Can I rearrange the query after generation?

You could maybe play with the isBlockingStart, isBlockingObjectProp and isBlockingEnd functions, inherited from the class WidgetValue, which you can override. They prevent the normal SPARQL query generation code from generating the ?Company_1 rdf:type <https://schema.coypu.org/global#Company>; ... part of the query, but then you have to insert that part yourself when building the getRdfJsPattern function.
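The blocking idea can be pictured with a toy model. This is only a sketch under loud assumptions: ToyWidget, ToyJenaSearchWidget, getPatterns and generate are hypothetical names that mimic the mechanism described here, not Sparnatural's actual classes or signatures, and patterns are plain strings instead of sparqljs objects.

```typescript
// Toy model only: these names mimic the thread, not Sparnatural's real classes.
class ToyWidget {
  // When true, the default generator skips the "?x rdf:type <Class>" triple.
  isBlockingStart(): boolean {
    return false;
  }
  // Extra patterns contributed by the widget itself (none by default).
  getPatterns(_startVar: string, _startType: string): string[] {
    return [];
  }
}

class ToyJenaSearchWidget extends ToyWidget {
  // Block the default start triple ...
  override isBlockingStart(): boolean {
    return true;
  }
  // ... and emit it ourselves, after the text-search subquery.
  override getPatterns(startVar: string, startType: string): string[] {
    return [
      `{ SELECT ?${startVar} ?score WHERE { (?${startVar} ?score) text:query (rdfs:label "Bayer*") } }`,
      `?${startVar} rdf:type <${startType}> .`,
    ];
  }
}

// Stand-in for the default generator: start triple first, unless blocked.
function generate(widget: ToyWidget, startVar: string, startType: string): string[] {
  const start = widget.isBlockingStart()
    ? []
    : [`?${startVar} rdf:type <${startType}> .`];
  return [...start, ...widget.getPatterns(startVar, startType)];
}

const defaultOrder = generate(
  new ToyWidget(), "Company_1", "https://schema.coypu.org/global#Company");
const jenaOrder = generate(
  new ToyJenaSearchWidget(), "Company_1", "https://schema.coypu.org/global#Company");
```

With the blocking widget, the text subquery comes first and the type triple follows it, which is the ordering requested earlier in the thread.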

the "http://www.w3.org/2000/01/rdf-schema#label" is hardcoded - how do I get it from the widget values?

this.objectPropVal.type I think.

SteinerPascal commented 1 year ago

Okay, good to see that you are making progress! An additional note: if you set isBlockingStart to false, it will also prevent the object property part (isBlockingObjectProp) from being built. That is because the object property cannot be built without the first selected value.

As for the rearranging: I think @tfrancart is right, and the best way to do it would be to block the creation of the starting part and insert it yourself. Something like this might work (untested):

let bgp: BgpPattern = SparqlFactory.buildBgpPattern([
  SparqlFactory.buildTriple(
    DataFactory.variable(this.getVariableValue(this.startClassVal)),
    DataFactory.namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
    // the class is an IRI, so it has to be a named node rather than a variable
    DataFactory.namedNode(this.startClassVal.type)
  ),
]);

// And for the return value
return [
  subqueryTopHitsForLuceneQuery,
  bgp
];

As for the hardcoded "http://www.w3.org/2000/01/rdf-schema#label": can you provide a SPARQL example of what the result should look like? I am not sure I understand correctly what you are trying to achieve.

tfrancart commented 1 year ago

@elasticjava I have added the necessary entries in the configuration ontology as well as in the widget code to fill in the Jena-specific query generation; see the comment at https://github.com/sparna-git/Sparnatural/issues/480#issuecomment-1517618697