sparna-git / Sparnatural

Sparnatural : Typescript visual SPARQL query builder for knowledge graphs, configurable with SHACL
http://sparnatural.eu
GNU Lesser General Public License v3.0

Idea : how-to implement the SERVICE keyword for federated querying #352

Closed tfrancart closed 1 year ago

tfrancart commented 1 year ago

My ideas to implement the "SERVICE" keyword:

Step 1: a new option similar to "optional" and "not exists" is proposed on properties where federation makes sense:

image

Step 2: when the user clicks on "Service...", a dropdown is displayed with the available federation services

image

Step 3: the user selects a federation service, and the "service" arrow is renamed to the selected service and gets activated

image

We note that the SERVICE would apply to the whole "subtree" like the optional or not exists.

This would require:

  1. Extending the config with the ability to declare the available federation services with their labels and URLs
  2. Annotating properties with an "enableService" annotation that would point to the available federation services that make sense for this query
  3. Implementing the new "Service" option arrow
  4. Defining a new line around the "serviced" criteria lines (we have dotted orange for optional, dotted black for not exists, something like dotted blue for service)
  5. Wrapping the corresponding SPARQL criteria in a SERVICE clause, which would be similar to wrapping them in an OPTIONAL or NOT EXISTS
  6. Extending the JSON data structure to express the SERVICE
  7. Writing and reading the JSON data structure to save/load the queries
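
Points 1, 6 and 7 of the list above could look roughly like the following TypeScript sketch. All names (`FederationService`, `Branch`, `serviceUrl`, `saveQuery`, `loadQuery`) are hypothetical and are not Sparnatural's actual data model; the sketch only assumes that a criteria line can carry a service flag next to the existing optional / not exists flags, and that saving and loading are a JSON round trip.

```typescript
// Hypothetical shapes, for illustration only (not Sparnatural's actual data model).
interface FederationService {
  label: string;       // shown in the dropdown, e.g. "DBPedia (english)"
  endpointUrl: string; // SPARQL endpoint used in the SERVICE clause
}

interface Branch {
  property: string;    // IRI of the property of this criteria line
  optional?: boolean;
  notExists?: boolean;
  serviceUrl?: string; // new: set when the whole subtree must be wrapped in SERVICE
  children: Branch[];
}

// Saving and loading are a plain JSON round trip; the new field is carried along.
function saveQuery(root: Branch): string {
  return JSON.stringify(root);
}

function loadQuery(json: string): Branch {
  return JSON.parse(json) as Branch;
}

const dbpedia: FederationService = {
  label: "DBPedia (english)",
  endpointUrl: "https://dbpedia.org/sparql",
};

const query: Branch = {
  property: "http://example.org/locatedIn",
  children: [
    {
      property: "http://dbpedia.org/ontology/partOf",
      serviceUrl: dbpedia.endpointUrl,
      children: [],
    },
  ],
};

const reloaded = loadQuery(saveQuery(query));
```

Storing the service URL on the branch, rather than on each child, mirrors the remark that SERVICE applies to the whole subtree.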

tfrancart commented 1 year ago

The question of the population of lists or autocomplete widgets also needs to be dealt with. If a Service is selected for a property then:

  1. any selected value for the property must be removed
  2. the widget must be reinitialised by using the selected SPARQL endpoint URL

But this implies that the data is homogeneously represented in all possible services, as the query of the datasource remains the same. To be completely open, the property should be associated with multiple datasources, each associated with a different endpoint.

A possibility for configuration is the following:

  1. config:enableService is used to annotate a property and points to a Service identified by its URL
  2. the property is associated with multiple datasources, each associated with a service URL
  3. when a Service is selected from the dropdown menu, the selected datasource is the one with the same URL as the selected Service
  4. if no service is selected, the datasource with no service is selected
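
The selection rule in points 3 and 4 can be sketched as follows. The `Datasource` shape and the `selectDatasource` function are hypothetical names introduced for illustration; the only assumption is that each datasource optionally records the URL of the service it belongs to.

```typescript
// Hypothetical shape: a datasource optionally tied to a service URL.
interface Datasource {
  queryString: string; // SPARQL query used to populate the list or autocomplete widget
  serviceUrl?: string; // undefined means "query the default endpoint"
}

// Pick the datasource whose service URL matches the selected service.
// With no service selected, this picks the datasource that declares no service.
function selectDatasource(
  datasources: Datasource[],
  selectedServiceUrl?: string
): Datasource | undefined {
  return datasources.find((d) => d.serviceUrl === selectedServiceUrl);
}

const datasources: Datasource[] = [
  { queryString: "SELECT ?uri ?label WHERE { ?uri a ?type }" },
  {
    queryString: "SELECT ?uri ?label WHERE { ?uri a ?type }",
    serviceUrl: "https://dbpedia.org/sparql",
  },
];

const forDbpedia = selectDatasource(datasources, "https://dbpedia.org/sparql");
const forLocal = selectDatasource(datasources);
```

A service with no matching datasource yields `undefined`, which the caller would have to treat as a configuration error or fall back on.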

SteinerPascal commented 1 year ago

Regarding https://github.com/sparna-git/Sparnatural/issues/352#issue-1376450415 no. 1: Does this mean optional/notexists/service are exclusive of each other? Is it not possible to use the OPTIONAL when the SERVICE arrow is selected?

Regarding https://github.com/sparna-git/Sparnatural/issues/352#issue-1376450415 no. 2: I'm not sure whether it is wise to task the user to choose between different endpoints. The biggest strength of Sparnatural is that it is easy to query, even for users without any SPARQL knowledge. The SERVICE keyword (in my personal view) implies some more technical knowledge about SPARQL. The notexists/optional options can still be interpreted without knowledge of SPARQL. I tend towards a solution where the selection of the SERVICE keyword is transparent and not visible to the user. That means the configuration should be provided in such a manner that it's not necessary for the user to select an endpoint. So maybe in the beginning we can just say: "if the serviceEnabled property is activated, one must provide one sparqlEndpointUrl, and that information is then fetched from a different endpoint". This is based on the assumption that the person who configures Sparnatural knows the default KG and knows that certain information which is not in this KG is available on a different endpoint.

This way it would also be possible to keep using the optional/notexists together with the service keyword.

If further use cases then arise where advanced users would like to choose the endpoint (which implies the user knows the different information contained in the selectable KGs, since he/she must determine whether he/she wants the label (or so) from Wikidata or DBpedia), then we can still implement it.

tfrancart commented 1 year ago

Does this mean optional/notexists/service are exclusive of each other? Is it not possible to use the OPTIONAL when the SERVICE arrow is selected?

Yes, my assumption is that they are exclusive. Also for visual reasons (it would be hard to find a border style that combines e.g. optional + service).

I'm not sure whether it is wise to task the user to choose between different endpoints. The biggest strength of Sparnatural is that it is easy to query, even for users without any SPARQL knowledge. The SERVICE keyword (in my personal view) implies some more technical knowledge about SPARQL.

I agree that we can limit it to a single service per property, so that the user does not have to choose between different remote services (or we could also simply say: "if only one service is configured on the property, then don't show the service dropdown, otherwise show it"). We could have three nice arrows: "optional / negative / remote".

This way it would also be possible to keep using the optional/notexists together with the service keyword.

Isn't it also an advanced use-case ? I suggest we don't allow the combination of optional/not exists with service, so that we can use the same visual green arrow components. Otherwise we need to find another visual solution.

SteinerPascal commented 1 year ago

My suggestion is actually to not have any visual components at all for the SERVICE keyword. It only depends on the person who configures Sparnatural to decide when the SERVICE keyword is injected. In my opinion the person configuring also knows what is in "my" KG and what information is in the "other" KG. I assume that the user does not know anything about the underlying structure. But the person configuring Sparnatural does. As far as the user sees it, he/she does not even realize there are multiple KGs involved.

Additionally, imagine the following scenario: there are 2 KGs (KG_A, KG_B) deployed under different endpoints by different organizations. KG_A is the default endpoint and KG_B is a remote (federated) endpoint.

A user goes to Sparnatural and sees the Service arrow rendered when he/she starts to build a query. The service button asks if the query should be run on KG_B, since this KG is configured as a service in the config. Now how does the user know if he/she should click the arrow? The user probably doesn't even know that he/she is querying KG_A. This scenario implies the user being aware of multiple things:

  1. I'm aware that I'm querying a database (KG_A) and that there are other KGs from which I can get information.
  2. I know which information is in KG_A and that, in the solution set, I don't want it from KG_A but from KG_B.
  3. I know what a SERVICE endpoint is (this might even confuse tech-savvy people if they are not familiar with SPARQL).
  4. I know that if I click the SERVICE, my query might return nothing because the information was not found in KG_B, but that it might be successful again if I don't select the SERVICE.
  5. I know what information is stored in KG_B and that I would like to have my information from there.

Of course, with some different naming for the SERVICE keyword, it might be easier to understand. But even then I'm certain it's going to be difficult for people to choose between different KGs.

Isn't it also an advanced use-case ? I suggest we don't allow the combination of optional/not exists with service, so that we can use the same visual green arrow components. Otherwise we need to find another visual solution.

Yes, it is a bit advanced, but again only for the person configuring Sparnatural. The user doesn't even realize he/she is querying multiple KGs. They can build their query as if it were one single KG (which, I think, is the true strength of the SERVICE keyword).

Instead of an arrow, I would recommend the following workaround for selecting endpoints. The problem: it is not possible to select between the endpoints at runtime; a decision has to be made beforehand. The solution: you configure multiple object properties, one called getSkillFromDBPedia and the other getSkillFromYourOrganisation. You can then configure them with different endpoints and, if necessary, different datasources. This also doesn't exclude OPTIONAL or NOT EXISTS.

tfrancart commented 1 year ago

Yes, yes, all of this makes total sense. I agree with the approach. It would be nice though to visually render the fact that a criterion is executed remotely, so that the user knows once it is selected; this is not blocking though. I suggest we could use the same rendering as with optional/not exists, but with a different color (light blue?)

I see 2 questions now : 1/ the configuration and 2/ the query execution ordering.

Configuration

In order to configure the service, I suggest that we rely on the Service class of the SPARQL Service Description vocabulary, coupled with a new config-core:sparqlService annotation.

@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix config-core: <http://data.sparna.fr/ontologies/sparnatural-config-core#> .

<http://data.mydomain.org/ontology/sparnatural-config#DBPediaService> a sd:Service ;
  sd:endpoint <https://dbpedia.org/sparql> ;
  rdfs:label "DBPedia (english)" ;
.

<http://data.mydomain.org/ontology/sparnatural-config#getSkillFromDBPedia> a owl:ObjectProperty ;
  rdfs:subPropertyOf config-core:ListProperty ;
  config-core:sparqlService <http://data.mydomain.org/ontology/sparnatural-config#DBPediaService> ;
  rdfs:label "skills from DBPedia"@en ;
.

And I suggest that the existing config-core:sparqlEndpointUrl be deprecated in favor of config-core:sparqlService, which implies taking one extra hop when reading the config: follow config-core:sparqlService to the service, then read its sd:endpoint property to get the actual service URL. The advantage is that we can attach labels and other metadata to the service itself.
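
The "extra hop" could be sketched as below. This is a minimal illustration over a hand-rolled list of triples, not Sparnatural's actual config reader: `Triple`, `objectOf` and `resolveEndpoint` are hypothetical names, and the IRI used for the proposed `config-core:sparqlService` annotation is an assumption.

```typescript
// Minimal triple representation, for illustration only.
type Triple = { s: string; p: string; o: string };

const SD_ENDPOINT = "http://www.w3.org/ns/sparql-service-description#endpoint";
// Assumed IRI for the proposed annotation (the property does not exist yet).
const SPARQL_SERVICE =
  "http://data.sparna.fr/ontologies/sparnatural-config-core#sparqlService";

// Return the first object found for a given subject and predicate.
function objectOf(triples: Triple[], s: string, p: string): string | undefined {
  return triples.find((t) => t.s === s && t.p === p)?.o;
}

// Hop 1: property -> service. Hop 2: service -> endpoint URL.
function resolveEndpoint(triples: Triple[], property: string): string | undefined {
  const service = objectOf(triples, property, SPARQL_SERVICE);
  return service === undefined ? undefined : objectOf(triples, service, SD_ENDPOINT);
}

const NS = "http://data.mydomain.org/ontology/sparnatural-config#";
const config: Triple[] = [
  { s: NS + "DBPediaService", p: SD_ENDPOINT, o: "https://dbpedia.org/sparql" },
  { s: NS + "getSkillFromDBPedia", p: SPARQL_SERVICE, o: NS + "DBPediaService" },
];

const endpoint = resolveEndpoint(config, NS + "getSkillFromDBPedia");
```

A property without the annotation resolves to `undefined`, which would mean "use the default endpoint".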

Query execution ordering

Query execution ordering is key when working with SERVICE, and implies using subqueries. Subqueries are the only way to control the ordering of query execution. There are 2 scenarios involving SERVICE, each requiring a different ordering in query execution:

SERVICE as an additional criterion (executed before main query)

Typical use-case : "I want all the Museums located in Country where [SERVICE] Country part of Europe [end SERVICE]"

We want the "Country part of Europe" criteria executed before, and then joined with our local list of countries. The SERVICE keyword needs to be put in a subquery.

SELECT ?museum
WHERE {
  ?museum a ex:Museum .
  ?museum ex:locatedIn ?country .
  ?country a ex:Country .
  {
    SELECT ?country
    WHERE {
       SERVICE <https://dbpedia.org/sparql> {
         ?country dbo:partOf dbpedia:Europe . 
       }
    }
  }
}

Which also implies that the datasource for dbo:partOf needs to be associated with the DBPedia service to properly fetch the value dbpedia:Europe.
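
Generating this first shape could be sketched as a small string-building function. `serviceAsSubquery` and its parameters are hypothetical names; the sketch assumes the remote triple patterns are already serialized as strings and that a single join variable is projected.

```typescript
// Wrap remote patterns in SERVICE, inside a subquery that projects the join variable.
// The result is a group that can be placed in the WHERE clause of the main query.
function serviceAsSubquery(
  endpoint: string,
  joinVar: string,
  remotePatterns: string[]
): string {
  return [
    "{",
    `  SELECT ?${joinVar}`,
    "  WHERE {",
    `    SERVICE <${endpoint}> {`,
    ...remotePatterns.map((p) => `      ${p}`),
    "    }",
    "  }",
    "}",
  ].join("\n");
}

const europeanCountries = serviceAsSubquery(
  "https://dbpedia.org/sparql",
  "country",
  ["?country dbo:partOf dbpedia:Europe ."]
);
```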

SERVICE to fetch additional metadata (executed after main query)

Typical use-case : "I want all the Museums located in Country where Country part of Europe and [SERVICE] Country has population Number [end SERVICE]"

We want the "Museums located in Country part of Europe" criteria executed before, and then we want the SERVICE clause executed to fetch the population of those Countries only (and not all Countries).

# population is not a criterion, it is just an extra column in the result set
SELECT ?museum ?country ?population
WHERE {
  {
    # putting a star here as we don't know which variable needs to be selected or joined (?)
    SELECT *
    WHERE {
      ?museum a ex:Museum .
      ?museum ex:locatedIn ?country .
      ?country a ex:Country .
      ?country ex:partOf ex:Europe .
    }
  }

  # The ?country variable will be bound to the values fetched by the subquery, so only the population of
  # these countries will be fetched
  SERVICE <https://dbpedia.org/sparql> {
    ?country dbo:population ?population .
  }
}
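
The second shape, with the local criteria in a subquery followed by the SERVICE clause, could be generated along these lines. `localThenService` is a hypothetical name and the patterns are again assumed to be pre-serialized strings; the function only produces the content of the outer WHERE clause.

```typescript
// Put the local patterns in a "SELECT *" subquery, then append the SERVICE clause,
// so that the remote patterns are logically evaluated against the local bindings.
function localThenService(
  localPatterns: string[],
  endpoint: string,
  remotePatterns: string[]
): string {
  const indent = (lines: string[], pad: string): string[] =>
    lines.map((l) => pad + l);
  return [
    "{",
    "  SELECT *",
    "  WHERE {",
    ...indent(localPatterns, "    "),
    "  }",
    "}",
    `SERVICE <${endpoint}> {`,
    ...indent(remotePatterns, "  "),
    "}",
  ].join("\n");
}

const whereBody = localThenService(
  ["?museum ex:locatedIn ?country .", "?country ex:partOf ex:Europe ."],
  "https://dbpedia.org/sparql",
  ["?country dbo:population ?population ."]
);
```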

Discussion

How do we know which situation we are in? If there is no filtering criterion (no value selected, only the eye clicked), then we are in the second situation. Otherwise we are in the first. Which situation is the most common? I don't know. In case of doubt, what should we do? I think we should always execute the local criteria first (so, the second alternative above), to avoid sending an unbound criterion to a potentially large remote triplestore, fetching too many values and killing the query.
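
This decision rule could be sketched as follows. `Ordering`, `ServicedCriterion` and `chooseOrdering` are hypothetical names; the sketch assumes each serviced criterion exposes the list of values the user selected, which is empty when only the eye is clicked.

```typescript
type Ordering = "serviceFirst" | "localFirst";

// Hypothetical view of a criterion placed under SERVICE.
interface ServicedCriterion {
  selectedValues: string[]; // empty when no value is selected (only the "eye" is clicked)
}

// A selected value means the remote criterion filters the results: evaluate SERVICE first.
// Otherwise (including the doubtful case of no criteria at all): local criteria first.
function chooseOrdering(criteria: ServicedCriterion[]): Ordering {
  const filters = criteria.some((c) => c.selectedValues.length > 0);
  return filters ? "serviceFirst" : "localFirst";
}

const populationOnly = chooseOrdering([{ selectedValues: [] }]);
const partOfEurope = chooseOrdering([
  { selectedValues: ["http://dbpedia.org/resource/Europe"] },
]);
```

Defaulting to `"localFirst"` follows the suggestion to avoid sending an unbound criterion to a large remote triplestore.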

SteinerPascal commented 1 year ago

Configuration: Yes, I like the configuration proposal a lot.

Query execution ordering: Okay but the query execution ordering is purely an optimization process. The final result set will be the same between those two queries:

SELECT ?museum
WHERE {
  ?museum a ex:Museum .
  ?museum ex:locatedIn ?country .
  ?country a ex:Country .
  {
    SELECT ?country
    WHERE {
       SERVICE <https://dbpedia.org/sparql> {
         ?country dbo:partOf dbpedia:Europe . 
       }
    }
  }
}

OR

SELECT ?museum
WHERE {
  ?museum a ex:Museum .
  ?museum ex:locatedIn ?country .
  ?country a ex:Country .
   SERVICE <https://dbpedia.org/sparql> {
     ?country dbo:partOf dbpedia:Europe . 
   }
}

My thoughts: First of all, as far as I know, the query execution plan is always up to the implementation and is not set by the SPARQL standard. Your assumption that subqueries are always executed first is technically not correct (or maybe you can point me to a resource where it is defined? I'm not very certain about this...). The SPARQL 1.1 specification states:

Due to the bottom-up nature of SPARQL query evaluation, the subqueries are evaluated logically first, and the results are projected up to the outer query.

This actually only says that subqueries are logically evaluated first. It doesn't say anything about how to technically retrieve these values. I'm not experienced in query optimization and I don't know how the most common implementations handle these queries. Probably the implementation mirrors the logical behavior and the subquery is evaluated first most of the time.

Anyway, I don't think it is wise to engage in query optimization. The query plan behind Sparnatural can change depending on the implementation. Query optimization is not really a responsibility of Sparnatural but of the implementation. If query execution planning and optimization is needed, I would suggest looking for the proper execution engine. I saw that some execution engines allow for query planning and optimization in their configs:
https://docs.aws.amazon.com/neptune/latest/userguide/sparql-query-hints.html
https://www.stardog.com/blog/7-steps-to-fast-sparql-queries/
https://github.com/blazegraph/database/wiki/QueryHints
https://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksAnalyzingSPARQLQuery

I propose the following: implement SERVICE without optimization first (leaving it up to the query execution engine). Then maybe we can add query optimization through a configuration property or so. Or the optimization is handled by a separate piece of software. Or you just tweak the output of Sparnatural.
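
The unoptimized first step amounts to wrapping the serviced patterns in place, with no subquery. A minimal sketch, where `wrapInService` is a hypothetical name and the patterns are assumed to be pre-serialized strings:

```typescript
// Plain wrapping: the patterns stay where they are in the query and are simply
// enclosed in a SERVICE clause; evaluation order is left to the query engine.
function wrapInService(endpoint: string, patterns: string[]): string {
  return [
    `SERVICE <${endpoint}> {`,
    ...patterns.map((p) => `  ${p}`),
    "}",
  ].join("\n");
}

const clause = wrapInService("https://dbpedia.org/sparql", [
  "?country dbo:partOf dbpedia:Europe .",
]);
```

Because the result is an ordinary group pattern, it could still be nested inside an OPTIONAL or a FILTER NOT EXISTS if that combination is allowed later.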

What do you think?

tfrancart commented 1 year ago

Due to the bottom-up nature of SPARQL query evaluation, the subqueries are evaluated logically first, and the results are projected up to the outer query.

This actually only says that subqueries are logically evaluated first. It doesn't say anything about how to technically retrieve these values.

This is what I meant. Everywhere I wrote "executed first", please read "logically executed first". This is the only thing we care about. I don't care about query optimization, and this is not about using proprietary hints of triplestore implementations. I just care about queries not ending in a timeout, and I care about the capability of Sparnatural to properly demonstrate federated queries, because most of the time attempts at demonstrating this end in failure.

Of course, we can progress step-by-step, write the vanilla query first, hit a wall, and then climb the wall to progress, so I agree with your proposal of implementing the SERVICE without anything else.

See this discussion in the SPARQL 1.2 group, where I actually learned about the "subquery trick": https://github.com/w3c/sparql-12/issues/21

SteinerPascal commented 1 year ago

Okay cool then let's start simple first and work our way up. I'll proceed with the implementation then.