ncbo / goo

Graph Oriented Objects (GOO) for Ruby. A RDF/SPARQL based ORM.
http://ncbo.github.io/goo/
Other
15 stars 6 forks source link

Refactor: Optimize the generated SPARQL queries #124

Closed syphax-bouazzouni closed 2 months ago

syphax-bouazzouni commented 1 year ago

Preamble

This PR includes this

And require this

Starting issues

The generated SPARQL query to fetch data is not optimized because in the select clause we print all the attributes which causes the result to have a lot of rows (the number of rows is equal to the multiplication of the number of results of each attribute)

For example, if we have for a resource (e.g a Class) 3 attributes (e.g prefLabel, synonyms, definitions) with the following values: prefLabel = prefLabel synonyms = [syn_1, syn_2, syn_3] definitions = [def_1, def_2]

We will get the following request results :

| prefLabel | syn_1 | def_1 | | prefLabel | syn_2 | def_1 | | prefLabel | syn_3 | def_1 | | prefLabel | syn_1 | def_2 | | prefLabel | syn_2 | def_2 | | prefLabel | syn_3 | def_2 |

With this new implementation, we will have the following results prefLabel syn_1 syn_2 syn_3 def_1 def_2

The number of rows is equal to the multiplication of the number of results of each attribute, so if we have 4 attributes with each one 10 values, the result will be 10000 lines (10X10X10X10), whereas with this new implementation it is only 40 lines

Solution

We will transform all the requests to

SELECT DISTINCT ?id ?attributeProperty ?attributeObject 
FROM <graph>
WHERE { 
    ?id a <type_uri> . 
  OPTIONAL { ?id ?attributeProperty ?attributeObject . }
FILTER(?id = <id_uri>) 
FILTER(?attributeProperty = <attribute_1_uri> || ... || ?attributeProperty = <attribute_n_uri>)

And for inverse attributes, we will use the following request

SELECT DISTINCT ?id ?attributeProperty ?attributeObject
FROM <graph> 
WHERE { 
 ?id a <type_uri> . 
 OPTIONAL { ?id ?attributeProperty ?attributeObject . } 
 OPTIONAL { ?attributeObject ?attributeProperty ?id . } // added line
FILTER(?id = <id_uri>) 
FILTER(?attributeProperty = <attribute_1_uri> || ... || ?attributeProperty = <attribute_n_uri>)

Solution issue

Theoretically, it should work, but the goo test_inverse.rb doesn’t pass because of a bug of 4store, where the ?attributeproperty was always returning resource not found

To reproduce the bug

Alternative and implemented solution for now

We will use UNION with binds for each attribute of a resource, as so

Include/general query

SELECT DISTINCT ?id ?attributeProperty ?attributeObject
FROM <graph_i> # for each graph from 1..n
WHERE { 
    ?id a <type_uri> . 
    OPTIONAL {

          { ?id <attribute_1_uri> ?attributeObject . BIND("var_1" as ?attributeProperty) } 
            UNION
                ....
            UNION
          { ?attributeObject <attribute_j_uri> ?id . BIND( "var_j" as ?attributeProperty)} # if attribute_j_uri is an inverse attribute 
          UNION 
              ....
            { ?id <attribute_n_uri> ?attributeObject . BIND("var_n" as ?attributeProperty) } 
    }

FILTER(?id = <id_uri>) 
}

Aggregate query

unchanged

BNODE query

unchanged

SELECT DISTINCT ?id ?line1 ?line2 
FROM <http://goo.org/default/PersonPersistent>
 WHERE { 
   ?id a <http://goo.org/default/PersonPersistent> . 
   ?id <http://goo.org/default/contact_data> _:g64460 . 
   _:g64460 <http://goo.org/default/line1> ?line1. 
   _:g64460 <http://goo.org/default/line2> ?line2 . 
   FILTER(?id = <http://goo.org/default/person_persistent/Lewis> || ?id = <http://goo.org/default/person_persistent/John>) }

FILTER query (test/test_where#test_filter)

SELECT DISTINCT ?id ?attributeProperty ?attributeObject 
FROM <graph> 
WHERE { 
      ?id a <type_uri> . 
    ?id <attribute_uri> ?internal_join_var_0 . #can alos be optional
    FILTER(?internal_join_var_0 > "var")
}

ORDER BY query (TODO)

SELECT DISTINCT ?id ?attributeProperty ?attributeObject 
FROM <graph> 
WHERE { 
      ?id a <type_uri> . 
    ?id <attribute_uri> ?internal_join_var_0 . #can also be optional
}
ORDER BY ?internal_join_var_0

unmapped query (test/test_where#test_filter)

before

SELECT DISTINCT ?id ?object ?bind_as 
FROM <http://data.bioontology.org/ontologies/CSV_TEST_BRO/submissions/1> 
WHERE { 
            ?id a <http://www.w3.org/2002/07/owl#Class> . 
      { ?id <http://bioportal.bioontology.org/ontologies/umls/hasSTY> ?object . BIND ("var_hasSTY0" as ?bind_as) } 
      UNION  
      { ?id <http://www.w3.org/2004/02/skos/core#historyNote> ?object . BIND ("var_historyNote0" as ?bind_as) } 
      UNION  
      { ?id <http://data.bioontology.org/metadata/prefixIRI> ?object . BIND ("var_prefixIRI0" as ?bind_as) } 
      ....
      UNION  
      { ?id <http://www.w3.org/2000/01/rdf-schema#label> ?object . BIND ("var_label0" as ?bind_as) } 
      UNION  
      { ?id <http://www.w3.org/2000/01/rdf-schema#subPropertyOf> ?object . BIND ("var_subPropertyOf0" as ?bind_as) } 
      FILTER(?id = <http://bioontology.org/ontologies/BiomedicalResourceOntology.owl#Ontology_Development_and_Management> || ... list of all ids...  || ?id = <http://bioontology.org/ontologies/BiomedicalResourceOntology.owl#Database>) }

new it is the same as the general/includes a query

SELECT DISTINCT ?id ?attributeProperty ?attributeObject 
FROM <http://data.bioontology.org/ontologies/CSV_TEST_BRO/submissions/1> 
WHERE { 
            ?id a <http://www.w3.org/2002/07/owl#Class> . 
      { ?id <http://bioportal.bioontology.org/ontologies/umls/hasSTY> ?object . BIND ("var_hasSTY0" as ?bind_as) } 
      UNION  
      { ?id <http://www.w3.org/2004/02/skos/core#historyNote> ?object . BIND ("var_historyNote0" as ?bind_as) } 
      UNION  
      { ?id <http://data.bioontology.org/metadata/prefixIRI> ?object . BIND ("var_prefixIRI0" as ?bind_as) } 
      ....
      UNION  
      { ?id <http://www.w3.org/2000/01/rdf-schema#label> ?object . BIND ("var_label0" as ?bind_as) } 
      UNION  
      { ?id <http://www.w3.org/2000/01/rdf-schema#subPropertyOf> ?object . BIND ("var_subPropertyOf0" as ?bind_as) } 
      FILTER(?id = <http://bioontology.org/ontologies/BiomedicalResourceOntology.owl#Ontology_Development_and_Management> || ... list of all ids...  || ?id = <http://bioontology.org/ontologies/BiomedicalResourceOntology.owl#Database>) }

Equivalent predicates case

see `def self.expand_equivalent_predicates(query, eq_p)`

Before the filter was in the OPTIONAL clause

SELECT DISTINCT ?id ?attribute1
FROM <graph_i>
WHERE { 
    ?id a <type_uri> .
    OPTIONAL { ?id ?p ?attribute1 . FILTER(?p = <equivalent_uri_1> || .. || ?p = <equivalent_uri_n>) }
     ...
}

Now we add a UNION clause for each equivalent predicate

SELECT DISTINCT ?id ?attributeProperty ?attributeObject
FROM <graph_i>
WHERE { 
    ?id a <type_uri> .
    OPTIONAL { 
         {?id <equivalent_uri_1> ?attributeObject . BIND( "var_1" as ?attributeProperty)) }
                 UNION
         ...
         UNION
           {?id <equivalent_uri_2> ?attributeObject . BIND( "var_1" as ?attributeProperty)) }
     ...

}

Use case

With the use case of AGRVOC (2GB of triples) :

Old implementation New implementation
/ontologies/AGROVOC/submissions/1?display=all (132 properties) 24 minutes (1 032 240 lines returned) 4.3145 seconds (140 lignes returned)
syphax-bouazzouni commented 1 year ago

Hi @mdorf, I think you are the one working on this part of the project. Could you please give me your opinion about this?

mdorf commented 1 year ago

Hi @mdorf, I think you are the one working on this part of the project. Could you please give me your opinion about this?

@syphax-bouazzouni, the solution appears to be sound based on the description. But, honestly, it's a bit difficult for me to provide a constructive feedback based on the number of commits. I need to understand how the original issue affected our system and how the solution addresses it. I wonder if this is a good topic to review during our face-to-face workshop in Montpellier in September.

mdorf commented 1 year ago

Also, we need to make sure that ncbo/ontologies_linked_data and ncbo/ontologies_api tests pass with AllegroGraph with this change in place.

jonquet commented 1 year ago

Also, we need to make sure that ncbo/ontologies_linked_data and ncbo/ontologies_api tests pass with AllegroGraph with this change in place.

I do agree in principle to this. However, I don't think this should be a stopper (or slower) for not taking the change for 4store. This issue (of the cartesian product) has been raised 5 years ago with @vemonet and we have here the opportunity to fix this.

Basically we expect this to fix the call : https://data.bioontology.org/submissions and possibly even https://data.bioontology.org/submissions?display=all

syphax-bouazzouni commented 1 year ago

@syphax-bouazzouni, the solution appears to be sound based on the description. But, honestly, it's a bit difficult for me to provide a constructive feedback based on the number of commits. I need to understand how the original issue affected our system and how the solution addresses it. I wonder if this is a good topic to review during our face-to-face workshop in Montpellier in September.

Yeah, I understand. For this kind of matter, I am totally available to do specific meetings to speak about it. But yes we could also speak about it during the Ontoportal meeting, even though during that meeting I would prefer personally to speak more in a larger overview of our state and the improvements that we can do.

Also, we need to make sure that ncbo/ontologies_linked_data and ncbo/ontologies_api tests pass with AllegroGraph with this change in place.

For AllegroGraph, we have not yet set up an environment to test it, but naively I will say that it should also work for it because its basic SPARQL queries that didn't change a lot from before.

I do agree in principle to this. However, I don't think this should be a stopper (or slower) for not taking the change for 4store. This issue (of the cartesian product) has been raised 5 years ago with @vemonet and we have here the opportunity to fix this.

@jonquet for now the code for 4store and AllegroGraph are the same, so we can't isolate this improvement for only 4store

jonquet commented 2 months ago

Cc @vemonet with who we originally found the bug! And congrats to @syphax-bouazzouni for such an investigation and fix.

vemonet commented 2 months ago

Wow nice to see it finally got resolved for everyone! Good job @syphax-bouazzouni, that was a consequent task