Point of use of rml:query

pmaria commented 1 year ago

I see several examples where rml:query is used as a property of rml:Source.

And other examples where rml:query is used as a property of rml:LogicalSource.

My preference would be to have rml:query be a property of rml:LogicalSource, because:

this would allow reusing the source description for multiple queries
this would also be more in line with the behavior of rml:iterator, in that the evaluation of an rml:iterator produces a list of Records, and the evaluation of rml:query also produces a list of Records (in the case of relational databases: rows). Having these on the same resource type would simplify implementations.
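
A minimal sketch of that preference, assuming a D2RQ-style database description for the source. The connection details, the resource names <#DB>, <#PeopleSource> and <#OrdersSource>, and the SQL strings are hypothetical, and placing rml:query on rml:LogicalSource is the proposal, not the current spec:

```turtle
@prefix rml: <http://w3id.org/rml/> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .

# Hypothetical source description: connection details only, no query.
<#DB> a rml:Source, d2rq:Database ;
  d2rq:jdbcDSN "jdbc:postgresql://localhost:5432/example" ;
  d2rq:jdbcDriver "org.postgresql.Driver" ;
  d2rq:username "user" ;
  d2rq:password "secret" .

# Proposed placement: each logical source carries its own query
# and both reuse the same source description.
<#PeopleSource> a rml:LogicalSource ;
  rml:source <#DB> ;
  rml:query "SELECT id, name FROM person" .

<#OrdersSource> a rml:LogicalSource ;
  rml:source <#DB> ;
  rml:query "SELECT id, total FROM orders" .
```

With the query on rml:Source instead, <#DB> would have to be duplicated once per query.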

DylanVanAssche commented 1 year ago

Currently, it is inside rml:Source; the examples that put it in rml:LogicalSource are typos.

this would allow reusing the source description for multiple queries

True, this is not possible right now.

this would also be more in line with the behavior of rml:iterator, in that the evaluation of an rml:iterator produces a list of Records, and the evaluation of rml:query also produces a list of Records (in the case of relational databases: rows). Having these on the same resource type would simplify implementations.

That was the case in R2RML; in RML it is not fixed that a query produces rows. It can also be a SPARQL query, which can return results in JSON, XML, CSV, or TSV. In that case, you can have an rml:query in rml:Source with an iterator and reference formulation in rml:LogicalSource. Moreover, some SQL RDBs also support outputting their results in other formats such as XML. Because of this, I moved it to the 'access' side: query results are no longer directly iterable, except when a reference formulation and iterator are provided for results that are not tabular records. However, I'm open to changing it, if we can make it work in all cases besides relational databases.

Having these on the same resource type would simplify implementations.

I cannot follow here: where the query lives should not matter for implementations, should it? In the end, everything is just a language that can be translated into other languages. For example, an implementation could understand RML, YARRRML, and SPARQL-Generate, and map all of them onto its internal representation to execute the instructions.

pmaria commented 1 year ago

Currently, it is inside rml:Source; the examples that put it in rml:LogicalSource are typos.

Ah ok.

this would allow reusing the source description for multiple queries

True, this is not possible right now.

this would also be more in line with the behavior of rml:iterator, in that the evaluation of an rml:iterator produces a list of Records, and the evaluation of rml:query also produces a list of Records (in the case of relational databases: rows). Having these on the same resource type would simplify implementations.

That was the case in R2RML; in RML it is not fixed that a query produces rows. It can also be a SPARQL query, which can return results in JSON, XML, CSV, or TSV. In that case, you can have an rml:query in rml:Source with an iterator and reference formulation in rml:LogicalSource. Moreover, some SQL RDBs also support outputting their results in other formats such as XML. Because of this, I moved it to the 'access' side: query results are no longer directly iterable, except when a reference formulation and iterator are provided for results that are not tabular records.

OK, interesting. But then your reference formulation would have to be one that references one of those resulting formats, not the SQL formulation, right? So how does that work? What would such a mapping look like?

In any case I think we need to describe these use cases.

However, I'm open to change it, if we can make it work in all cases besides relational databases.

Having these on the same resource type would simplify implementations.

I cannot follow here: where the query lives should not matter for implementations, should it? In the end, everything is just a language that can be translated into other languages. For example, an implementation could understand RML, YARRRML, and SPARQL-Generate, and map all of them onto its internal representation to execute the instructions.

Well, my point is: there should be a single point in the language that produces records (the items against which references are evaluated). I think this should be the rml:LogicalSource in this case, and not the rml:Source.

This would keep it simple for implementations as well, in the sense that you can expect the logical source to describe how records are generated from a source. And the source to just describe the static aspects of the source.

So then the question is: what does an rml:query produce? Does it produce records, or is it indeed always part of an rml:Source from which you create new records using a reference formulation and an (implicit) iterator?

IMO an rml:iterator is not essentially different from an rml:query. Both are expressions in some reference formulation that result in a list of items. So to me it feels like something that should be at the same level in the language.

DylanVanAssche commented 1 year ago

OK, interesting. But then your reference formulation would have to be one that references one of those resulting formats, not the SQL formulation, right? So how does that work? What would such a mapping look like?

Yes, the reference formulation must be able to iterate over the results. For example, SPARQL JSON results will have a JSONPath reference formulation and a JSONPath iterator. Describing these cases would indeed benefit the spec, at least by adding some examples of SQL vs SPARQL.

Mapping:

@prefix rml: <http://w3id.org/rml/> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix formats: <http://www.w3.org/ns/formats/> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#SDSourceAccess> a rml:Source, sd:Service;
  sd:endpoint <http://example.com/sparql/>;
  sd:supportedLanguage sd:SPARQL11Query;
  # JSON results, so that the JSONPath iterator below applies
  sd:resultFormat formats:SPARQL_Results_JSON;
  rml:query """
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>

  SELECT ?name ?age WHERE {
    ?person foaf:name ?name .
    ?person foaf:age ?age .
  }
  """.

<#TriplesMap> a rml:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source <#SDSourceAccess>;
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.results.bindings[*]"
  ];
  rml:subjectMap [ a rml:SubjectMap;
    # the template may only use variables that the query selects
    rml:template "http://example.org/{name.value}";
  ];
  rml:predicateObjectMap [ a rml:PredicateObjectMap;
    rml:predicateMap [ a rml:PredicateMap;
      rml:constant foaf:name;
    ];
    rml:objectMap [ a rml:ObjectMap;
      rml:reference "name.value";
    ];
  ].

Well, my point is: there should be a single point in the language that produces records (the items against which references are evaluated). I think this should be the rml:LogicalSource in this case, and not the rml:Source.

Agreed! That's why I moved it: for me, the iterator is what produces records. The query only selects part of the source and doesn't generate records on its own; it returns a result set over which an iteration must still be applied. In R2RML it was assumed that the results are iterated row by row, which we cannot assume for other query languages such as GraphQL, NoSQL-like languages, SPARQL, etc.

So then the question is: what does an rml:query produce? Does it produce records, or is it indeed always part of an rml:Source from which you create new records using a reference formulation and an (implicit) iterator?

An rml:query produces a result set; what that result set looks like depends on the source, hence 'access'. The iterator and reference formulation then iterate over this result set for the engine, so that you get the necessary records.

pmaria commented 1 year ago

OK, thanks for the clarifications.

That leaves me with the following concerns.

We cannot reuse a database source across multiple queries. Engines will have to extract the query from the source description to determine which queries should be sent to the same database.
The iterator, when specified (non-row), can also limit the source, and thus behaves much like a query. To me, this raises the question of whether there is a distinction at all.
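
To illustrate the second concern, here is a small sketch of an iterator that already selects a subset of the source. The resource names, the JSON structure behind the path ($.people with an age field), and the omitted access details are all assumptions:

```turtle
@prefix rml: <http://w3id.org/rml/> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .

# Hypothetical JSON source; access details are left out on purpose.
<#PeopleFile> a rml:Source .

# The JSONPath filter restricts which records are produced,
# much like a WHERE clause in a query would.
<#AdultsSource> a rml:LogicalSource ;
  rml:source <#PeopleFile> ;
  rml:referenceFormulation ql:JSONPath ;
  rml:iterator "$.people[?(@.age >= 18)]" .
```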

@prefix rml: <http://w3id.org/rml/> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix formats: <http://www.w3.org/ns/formats/> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#SDSourceAccess> a rml:Source, sd:Service;
  sd:endpoint <http://example.com/sparql/>;
  sd:supportedLanguage sd:SPARQL11Query;
  # JSON results, so that the JSONPath iterator below applies
  sd:resultFormat formats:SPARQL_Results_JSON;
  rml:query """
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>

  SELECT ?name ?age WHERE {
    ?person foaf:name ?name .
    ?person foaf:age ?age .
  }
  """.

<#TriplesMap> a rml:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source <#SDSourceAccess>;
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.results.bindings[*]"
  ];
  rml:subjectMap [ a rml:SubjectMap;
    # the template may only use variables that the query selects
    rml:template "http://example.org/{name.value}";
  ];
  rml:predicateObjectMap [ a rml:PredicateObjectMap;
    rml:predicateMap [ a rml:PredicateMap;
      rml:constant foaf:name;
    ];
    rml:objectMap [ a rml:ObjectMap;
      rml:reference "name.value";
    ];
  ].

Next to this, it is not clear to me how the above example would be expressed for relational databases, since rml:referenceFormulation is a property of rml:LogicalSource. The above example uses sd:supportedLanguage sd:SPARQL11Query. What do you use for other source types?

DylanVanAssche commented 1 year ago

@pmaria Maybe we should indeed treat a query as an iterator and use a referenceFormulation such as rml:SQL2008, which (i) indicates that the query used as iterator follows SQL 2008 and (ii) also implies how to refer to columns in the query. The same goes for SPARQL: the query is given as the iterator, and the reference formulation, e.g. formats:SPARQL_Results_CSV, then also says how to refer to the SPARQL results. Your argument @pmaria makes sense, and I'm actually getting convinced by it.

I know @andimou has also an opinion on this, I will wait for her as well before changing things.

DylanVanAssche commented 1 year ago

@pmaria Maybe we should indeed treat a query as an iterator and use a referenceFormulation such as rml:SQL2008, which (i) indicates that the query used as iterator follows SQL 2008 and (ii) also implies how to refer to columns in the query. The same goes for SPARQL: the query is given as the iterator, and the reference formulation, e.g. formats:SPARQL_Results_CSV, then also says how to refer to the SPARQL results. Your argument @pmaria makes sense, and I'm actually getting convinced by it.

I know @andimou has also an opinion on this, I will wait for her as well before changing things.

@pmaria @andimou Do we have an agreement here?

Basically, we would then allow putting the query in rml:iterator and setting rml:referenceFormulation to rml:SQL2008, formats:SPARQL_Results_CSV, etc.? rml:query would then be dropped.

If so, I can adjust the spec, testcases, etc.
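
A rough sketch of what that could look like for SPARQL, assuming the W3C formats IRI for SPARQL CSV results is used as the reference formulation. The name <#PeopleSource> is hypothetical, and this shows the proposal under discussion rather than anything the spec defines yet:

```turtle
@prefix rml: <http://w3id.org/rml/> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix formats: <http://www.w3.org/ns/formats/> .

# The source describes access only; it no longer carries a query.
<#SDSourceAccess> a rml:Source, sd:Service ;
  sd:endpoint <http://example.com/sparql/> ;
  sd:supportedLanguage sd:SPARQL11Query .

# Proposed: the query moves into rml:iterator, and the reference
# formulation names the result format and thus how references resolve.
<#PeopleSource> a rml:LogicalSource ;
  rml:source <#SDSourceAccess> ;
  rml:referenceFormulation formats:SPARQL_Results_CSV ;
  rml:iterator """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?name ?age WHERE {
      ?person foaf:name ?name .
      ?person foaf:age ?age .
    }
  """ .
```

Because the source no longer contains the query, the same <#SDSourceAccess> can back any number of logical sources.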

pmaria commented 1 year ago

+1 from me

DylanVanAssche commented 1 year ago

Discussed during W3C CG meeting:

Support rr:tableName as: rml:referenceFormulation rml:SQL2008Table; rml:iterator "myTable"
Support rr:sqlQuery as rml:referenceFormulation rml:SQL2008Query; rml:iterator "SELECT column from myTable"

Unrelated: typo in testcase 6f
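
The two mappings from the CG meeting notes could be sketched as follows. The D2RQ-style source description and its connection details are assumptions for illustration, as are the names <#DB>, <#TableSource> and <#QuerySource>; the reference formulations and iterator values are taken from the notes above:

```turtle
@prefix rml: <http://w3id.org/rml/> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .

# Hypothetical database source, shared by both logical sources.
<#DB> a rml:Source, d2rq:Database ;
  d2rq:jdbcDSN "jdbc:postgresql://localhost:5432/example" ;
  d2rq:jdbcDriver "org.postgresql.Driver" ;
  d2rq:username "user" ;
  d2rq:password "secret" .

# Counterpart of rr:tableName "myTable".
<#TableSource> a rml:LogicalSource ;
  rml:source <#DB> ;
  rml:referenceFormulation rml:SQL2008Table ;
  rml:iterator "myTable" .

# Counterpart of rr:sqlQuery.
<#QuerySource> a rml:LogicalSource ;
  rml:source <#DB> ;
  rml:referenceFormulation rml:SQL2008Query ;
  rml:iterator "SELECT column from myTable" .
```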

kg-construct / rml-io

Point of use of rml:query #28