Open tirrolo opened 8 months ago
Yes! We (me and @pmaria) spotted this a while ago.
In this example, we should have used SELECT indeed.
However, we still need to figure out how to deal with CONSTRUCT.
Suggestions & Pull Requests are welcome!
Fixed the CONSTRUCT in main
I updated the title to better reflect the important issue here: SPARQL CONSTRUCT support.
Any suggestions on how to add that would be highly appreciated!
CC: @pmaria
Do we really need to accept CONSTRUCT queries? Are we accepting ASK and DESCRIBE as well?
Construct already retrieves RDF, and we would need to define a reference formulation for RDF. Do we want to open that box? Where is the use-case/necessity?
If we allow SPARQL descriptions, we cannot restrict the type of queries. Whatever the SPARQL descriptions' recommendation allows is a potential entry.
> If we allow SPARQL descriptions, we cannot restrict the type of queries. Whatever the SPARQL descriptions' recommendation allows is a potential entry.
Why is that?
Could we not define the details of a reference formulation where we stipulate that e.g. only SELECT and ASK queries are allowed?
The reference formulation refers to how we access the data which is available in a logical source. How the data in the Logical Source were retrieved is beyond the scope of the Reference Formulation.
In this case, all we say is that the data of the Logical Source is retrieved from a SPARQL endpoint which is described via a SPARQL description. If we do not want all data from the SPARQL endpoint, we may define a query, but in this case the query is supposed to be any SPARQL query. An implementation may say that it supports only SELECT and ASK queries, but this is beyond RML. The Reference Formulation would only tell you how to process the data after the SELECT or ASK query, but it would not indicate what the results of the query would be or the format of the data in the Logical Source.
> The reference formulation refers to how we access the data which is available in a logical source. How the data in the Logical Source were retrieved is beyond the scope of the Reference Formulation.
> In this case, all we say is that the data of the Logical Source is retrieved from a SPARQL endpoint which is described via a SPARQL description. If we do not want all data from the SPARQL endpoint, we may define a query, but in this case the query is supposed to be any SPARQL query. An implementation may say that it supports only SELECT and ASK queries, but this is beyond RML. The Reference Formulation would only tell you how to process the data after the SELECT or ASK query, but it would not indicate what the results of the query would be or the format of the data in the Logical Source.
I respectfully disagree. Any reference formulation should define:
If, for SPARQL, we can do this all in 1 reference formulation, that's great. But, currently it is quite unclear how that would work.
I respectfully agree with everything you say, but none of it has anything to do with the SPARQL query; it concerns everything that comes after the SPARQL query. Whether you have a SPARQL query or not, what you list should be defined in a reference formulation; we do not disagree on this. But how we fetch the data from a data source is independent of the reference formulation. Whether or not one used a SPARQL query to retrieve a set of RDF triples is independent of how one refers to these triples. If one used a SELECT query to retrieve some CSV results, then the reference formulation refers to these CSV results and not to the SPARQL query.
> [ ... ] But how we fetch the data from a data source is independent of the reference formulation.
This is where we are disagreeing. How we are fetching data from a source (the rml:iterator) is most definitely part of the definition of the reference formulation and not independent of it.
The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.
How an iteration is computed is indeed part of the reference formulation, but that is something different than how the data is retrieved (the SPARQL query in this case).
What an iteration returns (independently of which reference formulation we use) is a set of key-value pairs that RML can then consider to create the RDF triples of each iteration.
> The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.
I must still disagree with this statement.
Take these examples
rml:referenceFormulation rml:XPath;
rml:iterator "/xpath/iterator/expression";
rml:referenceFormulation rml:JSONPath;
rml:iterator "$.jsonpath.expression";
rml:iterator "SELECT * FROM student;";
rml:referenceFormulation rml:SQL2008Query;
rml:iterator "SELECT ?s ?p ?o WHERE { ?s ?p ?o. } LIMIT 100";
rml:referenceFormulation formats:SPARQL_Results_CSV;
All these iterator expressions are specifying what data is to be considered part of the iteration. This includes how we fetch the data. How the results are formed into a logical iteration is not part of the iterator expression, but part of the (currently implicit) rules of the reference formulation. Of course, the iterator must provide an iterable result in accordance with the rules of the reference formulation for a logical iteration to be formed.
There are also logical sources where an iterator is not necessary, since there is a natural way to form a logical iteration on those sources, like with CSV. But again, here, how we iterate is determined by the rules of the reference formulation.
This discussion is an example of why we need more clarity on the definition of the reference formulations. For example, how are logical iterations formed on a JSONPath expression response? This is currently not specified anywhere.
The same question can be asked for SPARQL CONSTRUCT queries, to bring it back to this issue.
> What an iteration returns (independently of which reference formulation we use) is a set of key-value pairs that RML can then consider to create the RDF triples of each iteration.
I do not believe that we can say that an iteration always returns key-value pairs. This is very much dependent on the reference formulation and how reference expressions are to be evaluated against the logical iterations in that reference formulation.
I would say an iteration returns a logical iteration where each iteration is a sub-part of the source on which reference expressions can be evaluated to return values from the source data. How this works exactly should also be defined in the reference formulation.
For example, in the rml:JSONPath reference formulation both the iterator and the reference expressions use the JSONPath standard query language. Thus, the logical iteration consists of sub-documents of the JSON source on which subsequent reference expressions in the JSONPath language can be evaluated.
However, the rml:SQL2008Query reference formulation uses SQL for the iterator and column names as reference expressions. This could indeed be defined as key-value pairs, but that is probably too simplistic in the case of RDBs.
The point is that every reference formulation needs to define these specific aspects.
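To make the contrast concrete, here is a rough Python sketch of what a logical iteration could look like under each of the two formulations. Everything in it is an assumption for illustration: the JSON document, the `students` key, the `student` table contents, and the helper names are made up, and the JSONPath step is a hand-written stand-in for an iterator such as `$.students[*]`, not a real JSONPath engine.

```python
import json
import sqlite3

# Hypothetical JSON source; the "students" key is invented for this sketch.
json_source = json.loads(
    '{"students": [{"name": "Ann", "courses": ["db", "kr"]},'
    ' {"name": "Bob", "courses": []}]}'
)


def json_iterations(document):
    """Stand-in for the iterator $.students[*]: each match is a sub-document
    on which further (JSONPath) reference expressions could be evaluated."""
    return list(document["students"])


def sql_iterations(connection, query):
    """Run the SQL iterator; each row becomes column-name -> value pairs."""
    connection.row_factory = sqlite3.Row
    return [dict(row) for row in connection.execute(query)]


# Hypothetical in-memory table standing in for the relational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

json_iters = json_iterations(json_source)
sql_iters = sql_iterations(conn, "SELECT * FROM student ORDER BY id")
```

In the JSON case an iteration is still a nested document (`courses` is a list), while in the SQL case it flattens to column/value pairs, which is why a single "key-value pairs" definition may not fit every reference formulation.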
EDIT: Pano already provided a more detailed answer.
> The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.
At the moment it looks a bit mixed to me. Consider the following example from the IO spec:
<#RDB> a rml:LogicalSource;
rml:source [ a rml:Source, d2rq:Database;
d2rq:jdbcDSN "jdbc:mysql://localhost/example";
d2rq:jdbcDriver "com.mysql.jdbc.Driver";
d2rq:username "user";
d2rq:password "password";
];
rml:iterator "SELECT * FROM student;";
rml:referenceFormulation rml:SQL2008Query;
To me, here the iterator is stating how the data is actually fetched. Maybe there is something I misunderstood?
This use of rml:iterator is new to me. When was that agreed?
Ah, I think you are referring to rml:query which was used on some rml:Source descriptions.
That was this issue: #28
OK, I missed this. Is it too late for me to disagree with this use of rml:iterator?!
The iterator was meant to indicate how we "traverse" the data, how we iterate over the data we have; it is a pattern that repeats in the data. What pattern can you have if the iterator is a query?
> The iterator was meant to indicate how we "traverse" the data, how we iterate over the data we have; it is a pattern that repeats in the data.
The way an iterator is currently used is as follows:
This was discussed during a W3C CG meeting and was accepted.
> What pattern can you have if the iterator is a query?
You could argue that the iterator was always a query.
For example, an XPath expression is basically a query on a document. The same goes for JSONPath. So the distinction between iterator and query was always questionable, as is argued in #28.
It is the reference formulation that determines whether the result of the expression/query is iterable, and how it should be iterated.
> You could argue that the iterator was always a query.
> For example, an XPath expression is basically a query on a document. The same goes for JSONPath. So the distinction between iterator and query was always questionable, as is argued in #28.
> It is the reference formulation that determines whether the result of the expression/query is iterable, and how it should be iterated.
Which brings us back to the original issue. It is unclear to me how an "iteration" would be performed for CONSTRUCT queries. Probably we will have to define that at some point, though. At the moment, we are in a quite bizarre situation where we accept JSON, CSV, and XML files as input. Not Turtle files, for instance. But can we really justify that? At first sight it looks a bit arbitrary, and I see this as related to the issue of the CONSTRUCT.
Maybe we could argue that for Turtle and other formats we do not really have a reference formulation, that is, a query language that we can use to access file elements. But I am unsure.
> You could argue that the iterator was always a query.
This is not 100% correct. An iterator in the case of tables in R2RML is not a query. It just happens that the query language is used as the reference formulation for some cases and that's why the iterator may be a query.
However, an SQL query used as an iterator would return a table. This is not an iteration pattern, or rather, it is, but it has only 1 iteration: the complete table, because this table does not repeat within a table.
> It is unclear to me how an "iteration" would be performed for CONSTRUCT queries.
This is correct. I think when we proposed RDF as input for the first time, we considered the iteration pattern to be every triple. That was never mentioned anywhere nor specified. If we accept RDF as potential input, then we need a reference formulation to refer to the RDF triples/quads and that reference formulation would also give us the iteration pattern.
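As a rough illustration of the "every triple is one iteration" idea, a minimal Python sketch could look like the following. This is not specified anywhere, so all of it is an assumption: the `subject`/`predicate`/`object` keys, the example IRIs, and the choice of N-Triples as the serialization of a CONSTRUCT result are invented here, and the regular expression only handles very simple lines (IRIs and plain literals without escapes, language tags, or datatypes).

```python
import re

# Matches a very simple N-Triples line: IRI, IRI, then an IRI or a plain
# literal, followed by a dot. Real N-Triples needs a proper parser.
TRIPLE = re.compile(r'^(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>|"[^"]*")\s*\.\s*$')


def triple_iterations(ntriples_text):
    """Treat every triple as one logical iteration (hypothetical rule)."""
    iterations = []
    for line in ntriples_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"unsupported line: {line!r}")
        subject, predicate, obj = match.groups()
        iterations.append(
            {"subject": subject, "predicate": predicate, "object": obj}
        )
    return iterations


# Hypothetical CONSTRUCT result, serialized as N-Triples.
construct_result = """
<http://example.org/ann> <http://example.org/name> "Ann" .
<http://example.org/ann> <http://example.org/knows> <http://example.org/bob> .
"""
iterations = triple_iterations(construct_result)
```

A reference formulation for RDF would still have to say how reference expressions address the parts of each triple; the three fixed keys above are just one possible choice.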
> > You could argue that the iterator was always a query.
> This is not 100% correct. An iterator in the case of tables in R2RML is not a query. It just happens that the query language is used as the reference formulation for some cases and that's why the iterator may be a query.
> However, an SQL query used as an iterator would return a table. This is not an iteration pattern, or rather, it is, but it has only 1 iteration: the complete table, because this table does not repeat within a table.
It may not have been intended that way, but I believe the JSONPath and XPath reference formulations were the only defined reference formulations for which the iterator was actually specified in the mappings in earlier versions of RML. In those versions no iterator would have been specified for any type of SQL logical source. So in practice the iterator was always an expressed query for these formulations. Hopefully this clarifies my point.
Another way to look at it: we could replace rml:iterator with rml:query and the mappings would still make sense. Maybe even more so.
Consider the example in the specification:
Since the iterator uses a CONSTRUCT, the reference formulation format cannot be SPARQL_Results_CSV. Suggest either using a SPARQL SELECT form, or changing the reference formulation format(?).
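The mismatch described here could be caught with a simple consistency check. The Python sketch below is only an illustration under stated assumptions: the compatibility table is my own reading of which query forms each W3C results format covers and is not taken from the RML specification, the function names are hypothetical, and the query-form detection is a naive keyword search rather than a SPARQL parser.

```python
import re

# Assumed compatibility table for this sketch; not part of any RML spec.
ALLOWED_FORMS = {
    "formats:SPARQL_Results_CSV": {"SELECT"},
    "formats:SPARQL_Results_JSON": {"SELECT", "ASK"},
    "formats:SPARQL_Results_XML": {"SELECT", "ASK"},
}

FORM_PATTERN = re.compile(r"\b(SELECT|CONSTRUCT|ASK|DESCRIBE)\b", re.IGNORECASE)


def query_form(query):
    """Return the first query-form keyword, ignoring text inside <...> IRIs."""
    without_iris = re.sub(r"<[^>\s]*>", " ", query)
    match = FORM_PATTERN.search(without_iris)
    if match is None:
        raise ValueError("no SPARQL query form found")
    return match.group(1).upper()


def is_consistent(query, reference_formulation):
    """True if the query form is allowed for the formulation.
    Unknown formulations are treated as inconsistent."""
    allowed = ALLOWED_FORMS.get(reference_formulation, set())
    return query_form(query) in allowed


construct_query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o. } LIMIT 100"
select_query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o. } LIMIT 100"
```

With these definitions, `is_consistent(construct_query, "formats:SPARQL_Results_CSV")` is false while the SELECT variant passes, which matches the suggestion to switch to a SELECT form.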