SPARQL CONSTRUCT support

tirrolo commented 6 months ago

Consider this example from the specification:

<#SPARQLEndpoint> a rml:LogicalSource;
    rml:source [ a rml:Source, sd:Service;
        sd:endpoint  <http://example.com/sparql>;
        sd:supportedLanguage sd:SPARQL11Query;
    ];
    rml:iterator "CONSTRUCT WHERE { ?s ?p ?o. } LIMIT 100";
    rml:referenceFormulation formats:SPARQL_Results_CSV;
.

Since the iterator uses a CONSTRUCT query, the reference formulation cannot be SPARQL_Results_CSV. I suggest either using the SPARQL SELECT form or changing the reference formulation (?).
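
To make the mismatch concrete, here is a minimal Python sketch, not taken from the specification. It assumes a hypothetical response body in the SPARQL 1.1 CSV results format, where the header row carries the SELECT variable names and every following row is one solution. A CONSTRUCT query returns an RDF graph instead, which has no such header or rows, so the same row-based iteration cannot be applied to it.

```python
import csv
import io

# Hypothetical body returned for "SELECT ?s ?p ?o WHERE { ?s ?p ?o. }"
# in CSV results format (illustrative data, not from the issue).
SELECT_CSV = (
    "s,p,o\r\n"
    "http://example.com/a,http://example.com/knows,http://example.com/b\r\n"
    "http://example.com/b,http://example.com/name,Bob\r\n"
)


def iterate_csv_results(body: str):
    """Yield one dict per result row, keyed by the query's variable names."""
    yield from csv.DictReader(io.StringIO(body))


rows = list(iterate_csv_results(SELECT_CSV))
for row in rows:
    # Each iteration exposes the variables s, p and o as references.
    print(row["s"], row["p"], row["o"])
```
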

DylanVanAssche commented 6 months ago

Yes! We (@pmaria and I) spotted this a while ago.

In this example, we should have used SELECT indeed. However, we still need to figure out how to deal with CONSTRUCT. Suggestions & Pull Requests are welcome!

DylanVanAssche commented 6 months ago

Fixed the CONSTRUCT in main

DylanVanAssche commented 6 months ago

I updated the title to better reflect the main issue here: SPARQL CONSTRUCT support.

Any suggestions on how to add that would be highly appreciated!

CC: @pmaria

dachafra commented 6 months ago

Do we really need to accept CONSTRUCT queries? Are we accepting ASK and DESCRIBE as well?

CONSTRUCT already returns RDF, and we would need to define a reference formulation for RDF. Do we want to open that box? Where is the use case or necessity?

andimou commented 5 months ago

If we allow SPARQL descriptions, we cannot restrict the type of queries. Whatever the SPARQL descriptions' recommendation allows is a potential entry.

pmaria commented 5 months ago

> If we allow SPARQL descriptions, we cannot restrict the type of queries. Whatever the SPARQL descriptions' recommendation allows is a potential entry.

Why is that?

Could we not define the details of a reference formulation where we stipulate that e.g. only SELECT and ASK queries are allowed?

andimou commented 5 months ago

The reference formulation refers to how we access the data which is available in a logical source. How the data in the Logical Source were retrieved is beyond the scope of the Reference Formulation.

In this case, all we say is that the data of the Logical Source is retrieved from a SPARQL endpoint which is described via a SPARQL description. If we do not want all data from the SPARQL endpoint, we may define a query, but in this case the query can be any SPARQL query. An implementation may state that it supports only SELECT and ASK queries, but this is beyond RML. The Reference Formulation would only tell you how to process the data after the SELECT or ASK query; it would not indicate what the results of the query would be or the format of the data in the Logical Source.

pmaria commented 5 months ago

> The reference formulation refers to how we access the data which is available in a logical source. How the data in the Logical Source were retrieved is beyond the scope of the Reference Formulation.
>
> In this case, all we say is that the data of the Logical Source is retrieved from a SPARQL endpoint which is described via a SPARQL description. If we do not want all data from the SPARQL endpoint, we may define a query, but in this case the query can be any SPARQL query. An implementation may state that it supports only SELECT and ASK queries, but this is beyond RML. The Reference Formulation would only tell you how to process the data after the SELECT or ASK query; it would not indicate what the results of the query would be or the format of the data in the Logical Source.

I respectfully disagree. Any reference formulation should define:

- How to formulate a logical iteration on a source.
- How to resolve a reference expression on a logical iteration.
- If possible, how to naturally map values from the logical iteration to RDF values.

If, for SPARQL, we can do this all in 1 reference formulation, that's great. But, currently it is quite unclear how that would work.

andimou commented 5 months ago

I respectfully agree with everything you say, but none of it concerns the SPARQL query itself; it concerns what comes after the SPARQL query. Whether you have a SPARQL query or not, what you list should be defined in a reference formulation; we do not disagree on this. But how we fetch the data from a data source is independent of the reference formulation. Whether or not one used a SPARQL query to retrieve a set of RDF triples is independent of how one refers to these triples. If one used a SELECT query to retrieve some CSV results, then the reference formulation refers to these CSV results and not to the SPARQL query.

pmaria commented 5 months ago

> [ ... ] But how we fetch the data from a data source is independent of the reference formulation.

This is where we are disagreeing. How we are fetching data from a source (the rml:iterator) is most definitely part of the definition of the reference formulation and not independent of it.

andimou commented 5 months ago

The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.

How an iteration is computed is indeed part of the reference formulation, but that is something different than how the data is retrieved (the SPARQL query in this case).

What an iteration returns (independently of which reference formulation we use) is a set of key-value pairs that RML can then consider to create the RDF triples of each iteration.

pmaria commented 5 months ago

> The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.

I must still disagree with this statement.

Take these examples:

    rml:referenceFormulation rml:XPath;
    rml:iterator "/xpath/iterator/expression";

    rml:referenceFormulation rml:JSONPath;
    rml:iterator "$.jsonpath.expression";

    rml:iterator "SELECT * FROM student;";
    rml:referenceFormulation rml:SQL2008Query;

    rml:iterator "SELECT ?s ?p ?o WHERE { ?s ?p ?o. } LIMIT 100";
    rml:referenceFormulation formats:SPARQL_Results_CSV;

All these iterator expressions are specifying what data is to be considered part of the iteration. This includes how we fetch the data. How the results are formed into a logical iteration is not part of the iterator expression, but part of the (currently implicit) rules of the reference formulation. Of course, the iterator must provide an iterable result in accordance with the rules of the reference formulation for a logical iteration to be formed.

There are also logical sources where an iterator is not necessary, since there is a natural way to form a logical iteration on those sources, like with CSV. But again, here, how we iterate is determined by the rules of the reference formulation.

This discussion is an example of why we need more clarity on the definition of the reference formulations. For example, how are logical iterations formed on a JSONPath expression response? This is currently not specified anywhere.

The same question can be asked for SPARQL CONSTRUCT queries, to bring it back to this issue.

> What an iteration returns (independently of which reference formulation we use) is a set of key-value pairs that RML can then consider to create the RDF triples of each iteration.

I do not believe that we can say that an iteration always returns key-value pairs. This is very much dependent on the reference formulation and on how reference expressions are to be evaluated against the logical iterations in that reference formulation.

I would say an iteration returns a logical iteration where each iteration is a sub-part of the source on which reference expressions can be evaluated to return values from the source data. How this works exactly should also be defined in the reference formulation.

For example, in the rml:JSONPath reference formulation both the iterator and the reference expressions use the JSONPath standard query language. Thus, the logical iteration consists of sub-documents of the JSON source on which subsequent reference expressions in the JSONPath language can be evaluated. However, the rml:SQL2008Query reference formulation uses SQL for the iterator and column names as reference expressions. This could indeed be defined as key-value pairs, but that is probably too simplistic in the case of RDBs.

The point is that every reference formulation needs to define these specific aspects.
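
As a rough illustration of the JSON case, the following Python sketch treats each iteration as a sub-document and resolves a reference against it. It is not a JSONPath implementation: the key lists are hypothetical stand-ins for an iterator such as `$.students[*]` and a reference such as `$.name.first`, and the sample data is invented.

```python
import json
from typing import Any, Iterator, List

# Invented JSON source for illustration only.
SOURCE = json.loads(
    '{"students": ['
    '{"id": 1, "name": {"first": "Ann"}},'
    '{"id": 2, "name": {"first": "Bo"}}'
    "]}"
)


def iterate(doc: Any, iterator_keys: List[str]) -> Iterator[Any]:
    """Walk the keys (stand-in for the iterator expression) and yield
    each element of the array found there as one logical iteration."""
    node = doc
    for key in iterator_keys:
        node = node[key]
    yield from node


def resolve(iteration: Any, reference_keys: List[str]) -> Any:
    """Evaluate a reference (again a key list) against one iteration."""
    node = iteration
    for key in reference_keys:
        node = node[key]
    return node


values = [resolve(it, ["name", "first"]) for it in iterate(SOURCE, ["students"])]
print(values)
```

Each iteration here is a nested sub-document rather than a flat set of key-value pairs, which is why the reference must be evaluated in the same language family as the iterator.
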

tirrolo commented 5 months ago

EDIT: Pano already provided a more detailed answer.

> The rml:iterator does not specify how we fetch the data of the logical source but how we iterate over the data we retrieved.

At the moment it looks a bit mixed to me. Consider the following example from the IO spec:

<#RDB> a rml:LogicalSource;
    rml:source [ a rml:Source, d2rq:Database;
        d2rq:jdbcDSN "jdbc:mysql://localhost/example";
        d2rq:jdbcDriver "com.mysql.jdbc.Driver";
        d2rq:username "user";
        d2rq:password "password";
    ];
    rml:iterator "SELECT * FROM student;";
    rml:referenceFormulation rml:SQL2008Query;
.

To me, here the iterator is stating how the data is actually fetched. Maybe there is something I misunderstood?

andimou commented 5 months ago

This use of rml:iterator is new to me. When was that agreed?

pmaria commented 5 months ago

Ah, I think you are referring to rml:query which was used on some rml:Source descriptions. That was this issue: #28

andimou commented 5 months ago

OK, I missed this. Is it too late to disagree with this use of rml:iterator?

The iterator was meant to indicate how we "traverse" the data, how we iterate over the data we have; it's a pattern that repeats in the data. What pattern can you have if the iterator is a query?

DylanVanAssche commented 5 months ago

> The iterator was meant to indicate how we "traverse" the data, how we iterate over the data we have; it's a pattern that repeats in the data.

The way an iterator is currently used is as follows:

- The reference formulation explains how to refer to the data.
- The iterator contains an expression, in a language suitable for the data source, that iterates over entries in the data source. The reference formulation specifies that language.

This was discussed during a W3C CG meeting and was accepted.

> What pattern can you have if the iterator is a query?

- For JSONPath/XPath, the iterator has a JSONPath/XPath expression that 'creates' iterations over the document. Each iteration is a JSONPath/XPath result.
- For SQL (query, tablename is a shortcut), the iterator has a SQL query expression that 'creates' iterations over the document. Each iteration is a SQL query result.
- For CSV, the iterator is not present as it has a default row-based iterator that does the same: it 'creates' iterations over the document as CSV rows. Each iteration is a CSV row.
- For SPARQL, a SPARQL query is the iterator that 'creates' iterations over the triples. Each iteration is a SPARQL query result.

pmaria commented 5 months ago

You could argue that the iterator was always a query.

For example, an XPath expression is basically a query on a document. The same goes for JSONPath. So the distinction between iterator and query was always questionable, as is argued in #28.

It is the reference formulation that determines whether the result of the expression/query is iterable, and how it should be iterated.

tirrolo commented 5 months ago

> You could argue that the iterator was always a query.
>
> For example, an XPath expression is basically a query on a document. The same goes for JSONPath. So the distinction between iterator and query was always questionable, as is argued in #28.
>
> It is the reference formulation that determines whether the result of the expression/query is iterable, and how it should be iterated.

Which brings us back to the original issue. It is unclear to me how an "iteration" would be performed for CONSTRUCT queries. We will probably have to define that at some point, though. At the moment, we are in a rather bizarre situation where we accept JSON, CSV, and XML files as input, but not Turtle files, for instance. Can we really justify that? At first sight it looks a bit arbitrary, and I see this as related to the CONSTRUCT issue.

Maybe we could argue that for Turtle and other formats we do not really have a reference formulation, that is, a query language that we can use to access file elements. But I am unsure.

andimou commented 5 months ago

> You could argue that the iterator was always a query.

This is not 100% correct. An iterator in the case of tables in R2RML is not a query. It just happens that the query language is used as the reference formulation for some cases, and that's why the iterator may be a query.

However, an SQL query used as an iterator returns a table. That is not an iteration pattern; or rather, it is, but it has only one iteration, the complete table, because the table does not repeat within itself.

> It is unclear to me how an "iteration" would be performed for CONSTRUCT queries.

This is correct. I think when we proposed RDF as input for the first time, we considered the iteration pattern to be every triple. That was never mentioned anywhere nor specified. If we accept RDF as potential input, then we need a reference formulation to refer to the RDF triples/quads, and that reference formulation would also give us the iteration pattern.
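
A small Python sketch of what "every triple is an iteration" could look like, next to one alternative. Both are assumptions for illustration: the per-triple pattern follows the recollection above, while the per-subject grouping is a hypothetical variant added only to show that the choice is not obvious and would have to be fixed by a reference formulation. The triples are invented and assumed to be already parsed from a CONSTRUCT result.

```python
from typing import Dict, Iterator, List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


# Invented CONSTRUCT result, assumed to be already parsed into triples.
GRAPH = [
    Triple("http://example.com/a", "http://example.com/name", "Ann"),
    Triple("http://example.com/a", "http://example.com/knows", "http://example.com/b"),
    Triple("http://example.com/b", "http://example.com/name", "Bob"),
]


def iterate_per_triple(graph: List[Triple]) -> Iterator[Dict[str, str]]:
    """One iteration per triple, exposing subject, predicate and object."""
    for triple in graph:
        yield triple._asdict()


def iterate_per_subject(graph: List[Triple]) -> Iterator[List[Triple]]:
    """Hypothetical alternative: one iteration per distinct subject."""
    grouped: Dict[str, List[Triple]] = {}
    for triple in graph:
        grouped.setdefault(triple.subject, []).append(triple)
    yield from grouped.values()


per_triple = list(iterate_per_triple(GRAPH))
per_subject = list(iterate_per_subject(GRAPH))
print(len(per_triple), len(per_subject))
```

The same graph yields a different number of iterations under each pattern, so a mapping would produce different output depending on which one the reference formulation prescribes.
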

pmaria commented 5 months ago

> > You could argue that the iterator was always a query.
>
> This is not 100% correct. An iterator in the case of tables in R2RML is not a query. It just happens that the query language is used as the reference formulation for some cases, and that's why the iterator may be a query.
>
> However, an SQL query used as an iterator returns a table. That is not an iteration pattern; or rather, it is, but it has only one iteration, the complete table, because the table does not repeat within itself.

It may not have been intended that way, but I believe the JSONPath and XPath reference formulations were the only defined reference formulations for which the iterator was actually specified in the mappings in earlier versions of RML. In those versions, no iterator would have been specified for any type of SQL logical source. So in practice the iterator was always an expressed query for these formulations. Hopefully this clarifies my point.

Another way to look at it: we could replace rml:iterator with rml:query and the mappings would still make sense. Maybe even more so.

kg-construct / rml-io

SPARQL CONSTRUCT support #42