kg-construct / rml-io

RML-IO: Input/Output declarations for RML
https://w3id.org/rml/io/spec
Creative Commons Attribution 4.0 International

Does SPARQL TSV Results make sense? #48

Open chrdebru opened 4 months ago

chrdebru commented 4 months ago

Neither the documentation nor the test cases provide such examples (the same goes for XML and JSON results, by the way). But I question the usefulness of rml:SPARQL_RESULT_TSV. Taking the example of 0003, we would have the following TSV:

?person ?name   ?age
<http://example.org/0>  "Monica Geller" "33"
<http://example.org/1>  "Rachel Green"  "34"
<http://example.org/2>  "Joey Tribbiani"    "35"
<http://example.org/3>  "Chandler Bing" "36"
<http://example.org/4>  "Ross Geller"   "37"

How should we iterate over those? We cannot treat them as regular TSV. The angle brackets should be removed from IRIs. Literals should be "cast" to their datatypes. And I have no idea what to do with blank node identifiers. Is it possible the group thought that the TSV output would be the same as CSV output, but with tabs?

Same question for JSON and XML representations of SPARQL queries: do they have bespoke iterations (i.e., not the same iterations as for "regular" JSON or XML files), or would iterating over them require a second iterator?

DylanVanAssche commented 4 months ago

I think this is again the same problem that @pmaria mentioned and wanted to 'document' in a Note: https://github.com/kg-construct/rml-core/issues/113

Basically, we would need to define a proper reference formulation here. formats:SPARQL_Results_TSV defines the format, not how to iterate upon them. We would need something like rml:SPARQLSelectTSV. In the Note we would then define what an RML processor should do to iterate over the results:

Same for the others. If you need multiple iterators, I think that is an RML Fields matter: there you can have nested iterators, even with mixed data formats such as JSON in CSV.

chrdebru commented 4 months ago

I see. I believe we need test cases, as only CSV is covered and CSV boils down to iterating CSV documents. The other formats have quirks. I disagree with the use of BN identifiers. One query can generate _:b1 for a BN, and another query over another dataset can do so as well. Reusing these BN identifiers (which refer to different things when they reside in different graphs) would lead to problems. It would also become engine-dependent (rdflib vs. Apache Jena vs. RDF4J, ...).

SPARQL stipulates that you should at least support CSV and XML (among others); in other words, we could technically limit it to two: one with data type information and CSV for easier processing.

andimou commented 4 months ago

formats:SPARQL_Results_TSV defines the format, not how to iterate upon them

A reference formulation specifies which grammar one can use to access the data of a logical source, not the format. Does rml:SPARQL_RESULT_TSV aim to indicate that the results should be in SPARQL TSV results format or that the data need to be accessed as a SPARQL result of TSV format (whatever that means?)?

How should we iterate over those?

@chrdebru do you want us to include a description of a reference formulation that indicates the iteration pattern to be per row?

We cannot treat them as regular TSV.

Why not?

I have no idea what to do with blank node identifiers

@chrdebru could you please clarify this?

Is it possible the group thought that the TSV output would be the same as CSV output, but with tabs?

Isn't it possible to specify the delimiter in a CSVW description of the result?

Same question for JSON and XML representations of SPARQL queries: do they have bespoke iterations (i.e., not the same iterations as for "regular" JSON or XML files), or would iterating over them require a second iterator?

@chrdebru I do not understand this, what would the first iterator be?

chrdebru commented 4 months ago

I'm using the following data and query as an example:

<http://example.org/0> <http://xmlns.com/foaf/0.1/age> 33 .
<http://example.org/0> <http://xmlns.com/foaf/0.1/name> "Monica Geller" .
<http://example.org/1> <http://xmlns.com/foaf/0.1/age> 34 .
<http://example.org/1> <http://xmlns.com/foaf/0.1/name> "Rachel Green" .

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person (STR(?person) AS ?person2) ?name ?age WHERE {
        ?person foaf:name ?name .
        ?person foaf:age ?age .
    } 

CSV:

person,person2,name,age
http://example.org/1,http://example.org/1,Rachel Green,34
http://example.org/0,http://example.org/0,Monica Geller,33

TSV:

?person ?person2    ?name   ?age
<http://example.org/1>  "http://example.org/1"  "Rachel Green"  34
<http://example.org/0>  "http://example.org/0"  "Monica Geller" 33

XML and JSON:

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="person"/>
    <variable name="person2"/>
    <variable name="name"/>
    <variable name="age"/>
  </head>
  <results>
    <result>
      <binding name="person">
        <uri>http://example.org/1</uri>
      </binding>
      <binding name="person2">
        <literal>http://example.org/1</literal>
      </binding>
      <binding name="name">
        <literal>Rachel Green</literal>
      </binding>
      <binding name="age">
        <literal datatype="http://www.w3.org/2001/XMLSchema#integer">34</literal>
      </binding>
    </result>
    <result>
      <binding name="person">
        <uri>http://example.org/0</uri>
      </binding>
      <binding name="person2">
        <literal>http://example.org/0</literal>
      </binding>
      <binding name="name">
        <literal>Monica Geller</literal>
      </binding>
      <binding name="age">
        <literal datatype="http://www.w3.org/2001/XMLSchema#integer">33</literal>
      </binding>
    </result>
  </results>
</sparql>

{ "head": {
    "vars": [ "person" , "person2" , "name" , "age" ]
  } ,
  "results": {
    "bindings": [
      { 
        "person": { "type": "uri" , "value": "http://example.org/1" } ,
        "person2": { "type": "literal" , "value": "http://example.org/1" } ,
        "name": { "type": "literal" , "value": "Rachel Green" } ,
        "age": { "type": "literal" , "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "value": "34" }
      } ,
      { 
        "person": { "type": "uri" , "value": "http://example.org/0" } ,
        "person2": { "type": "literal" , "value": "http://example.org/0" } ,
        "name": { "type": "literal" , "value": "Monica Geller" } ,
        "age": { "type": "literal" , "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "value": "33" }
      }
    ]
  }
}

chrdebru commented 4 months ago

formats:SPARQL_Results_TSV defines the format, not how to iterate upon them

A reference formulation specifies which grammar one can use to access the data of a logical source, not the format. Does rml:SPARQL_RESULT_TSV aim to indicate that the results should be in SPARQL TSV results format or that the data need to be accessed as a SPARQL result of TSV format (whatever that means?)?

So we are iterating over solutions then, right? What is then the point of having those formats if we know that all SPARQL implementations must support XML and CSV (at least)? So would rml:referenceFormulation rml:SPARQL_RESULT_SET not be sufficient?

The following are details that are not relevant anymore if the above answer is "yes."

We cannot treat them as regular TSV.

Why not?

The TSV representation of SPARQL results prescribes how terms are encoded (e.g., the angle brackets). The variable names also have question marks. Should references use name or ?name? The former is used in XML, JSON, and CSV, the latter in TSV. I would find it weird to have to rewrite all references when switching from TSV to CSV.

When you retrieve ?person, do you want to retrieve <http://example.org/1> as a value, or do you want to process the TSV file as a SPARQL result set serialization and thus remove the < and > from <http://example.org/1> before returning the value?
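To make the second option concrete, here is a minimal sketch of what "processing the TSV as a result set serialization" could look like. It is only an illustration under my own assumptions: the names Term, decode_term and parse_tsv are hypothetical, the term decoding is a simplified reading of the SPARQL TSV format (only integers are recognised among the bare numeric forms), and nothing here is defined by RML-IO.

```python
import re
from collections import namedtuple

# Hypothetical helper types and names, for illustration only.
Term = namedtuple("Term", "kind value datatype lang")

XSD = "http://www.w3.org/2001/XMLSchema#"
_ESCAPES = {"t": "\t", "n": "\n", "r": "\r"}
_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"(?:\^\^<([^>]*)>|@([A-Za-z0-9-]+))?')


def decode_term(cell):
    """Decode one TSV cell into a Term, or None for an unbound variable."""
    if cell == "":
        return None
    if cell.startswith("<") and cell.endswith(">"):
        return Term("iri", cell[1:-1], None, None)  # strip the angle brackets
    if cell.startswith("_:"):
        return Term("bnode", cell[2:], None, None)
    match = _LITERAL.fullmatch(cell)
    if match:
        # Undo backslash escapes inside the quoted lexical form.
        lexical = re.sub(
            r"\\(.)",
            lambda e: _ESCAPES.get(e.group(1), e.group(1)),
            match.group(1),
        )
        return Term("literal", lexical, match.group(2), match.group(3))
    if re.fullmatch(r"[+-]?\d+", cell):
        # A bare integer such as 34 is shorthand for an xsd:integer literal.
        return Term("literal", cell, XSD + "integer", None)
    # Simplification: any other bare token is kept as an untyped literal.
    return Term("literal", cell, None, None)


def parse_tsv(text):
    """Yield one dict per solution, keyed by variable name without '?'."""
    lines = text.splitlines()
    variables = [name.lstrip("?") for name in lines[0].split("\t")]
    for line in lines[1:]:
        yield dict(zip(variables, map(decode_term, line.split("\t"))))


sample = (
    "?person\t?person2\t?name\t?age\n"
    '<http://example.org/1>\t"http://example.org/1"\t"Rachel Green"\t34\n'
)
rows = list(parse_tsv(sample))
```

With this reading, a reference to person works for both CSV and TSV, and person and person2 stay distinguishable as an IRI and a literal.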

I have no idea what to do with blank node identifiers

@chrdebru could you please clarify this?

With CSV, we cannot distinguish between blank node identifiers and literals (same as with IRIs).

person,person2,name,age
b0,b0,Foo Bar,22
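
The same loss of term type can be shown with the CSV from my earlier example, where ?person is an IRI and ?person2 is its string form. This is just a quick sketch using Python's standard csv module; the variable names are mine.

```python
import csv
import io

# One row of the CSV result shown earlier in this thread.
csv_results = (
    "person,person2,name,age\n"
    "http://example.org/1,http://example.org/1,Rachel Green,34\n"
)

row = next(csv.DictReader(io.StringIO(csv_results)))

# The IRI (person) and the plain literal (person2) are now the same string.
indistinguishable = row["person"] == row["person2"]

# The xsd:integer datatype of age is gone as well; only text is left.
age_is_text = isinstance(row["age"], str)
```

Both flags end up True, so a processor reading this as regular CSV has no way to recover the term type or datatype.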

Is it possible the group thought that the TSV output would be the same as CSV output, but with tabs?

Isn't it possible to specify the delimiter in a CSVW description of the result?

No. I'm talking about TSV of SPARQL result sets, which must use tabs.

Same question for JSON and XML representations of SPARQL queries: do they have bespoke iterations (i.e., not the same iterations as for "regular" JSON or XML files), or would iterating over them require a second iterator?

@chrdebru I do not understand this, what would the first iterator be?

The iterator for SPARQL queries is the SPARQL query. So one iterates over the result set. The problem is that I believe the community thought that CSV result sets can be processed as regular CSV files. This is true, but there are unfortunate corner cases. However, we iterate over solutions in a result set (which are dictionaries), and not over a CSV file. TSV (albeit in a more constrained form), JSON, and XML carry much more information (explicit data types, resource types, etc.). TSV also uses a different variable naming convention.

For CSV and TSV, the lines correspond to iterations. For XML and JSON, however, the returned documents need a different iterator, e.g., $.results.bindings[*], followed by person.value to obtain the value. But we cannot provide two such iterators.
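
As a sketch of what a bespoke iteration over the JSON serialization could look like, the function below folds both steps (walking the bindings array, then taking each term's value) into one pass over the solutions. The function name iterate_json_results is hypothetical and this is not something RML-IO defines.

```python
import json


def iterate_json_results(document):
    """Yield one dict per solution, mapping variable name to lexical value."""
    data = json.loads(document)
    variables = data["head"]["vars"]
    for binding in data["results"]["bindings"]:
        # Unbound variables are absent from a binding; expose them as None.
        yield {
            var: binding[var]["value"] if var in binding else None
            for var in variables
        }


sample_json = """
{ "head": { "vars": [ "person", "name", "age" ] },
  "results": { "bindings": [
    { "person": { "type": "uri", "value": "http://example.org/1" },
      "name": { "type": "literal", "value": "Rachel Green" } }
  ] } }
"""
solutions = list(iterate_json_results(sample_json))
```

A reference such as name then resolves directly against each solution, without the mapping author having to know about results.bindings.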

As such, I am questioning the added value of rml:SPARQL_RESULT_XXX. We know that SPARQL implementations should at least return CSV and XML. If we use XML, we have everything we need to iterate over the solutions.
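
To illustrate that the XML serialization is enough on its own, here is a minimal sketch that iterates over the solutions with Python's standard xml.etree.ElementTree and keeps the term type and datatype. The function name iterate_xml_results and the shape of the returned dictionaries are my own assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iterate_xml_results(document):
    """Yield one dict per <result>, keyed by variable name."""
    root = ET.fromstring(document)
    for result in root.findall("sr:results/sr:result", NS):
        solution = {}
        for binding in result.findall("sr:binding", NS):
            term = binding[0]  # the <uri>, <literal> or <bnode> child
            solution[binding.get("name")] = {
                "type": term.tag.split("}", 1)[1],  # drop the namespace part
                "value": term.text or "",
                "datatype": term.get("datatype"),
                "lang": term.get(XML_LANG),
            }
        yield solution


sample_xml = """
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="person"/><variable name="age"/></head>
  <results>
    <result>
      <binding name="person"><uri>http://example.org/1</uri></binding>
      <binding name="age">
        <literal datatype="http://www.w3.org/2001/XMLSchema#integer">34</literal>
      </binding>
    </result>
  </results>
</sparql>
"""
xml_solutions = list(iterate_xml_results(sample_xml))
```

Each solution keeps the distinction between IRIs, literals and blank nodes, which is exactly the information that the CSV serialization drops.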

The test cases for SPARQL queries are a bit naïve as they only look at CSV without corner cases (e.g., there are no IRIs in the result set).