kg-construct / rml-core

RML-Core: Main features for RDF generation with RML
https://w3id.org/rml/core/spec
Creative Commons Attribution 4.0 International

Removing test 20b, which assumes an RML processor prepends generated terms that have a relative IRI with a (configurable) base IRI #94

Closed chrdebru closed 7 months ago

chrdebru commented 7 months ago

An RML engine should not normalize IRIs. The engine should check whether the generated value is an absolute IRI, or whether the base IRI + value constitutes an absolute IRI.
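The check described above could be sketched roughly as follows. This is only an illustration: the helper names `is_absolute_iri` and `generate_iri` are hypothetical, and the absoluteness test is a deliberate simplification (a scheme prefix and no obviously forbidden characters), not the full RFC 3987 grammar.

```python
import re
from typing import Optional

# Simplified check, NOT the full RFC 3987 grammar: a scheme followed by ":"
# and none of the characters that may never appear unescaped in an IRI.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")
_FORBIDDEN = re.compile(r'[\s<>"{}|\\^`]')


def is_absolute_iri(value: str) -> bool:
    """Return True if value looks like an absolute IRI (simplified test)."""
    return bool(_SCHEME.match(value)) and not _FORBIDDEN.search(value)


def generate_iri(value: str, base_iri: str) -> Optional[str]:
    """Return value, or base_iri + value, if either is an absolute IRI.

    No normalization or RFC 3986 resolution is applied; the base IRI is
    only prepended by plain string concatenation. None means that no
    absolute IRI could be formed.
    """
    if is_absolute_iri(value):
        return value
    candidate = base_iri + value
    if is_absolute_iri(candidate):
        return candidate
    return None
```

Under these assumptions, `generate_iri("Venus", "http://example.com/base/")` yields `http://example.com/base/Venus`, while a value containing a space yields `None`. What an engine should do with a `None` result (skip the term, or abort and report an error) is exactly the open question in this thread.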

DylanVanAssche commented 7 months ago

Shouldn't we document this properly in the spec? Like explicitly saying this?

bjdmeest commented 7 months ago

yeah and removing the @base prepending default breaks r2rml conformance, so should at least be discussed

DylanVanAssche commented 7 months ago

break r2rml conformance,

This test case assumes data errors, which are checked by an R2RML validator. That part is optional in the specs of both R2RML and RML, so the test contradicts their own spec a bit IMO. Needs discussion for sure.

To obtain an absolute IRI from a relative IRI, the term generation rules of R2RML use simple string concatenation, rather than the more complex algorithm for resolution of relative URIs defined in Section 5.2 of [RFC3986]. This ensures that the original database value can be reconstructed from the generated absolute IRI. ... An R2RML processor MAY include an R2RML data validator, but this is not required.

https://www.w3.org/TR/r2rml/
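To make the difference between simple concatenation and RFC 3986 resolution concrete, here is a small sketch using only the Python standard library. The base IRI and the sample values are assumptions chosen for illustration, and `concat` and `reconstruct` are hypothetical helper names.

```python
from typing import Optional
from urllib.parse import urljoin

BASE = "http://example.com/base/"  # assumed base IRI, for illustration only


def concat(base: str, value: str) -> str:
    """R2RML-style term generation: plain string concatenation."""
    return base + value


def reconstruct(base: str, absolute: str) -> Optional[str]:
    """Recover the original value by stripping the base prefix, if possible."""
    if absolute.startswith(base):
        return absolute[len(base):]
    return None


if __name__ == "__main__":
    for value in ("Venus", "../Venus"):
        concatenated = concat(BASE, value)
        resolved = urljoin(BASE, value)  # RFC 3986 style reference resolution
        print(f"value={value!r}")
        print(f"  concatenation: {concatenated} -> {reconstruct(BASE, concatenated)!r}")
        print(f"  resolution:    {resolved} -> {reconstruct(BASE, resolved)!r}")
```

For a plain value such as `Venus` both approaches give the same IRI. For `../Venus`, concatenation keeps the dot segments, so the original value can still be recovered, whereas resolution collapses them to `http://example.com/Venus` and the original value is lost.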

So in the new specification I'm all for separating this much better. I agree with @chrdebru that a Processor should not do this, but a Validator should.

Data errors cannot generally be detected by analyzing the table schema of the database, but only by scanning the data in the tables. For large and rapidly changing databases, this can be impractical. Therefore, R2RML processors are allowed to answer queries that do not “touch” a data error, and the behavior of such operations is well-defined. For the same reason, the conformance of R2RML mappings is defined without regard for the presence of data errors.

R2RML data validators can be used to explicitly scan a database for data errors.

A bit further on they go into detail about data errors, and there they do seem to make this distinction. This should come earlier in the spec IMO.

bjdmeest commented 7 months ago

I don't see the contradiction in R2RML: the engine only outputs elements that don't contain data errors, doesn't that comply with the sentence Therefore, R2RML processors are allowed to answer queries that do not “touch” a data error?

Should be clarified at least, specifically then that data errors in the case of IRI generation implies that the generation of IRI terms which result in non-valid IRIs result in NULL terms. (at least, that's how I interpret the current R2RML spec and test 20b)

Concerning the IRI normalization: this isn't actually normalization, it's 'just' prepending the @base in case of relative IRI generation (the spec specifically mentions that IRI resolution is out of scope). with rml:baseIRI we basically allow to override the default of using @base, but remain compatible with R2RML.

Do we have a specific reason not to allow prepending the @base in case of relative IRI terms @chrdebru ? Or do I misinterpret your PR?

DylanVanAssche commented 7 months ago

I don't see the contradiction in R2RML: the engine only outputs elements that don't contain data errors.

They say that validation is optional (Validator is not included in a Processor), but require Processors then to 'validate'? So if I don't do validation I comply as an implementation of a Processor with the spec, but fail the test-cases as they suddenly require validation to be included.

specifically then that data errors in the case of IRI generation implies that the generation of IRI terms which result in non-valid IRIs result in NULL terms.

Problem remains that validation is optional so you cannot ask Processors to do this. If they fail to build an IRI, they can return NULL, but there will be data errors that still allow IRI generation.

it's 'just' prepending the @base in case of relative IRI generation (the spec specifically mentions that IRI resolution is out of scope). with rml:baseIRI we basically allow to override the default of using @base, but remain compatible with R2RML.

We change this behavior compared to R2RML in the new spec with rml:baseIRI? I still see it as just taking the base IRI either from @base or from rml:baseIRI and then appending the relative IRI value to generate an absolute one. If that's valid, that's not a concern for the Processor.

bjdmeest commented 7 months ago

We change this behavior compared to R2RML in the new spec with rml:baseIRI? I still see it as just taking the base IRI either from @base or from rml:baseIRI and then appending the relative IRI value to generate an absolute one. If that's valid, that's not a concern for the Processor.

wrt base: we indeed don't change this behavior but extend it. agreement!

They say that validation is optional (Validator is not included in a Processor), but require Processors then to 'validate'? So if I don't do validation I comply as an implementation of a Processor with the spec, but fail the test-cases as they suddenly require validation to be included.

indeed, data validation is optional (ie incoming data) and I agree with that. Whether that also means that IRI validation (ie outgoing data) is optional: I'd like to see a separate issue on that

bjdmeest commented 7 months ago

When providing access to the output dataset, an R2RML processor MUST abort any operation that requires inspecting or returning an RDF term whose generation would give rise to a data error, and report an error to the agent invoking the operation.

chrdebru commented 7 months ago

R2RML Test cases state that the base IRI is assumed to be:

Throughout all test cases, we use the base IRI http://example.com/base/: to turn relative IRIs generated by the Direct Mapping into absolute IRIs, as the required base IRI input to the R2RML processor.

The base IRI is given as input (as stated in the spec); the test cases happen to use the same base IRI for the mapping, which leads to confusion.

An R2RML processor also has access to an execution environment consisting of: A SQL connection to the input database, a base IRI used in resolving relative IRIs produced by the R2RML mapping.

You can foresee cases where the base IRI of the mapping is different from that of the data. Hence the proposals: either keep the base IRI as an input, or adopt @dachafra's proposal for an rml:baseIRI per triples map.
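One possible reading of how the two proposals could fit together is sketched below. It is purely hypothetical: the `TriplesMap` class, its `base_iri` field (standing in for rml:baseIRI) and the precedence rule in `effective_base` are assumptions made for illustration, not something the spec or this thread has settled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriplesMap:
    """Hypothetical in-memory view of a triples map."""
    name: str
    base_iri: Optional[str] = None  # stand-in for an rml:baseIRI value, if any


def effective_base(triples_map: TriplesMap, processor_base: str) -> str:
    """Pick the base IRI for generated terms.

    Assumed precedence: a per-triples-map base IRI wins; otherwise fall
    back to the base IRI handed to the processor as input. The @base of
    the mapping document plays no role here.
    """
    if triples_map.base_iri is not None:
        return triples_map.base_iri
    return processor_base


if __name__ == "__main__":
    processor_base = "http://example.com/base/"  # assumed processor input
    students = TriplesMap("TriplesMap1")
    sports = TriplesMap("TriplesMap2", base_iri="http://data.example.org/")
    print(effective_base(students, processor_base))
    print(effective_base(sports, processor_base))
```

In this sketch the mapping document's own @base is never consulted, which keeps the base IRI of the generated data independent of where or how the mapping was serialized.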


Here is some code that loads a mapping (Turtle with a base IRI) into a named graph.

from rdflib import Dataset, URIRef

# The mapping declares @base, so <TriplesMap1> is a relative IRI in the document.
rdfstring = """@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@base <http://example.com/base/> .

<TriplesMap1>
    a rr:TriplesMap;
    rr:logicalTable [ rr:tableName "Student"; ];

    rr:subjectMap [ rr:column "Name"; rr:termType rr:IRI; ];

    rr:predicateObjectMap 
    [
        rr:predicate    rdf:type;
        rr:object       foaf:Person;
    ];
.
"""
# Load the mapping into a named graph of a dataset.
ds = Dataset()
g = ds.graph(URIRef('http://example.org/graph/g'))

g.parse(data=rdfstring, format="turtle")
# Serialize the whole dataset; relative IRIs were already resolved at parse time.
print(ds.serialize(format="trig"))

You "lose" the base IRI when outputting the data:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .

<http://example.org/graph/g> {
    <http://example.com/base/TriplesMap1> a rr:TriplesMap ;
        rr:logicalTable _:n65c8c8d2687a4c67a1b6119a3f4bb74db1 ;
        rr:predicateObjectMap _:n65c8c8d2687a4c67a1b6119a3f4bb74db3 ;
        rr:subjectMap _:n65c8c8d2687a4c67a1b6119a3f4bb74db2 .

    _:n65c8c8d2687a4c67a1b6119a3f4bb74db1 rr:tableName "Student" .

    _:n65c8c8d2687a4c67a1b6119a3f4bb74db2 rr:column "Name" ;
        rr:termType rr:IRI .

    _:n65c8c8d2687a4c67a1b6119a3f4bb74db3 rr:object foaf:Person ;
        rr:predicate rdf:type .
}

In other words, a mapping loaded into a graph cannot be executed anymore if you rely on @base declarations. The base IRI is only used to compute the absolute IRIs of relative IRIs within the scope of the current document. Relying on @base also assumes that mappings must use RDF serialization formats with base IRI declarations, and that @base is processed across documents.

bjdmeest commented 7 months ago

In summary for this PR: I'm fine with removing 20b for now, but only if we add 2 issues: (i) a clarification for the rml:baseIRI that takes this default behavior of using @base into account (which is an extension of R2RML, so I'm 👌 ), and (ii) an issue to discuss the requirement yes/no to only output valid RDF terms (which, if we say no, is in conflict wrt to R2RML, so I'm 👎 )

DylanVanAssche commented 7 months ago

@bjdmeest You have a clear view of these issues, would you mind opening them?

chrdebru commented 7 months ago

In summary for this PR: I'm fine with removing 20b for now, but only if we add 2 issues: (i) a clarification for the rml:baseIRI that takes this default behavior of using @base into account (which is an extension of R2RML, so I'm 👌 ), and (ii) an issue to discuss the requirement yes/no to only output valid RDF terms (which, if we say no, is in conflict wrt to R2RML, so I'm 👎 )

I just provided an example of why you cannot take @base into account. It was unintentionally confusing in R2RML's test cases to use the same base IRI for tests and input.

bjdmeest commented 7 months ago

ah yes, I agree with @chrdebru.

I made a separate issue for https://github.com/kg-construct/rml-core/issues/95

DylanVanAssche commented 7 months ago

Issues added, and we have a plan forward. Merging this.