Enhance validation and generation error description in the spec

bjdmeest commented 6 months ago

R2RML states

When providing access to the output dataset, an R2RML processor MUST abort any operation that requires inspecting or returning an RDF term whose generation would give rise to a data error, and report an error to the agent invoking the operation.

but also

Data errors cannot generally be detected by analyzing the table schema of the database, but only by scanning the data in the tables. For large and rapidly changing databases, this can be impractical. Therefore, R2RML processors are allowed to answer queries that do not “touch” a data error, and the behavior of such operations is well-defined. For the same reason, the conformance of R2RML mappings is defined without regard for the presence of data errors.

So, data validation (i.e., of incoming data) is optional for [R2]RML engines, and I agree with that.

The remaining question is: does that mean that IRI validation (i.e., of outgoing data) is optional? If it's not, every generated term should be validated (increasing processing time) to make sure the engine does not output invalid RDF. If it is, an RML engine (whose sole purpose is to generate RDF) could generate invalid RDF.

I vote for 'IRI validation MUST be part of the functionality of an RML engine', ~~but that doesn't mean that individual engines cannot support 'lenient' modes to generate RDF-like strings faster, but without validation~~ (I'm removing this last comment as it's out of scope).

chrdebru commented 6 months ago

There are two things here. First, the second part, "R2RML processors are allowed to answer queries," seems to imply the virtual KG case where R2RML is used to rewrite SPARQL queries into SQL. RML has decided to "drop" the inverse expression clause as virtualization is a different "problem."

Secondly, I agree that IRI validation must be part of the functionality. And, I interpret R2RML as being "data validation is optional" (see data type validation, for example), but they are strict when it comes to absolute IRIs. I would even go as far as to take the paragraph from R2RML:

The term generation rules, applied to a value, are as follows:

1. If value is NULL, then no RDF term is generated.
2. Otherwise, if the term map's term type is rr:IRI:
   2.1. Let value be the natural RDF lexical form corresponding to value.
   2.2. If value is a valid absolute IRI [RFC3987], then return an IRI generated from value.
   2.3. Otherwise, prepend value with the base IRI. If the result is a valid absolute IRI [RFC3987], then return an IRI generated from the result.
   2.4. Otherwise, raise a data error.

[...]
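
For illustration, the quoted rules for rr:IRI could be sketched roughly as follows. This is only a sketch under stated assumptions: `generate_iri`, `is_absolute_iri` and `DataError` are hypothetical names, and the validity check is a simplified stand-in for full RFC 3987 validation (it only looks for a scheme and rejects whitespace and a few characters that RFC 3987 excludes).

```python
import re
from typing import Optional

# Assumption: a simplified approximation of "valid absolute IRI", not a
# complete RFC 3987 parser. It requires a scheme followed by ":" and rejects
# whitespace and a handful of characters that are not allowed in IRIs.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")
_FORBIDDEN = re.compile(r'[\s<>"{}|\\^`]')


class DataError(Exception):
    """Hypothetical error type for step 2.4 (no valid absolute IRI)."""


def is_absolute_iri(value: str) -> bool:
    """Approximate check used for steps 2.2 and 2.3."""
    return bool(_SCHEME.match(value)) and not _FORBIDDEN.search(value)


def generate_iri(value: Optional[object], base_iri: str) -> Optional[str]:
    if value is None:
        # Step 1: NULL generates no RDF term.
        return None
    # Step 2.1: natural RDF lexical form (simplified here to str()).
    lexical = str(value)
    if is_absolute_iri(lexical):
        # Step 2.2: the value is already an absolute IRI.
        return lexical
    # Step 2.3: prepend the base IRI and check again.
    candidate = base_iri + lexical
    if is_absolute_iri(candidate):
        return candidate
    # Step 2.4: neither form is a valid absolute IRI.
    raise DataError(f"cannot generate a valid IRI from {lexical!r}")
```

With a base IRI of `http://example.com/`, the value `a` would be resolved against the base, while `a b` would fall through to step 2.4 because of the space.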

DylanVanAssche commented 6 months ago

I would not raise a data error but rather a generation error. This is my main problem with how R2RML handles this: its validation for data errors is optional, but it enforces IRI validation. With a separate generation error, we can easily distinguish between both cases, and it becomes much clearer that validation of the data itself is optional but the generated output must honor valid IRIs, RDF compliance, etc.

bjdmeest commented 6 months ago

In total agreement, except that I vote for extending 2.4 to either (i) raise a generation error (no output aka strict mode), or (ii) return NULL for invalid IRIs and continue (partial output aka lenient mode)
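
A rough sketch of what the two modes could look like, purely as an assumption about how an engine might expose them: `generate_terms`, the `strict` flag, `GenerationError` and the toy `looks_like_iri` check are all hypothetical names, not part of any spec or engine.

```python
from typing import Callable, Iterable, List, Optional


class GenerationError(Exception):
    """Hypothetical error raised in strict mode for an invalid IRI."""


def looks_like_iri(value: str) -> bool:
    # Toy validity check (assumption): a real engine would validate
    # against RFC 3987 instead.
    return ":" in value and " " not in value


def generate_terms(
    values: Iterable[str],
    is_valid: Callable[[str], bool] = looks_like_iri,
    strict: bool = True,
) -> List[Optional[str]]:
    terms: List[Optional[str]] = []
    for value in values:
        if is_valid(value):
            terms.append(value)
        elif strict:
            # (i) strict mode: abort, so no output is returned at all.
            raise GenerationError(f"invalid IRI: {value!r}")
        else:
            # (ii) lenient mode: treat the term as NULL and continue.
            terms.append(None)
    return terms
```

In lenient mode the caller gets partial output with `None` in place of each invalid IRI; in strict mode the first invalid IRI aborts the whole run.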

I'm not saying this needs to be fixed by yesterday though, there are more error-related issues that we can maybe all tackle together in a next iteration

DylanVanAssche commented 6 months ago

> In total agreement, except that I vote for extending 2.4 to either (i) raise a generation error (no output aka strict mode), or (ii) return NULL for invalid IRIs and continue (partial output aka lenient mode)

Oh yes 100% regarding strict or not.

> I'm not saying this needs to be fixed by yesterday though, there are more error-related issues that we can maybe all tackle together in a next iteration

Exactly! These kinds of discussions are what I wanted to spark by including the test cases in the KGC Challenge. This way we finally implement something to see if it actually works and come across these kinds of details. Thanks, all, for tackling them together!

chrdebru commented 6 months ago

Technically, an easy fix would be to use error codes:

1 = generation error
2 = validation error

You can then have engines that still produce some RDF, but you know that when the error is 1, you can "ignore" or "discard" the output. When it is 2, you can still decide whether to use the data or not.
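
As a sketch of how a caller might act on these codes: the values 1 and 2 come from the proposal above, while treating 0 as success, the `ErrorCode` enum, and the `output_is_usable` helper are assumptions made for illustration.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    GENERATION_ERROR = 1  # from the proposal above
    VALIDATION_ERROR = 2  # from the proposal above


def output_is_usable(code: int, accept_validation_errors: bool = False) -> bool:
    """Decide whether to keep an engine's output, given its error code."""
    if code == 0:
        # Assumption: 0 means the engine finished without errors.
        return True
    if code == ErrorCode.GENERATION_ERROR:
        # The output should be ignored or discarded.
        return False
    if code == ErrorCode.VALIDATION_ERROR:
        # The caller decides whether to use the data or not.
        return accept_validation_errors
    # Unknown codes are treated conservatively.
    return False
```

A pipeline could then call `output_is_usable(code, accept_validation_errors=True)` when it is willing to load data that failed input validation.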

DylanVanAssche commented 6 months ago

Speaking of error codes, maybe also one for invalid input parameters, e.g., mapping file, CLI args, etc.?

dachafra commented 1 month ago

I would suggest keeping this in mind for the working group but not incorporating it in the spec now. @bjdmeest @DylanVanAssche @chrdebru?

DylanVanAssche commented 1 month ago

+1 for WG

kg-construct / rml-core

Enhance validation and generation error description in the spec #95