kg-construct / rml-core

RML-Core: Main features for RDF generation with RML
https://w3id.org/rml/core/spec
Creative Commons Attribution 4.0 International
Templates, safe separators, and null-values #47

Closed pmaria closed 1 year ago

pmaria commented 2 years ago

TL;DR

R2RML defines a safe separator that should be used when a template contains more than one value/column reference. This issue discusses the use case for it, how null values in value references in templates should be handled, and tries to determine how current (R2)RML processors handle this.

Description

In the definition of templates in R2RML the following is stated:

> If a template contains multiple pairs of unescaped curly braces, then any pair SHOULD be separated from the next one by a safe separator. This is any character or string that does not occur anywhere in any of the data values of either referenced column; or in the IRI-safe versions of the data values, if the term type is rr:IRI (see note below).

A few things of note here:

The only use case I can imagine for a safe separator is to not get clashes when a referenced value in a template with multiple referenced values is empty.

Given

| A  | B    |
|----|------|
| A  | ~    |
| A~ | NULL |

a template without a safe separator, "http://example.com/{A}{B}", that does not return null when one of its referenced values is null would result in

http://example.com/A~
http://example.com/A~

an (undesired??) clash.

A template with a safe separator, "http://example.com/{A}-{B}", that does not return null when one of its referenced values is null would result in

http://example.com/A-~
http://example.com/A~-

no clash.

Of course, with or without a safe separator, if a template should evaluate to null when one of its referenced values is null, the second row produces nothing in both cases, and the only result for "http://example.com/{A}{B}" would be:

http://example.com/A~

However, safe separators don't actually solve clashes when the same value reference can contain empty strings and NULLs. Given:

| A  | B              |
|----|----------------|
| A  | ~              |
| A~ | NULL           |
| A~ | (empty string) |

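In the first case (no safe separator, no null for the template) you'd get all clashes:

http://example.com/A~
http://example.com/A~
http://example.com/A~

In the second case (safe separator, no null for the template) you'd get one clash:

http://example.com/A-~
http://example.com/A~-
http://example.com/A~-

In the third case (null for the template when any reference is null; although it probably depends on the processor) you'd get one clash:

http://example.com/A~
http://example.com/A~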
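The three cases can be reproduced with a small Python sketch. The `expand` helper and its `null_if_any_null` flag are hypothetical names introduced here for illustration and are not taken from any (R2)RML processor; IRI-safe percent-encoding and escaped curly braces are deliberately ignored.

```python
import re
from typing import List, Mapping, Optional

# Matches a simple {name} placeholder (escaped braces are not handled here).
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def expand(template: str,
           row: Mapping[str, Optional[str]],
           null_if_any_null: bool) -> Optional[str]:
    """Hypothetical template expansion; None stands for SQL NULL."""
    names = PLACEHOLDER.findall(template)
    if null_if_any_null and any(row[name] is None for name in names):
        return None  # the whole template evaluates to null
    # Otherwise a NULL reference is rendered as the empty string.
    return PLACEHOLDER.sub(lambda m: row[m.group(1)] or "", template)


rows: List[Mapping[str, Optional[str]]] = [
    {"A": "A", "B": "~"},
    {"A": "A~", "B": None},  # NULL
    {"A": "A~", "B": ""},    # empty string
]

no_sep = "http://example.com/{A}{B}"
with_sep = "http://example.com/{A}-{B}"

case1 = [expand(no_sep, r, False) for r in rows]    # no separator, nulls kept
case2 = [expand(with_sep, r, False) for r in rows]  # separator, nulls kept
case3 = [expand(no_sep, r, True) for r in rows]     # null if any reference is null

for label, result in (("case 1", case1), ("case 2", case2), ("case 3", case3)):
    print(label, result)
```

In this sketch the empty string is not treated as null, which is why the first and third rows still clash in the third case; a processor that equates the two would behave differently.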

Questions

  1. Is there a use case to generate a value for a template when one of its referenced values is null?

    a. If yes, what happens if all referenced values are null?

  2. Do we want to maintain safe separators?

  3. How do current (R2)RML processors handle this? (Since this is not covered in the R2RML test cases)

  4. Is there another use case for safe separators that I'm missing?

Overview template implementation in current processors

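| Processor  | Requires safe separator in templates | Returns value for templates with one or more null references |
|------------|--------------------------------------|--------------------------------------------------------------|
| CARML      | NO                                   | NO                                                           |
| RMLMapper  | NO                                   | NO                                                           |
| Morph-KGC  | NO                                   | NO                                                           |
| SDM-RDFizer| NO                                   | NO                                                           |
| Ontop      | NO                                   | NO                                                           |
| R2RML-F    | NO                                   | NO                                                           |

pmaria commented 2 years ago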

Pinging @andimou @dachafra @DylanVanAssche @chrdebru @frmichel @samiscoding @ArenasGuerreroJulian @ghxiao @jatoledo @sumutcan @marioscrock to provide input on the implementation of templates in the processors you're involved in so I can add it to the above overview.

Also if I'm missing anyone please feel free to ping them.

Of course any other input is also welcome ;)

bjdmeest commented 2 years ago

I can confirm RMLMapper has the same behavior as CARML :)

I don't see the point of including the safe separator SHOULD, so I'm fine with removing it.

However, we have encountered a couple of cases where we wanted to be able to create a value that came from multiple values that weren't all necessarily filled in, so something like "example.com/{possibleID1}{possibleID2}{possibleID3}", where not all IDs are always filled in. We were able to do that by bypassing the rr:template and using grel:array_join instead.

We can probably come up with a suggestion where NULL values are interpreted as 'rightfully NULL' vs 'empty string', then, we can state 'create from a template using possibleID1-3, but interpret their absence as empty string', which would give the same behavior as with using grel:array_join.

However, that has the side-effect that it is possible to create example.com/-nodes when no IDs are filled in. We currently solve that by adding an additional condition to the triplemap to only create it if !((possibleID1 == null) && (possibleID2 == null) && (possibleID3 == null)). I can't come up with a more elegant solution right now
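As a rough Python analogue of the workaround described above (not the actual grel:array_join implementation), the sketch below joins whichever IDs are present and returns nothing when all of them are absent. The function name `join_ids` and the assumption that absent IDs are simply skipped are illustrative only.

```python
from typing import Optional, Sequence


def join_ids(base: str,
             ids: Sequence[Optional[str]],
             sep: str = "") -> Optional[str]:
    """Hypothetical helper: append the non-null IDs to base.

    Returns None when every ID is null, mirroring the extra condition
    !((possibleID1 == null) && (possibleID2 == null) && (possibleID3 == null)).
    """
    present = [value for value in ids if value is not None]
    if not present:
        return None  # avoid generating a bare "example.com/" node
    return base + sep.join(present)


print(join_ids("http://example.com/", ["a", None, "c"]))
print(join_ids("http://example.com/", [None, None, None]))
```

Folding the all-null check into the join itself is one way to avoid a separate condition on the triples map, at the cost of hiding that rule inside a function.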

pmaria commented 2 years ago

Thanks!

> However, we have encountered a couple of cases where we wanted to be able to create a value that came from multiple values that weren't all necessarily filled in, so something like "example.com/{possibleID1}{possibleID2}{possibleID3}", where not all IDs are always filled in. We were able to do that by bypassing the rr:template and using grel:array_join instead.

Indeed, this is how I've done this as well, using either built-in functions in the reference formulation, or a custom function.

> We can probably come up with a suggestion where NULL values are interpreted as 'rightfully NULL' vs 'empty string', then, we can state 'create from a template using possibleID1-3, but interpret their absence as empty string', which would give the same behavior as with using grel:array_join.
>
> However, that has the side-effect that it is possible to create example.com/-nodes when no IDs are filled in. We currently solve that by adding an additional condition to the triplemap to only create it if !((possibleID1 == null) && (possibleID2 == null) && (possibleID3 == null)). I can't come up with a more elegant solution right now

Right, template engines handle null values in different ways. However, a complicating factor here is that we have to support different reference formulations, each with its own syntax, so adding new syntax is quite difficult.

IMHO the most sane way to handle this is to always return null whenever one of the references in a template is null, and to leave the substitution of nulls to the expression in the reference formulation. But I'm interested in other opinions.

arenas-guerrero-julian commented 2 years ago

Hi,

IMO the R2RML spec is clear on whether to generate triples from NULL values:

> A term map is a function that generates an RDF term from a logical table row. The result of that function can be:
>
> Empty – if any of the referenced columns of the term map has a NULL value,

It clearly states ANY, and if the RDF term is not to be generated the triple itself is not to be generated.

I do not see the point of having safe separators (but maybe I am missing something).

Morph-KGC ignores safe separators. As for NULL, if any reference is NULL the triple is not generated. There is also an option na_values to indicate which values should be interpreted as NULL (e.g., N/A), which is especially useful for data files.
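The idea behind such an option can be sketched generically in Python. This is not Morph-KGC's code or API; only the option name na_values comes from the comment above, and `normalize_nulls` is a hypothetical helper.

```python
from typing import Dict, Iterable, Mapping, Optional


def normalize_nulls(row: Mapping[str, Optional[str]],
                    na_values: Iterable[str]) -> Dict[str, Optional[str]]:
    """Hypothetical helper: map configured marker strings to None (NULL)."""
    markers = set(na_values)
    return {
        key: None if value is None or value in markers else value
        for key, value in row.items()
    }


row = {"A": "x", "B": "N/A", "C": ""}
print(normalize_nulls(row, na_values=["N/A", ""]))
```

Whether the empty string belongs in the marker list is exactly the kind of decision that changes how many template results clash.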

samiscoding commented 2 years ago

> Pinging @andimou @dachafra @DylanVanAssche @chrdebru @frmichel @samiscoding @ArenasGuerreroJulian @ghxiao @jatoledo @sumutcan @marioscrock to provide input on the implementation of templates in the processors you're involved in so I can add it to the above overview.
>
> Also if I'm missing anyone please feel free to ping them.
>
> Of course any other input is also welcome ;)

SDM-RDFizer has the same behavior as well, i.e., no separator is needed and no triple is generated if at least one of the references to the data in the template is null.

ghxiao commented 2 years ago

Ontop actually behaves the same.

pmaria commented 2 years ago

> Hi,
>
> IMO the R2RML spec is clear on whether to generate triples from NULL values:
>
> A term map is a function that generates an RDF term from a logical table row. The result of that function can be:
>
> Empty – if any of the referenced columns of the term map has a NULL value,
>
> It clearly states ANY, and if the RDF term is not to be generated the triple itself is not to be generated.

Ah, nice one. Thanks for pointing that out. That, together with this sentence, makes it abundantly clear indeed:

> The referenced columns of a template-valued term map is the set of column names enclosed in unescaped curly braces in the template string.
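Read together, the two quoted sentences amount to a simple rule, sketched below in Python. The function names are hypothetical, and the scanner assumes the R2RML convention that a backslash escapes curly braces and backslashes; it is an illustration, not code from any processor.

```python
from typing import List, Mapping, Optional, Set


def referenced_columns(template: str) -> Set[str]:
    """Collect the names enclosed in unescaped curly braces."""
    columns: Set[str] = set()
    current: Optional[List[str]] = None  # None while outside a {...} pair
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            # An escaped character is literal and never acts as a delimiter.
            if current is not None:
                current.append(template[i + 1])
            i += 2
            continue
        if ch == "{" and current is None:
            current = []
        elif ch == "}" and current is not None:
            columns.add("".join(current))
            current = None
        elif current is not None:
            current.append(ch)
        i += 1
    return columns


def term_is_empty(template: str, row: Mapping[str, Optional[str]]) -> bool:
    """True if ANY referenced column is NULL (None), so no term is generated."""
    return any(row.get(name) is None for name in referenced_columns(template))


print(referenced_columns("http://example.com/{A}-{B}"))
print(term_is_empty("http://example.com/{A}{B}", {"A": "A~", "B": None}))
```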

chrdebru commented 2 years ago

R2RML-F does not require safe separators and does not return anything if any of the values is NULL. So my interpretation is in line with the others.

pmaria commented 2 years ago

It looks like we can drop safe separators from the template description.

@sumutcan @marioscrock @frmichel reminder for your input, if you find it relevant. I'll close this issue by the end of this week.

dachafra commented 1 year ago

> I'll close this issue by the end of this week.

That was one year ago. I think we can close this issue.