datatype inference - Githubissues

VladimirAlexiev commented 7 months ago

A recent paper:

> R2RML and the original RML specification defined that RML processors can perform data type inference from the SQL databases. Thus, mappings did not have to specify rr:datatype for RDF Literals to have the correct data type as the processor would retrieve this automatically from the SQL database. However, RML did not expand this to other heterogeneous datasources such as XML or JSON which both provide data types in different ways: XML schemas, native JSON types, etc. Data type inference is still under discussion but might be moved to RML-IO because this RML module focuses on accessing and iterating over the data source.

and refers to https://github.com/kg-construct/rml-core/issues/87. I don't see much discussion of datatype inference there, so I'm posting this issue here.
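For context, the SQL inference the quoted passage describes can be sketched as a lookup table. The type pairs below follow the natural mapping table in the R2RML recommendation; the names `SQL_TO_XSD` and `infer_xsd_from_sql` are hypothetical and not taken from any RML processor.

```python
# Natural mapping of SQL datatypes to XSD datatypes, as tabulated in R2RML.
# Character string types have no entry: they become plain literals.
SQL_TO_XSD = {
    "BINARY": "xsd:hexBinary",
    "BINARY VARYING": "xsd:hexBinary",
    "BINARY LARGE OBJECT": "xsd:hexBinary",
    "NUMERIC": "xsd:decimal",
    "DECIMAL": "xsd:decimal",
    "SMALLINT": "xsd:integer",
    "INTEGER": "xsd:integer",
    "BIGINT": "xsd:integer",
    "FLOAT": "xsd:double",
    "REAL": "xsd:double",
    "DOUBLE PRECISION": "xsd:double",
    "BOOLEAN": "xsd:boolean",
    "DATE": "xsd:date",
    "TIME": "xsd:time",
    "TIMESTAMP": "xsd:dateTime",
}


def infer_xsd_from_sql(sql_type):
    """Return the XSD datatype for a SQL column type, or None for a plain literal."""
    # Drop any length/precision suffix such as "(10,2)" before the lookup.
    base = sql_type.upper().split("(")[0].strip()
    return SQL_TO_XSD.get(base)
```

With such a table a mapping needs no explicit rr:datatype for a `DECIMAL(10,2)` column, while a `VARCHAR` column yields `None` and stays a plain literal.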

Here are a couple of considerations:

- For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.
  - XML attributes and text content are always strings, so there's no place for implicit types, right?
  - One can specialize XSD types using restrictions and extensions, which is potentially mappable to rdfs:Datatype constructs, but I think this is clearly beyond the scope of RML.
  - XSD and RelaxNG have the concept of "post schema validation infoset" (PSVI) that can assign application types (e.g. Person) to elements. However, I don't think we should go there.
- For JSON, keep in mind that it does not define what a number is, which leads to a number of unpleasant surprises in JSON-LD. E.g. 12345678901234567890 is not an xsd:integer, and small decimals like 12.3 can be treated as float/double (e.g. 1.23e1) at will. So I'm not sure what can be tested here.
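The JSON concern can be reproduced in a few lines. This is a minimal sketch, assuming a consumer that stores every JSON number as an IEEE 754 double (common in JavaScript-based tooling); the helper name `survives_double` is hypothetical.

```python
import json


def survives_double(lexical):
    """True if an integer literal is unchanged after passing through a double."""
    exact = int(lexical)               # arbitrary-precision integer
    return int(float(exact)) == exact  # round trip through a 64-bit double


# 12.3 and 1.23e1 are the same JSON number, so the original lexical form
# cannot be recovered once the value has been parsed.
same_number = json.loads("12.3") == json.loads("1.23e1")
```

Here `survives_double("12345")` holds, while `survives_double("12345678901234567890")` does not, which is why a test case asserting xsd:integer for that value would depend on the parser rather than on the data.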

DylanVanAssche commented 7 months ago

> and refers to https://github.com/kg-construct/rml-core/issues/87. I don't see much discussion of datatype inference there, so I'm posting this issue here.

That issue discusses where the test cases should live: datatype extraction from data sources such as SQL is mentioned in the Core spec, while it might fit better in the IO spec.

> For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.

I agree here. The question is how implementations should extract this, given that XML can have separate XSD schemas etc.
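One narrow case can be sketched without any schema processing: a type asserted inline on the element through the `xsi:type` attribute. This is an assumption about where the type information lives; types declared only in a separate XSD are invisible to this approach, and the helper name `inline_type` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes as "{namespace-uri}local-name".
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def inline_type(element):
    """Return the raw xsi:type value of an element, or None if absent."""
    return element.get(XSI_TYPE)


doc = ET.fromstring(
    '<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    '<age xsi:type="xsd:integer">42</age>'
    '<name>Ann</name>'
    '</root>'
)

# Elements without an inline type map to None.
types = {child.tag: inline_type(child) for child in doc}
```

The returned value is still a prefixed name such as `xsd:integer`. ElementTree does not keep the prefix bindings, so resolving it to a full datatype IRI would need a namespace-aware parser.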

> For JSON, keep in mind that it does not define what a number is, which leads to a number of unpleasant surprises in JSON-LD. E.g. 12345678901234567890 is not an xsd:integer, and small decimals like 12.3 can be treated as float/double (e.g. 1.23e1) at will. So I'm not sure what can be tested here.

Interesting... I wonder why we cannot type a number as int for integers and double for floating-point numbers? JSON has a native number type, but maybe it does not differentiate between float and integer here?

VladimirAlexiev commented 7 months ago

@DylanVanAssche Correct: JSON has just "number".
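Because JSON has a single number type, any int/double split is a policy chosen by the parser. The sketch below shows one possible policy based on the lexical form, using the `parse_float` hook of Python's `json` module. The policy itself (no fraction or exponent gives xsd:integer, a fraction gives xsd:decimal, an exponent gives xsd:double) is an assumption for illustration, not something RML or this thread prescribes, and the function names are hypothetical.

```python
import json
from decimal import Decimal


def parse_number(lexical):
    """parse_float hook: keep exponent forms as float, plain fractions as Decimal."""
    return float(lexical) if "e" in lexical.lower() else Decimal(lexical)


def infer_json_datatype(value):
    """Map a parsed JSON scalar to an XSD datatype name, or None if unknown."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "xsd:boolean"
    if isinstance(value, int):
        return "xsd:integer"
    if isinstance(value, Decimal):
        return "xsd:decimal"
    if isinstance(value, float):
        return "xsd:double"
    if isinstance(value, str):
        return "xsd:string"
    return None


record = json.loads(
    '{"a": 12345678901234567890, "b": 12.3, "c": 1.23e1, "d": true}',
    parse_float=parse_number,
)
datatypes = {key: infer_json_datatype(value) for key, value in record.items()}
```

Under this policy `12.3` and `1.23e1` receive different datatypes even though they denote the same number, which illustrates why the expected result of a test case is hard to pin down.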

pmaria commented 7 months ago

> and refers to kg-construct/rml-core#87. I don't see much discussion of datatype inference there, so I'm posting this issue here.

Yes, the discussion is somewhat hidden, but "natural mapping of values" is definitely being discussed. The proposed plan is to introduce separate documents per reference formulation wherein this can be specified.

See:

> Here are a couple of considerations:
>
> - For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.
>   - XML attributes and text content are always strings, so there's no place for implicit types, right?
>   - One can specialize XSD types using restrictions and extensions, which is potentially mappable to rdfs:Datatype constructs, but I think this is clearly beyond the scope of RML.
>   - XSD and RelaxNG have the concept of "post schema validation infoset" (PSVI) that can assign application types (e.g. Person) to elements. However, I don't think we should go there.
> - For JSON, keep in mind that it does not define what a number is, which leads to a number of unpleasant surprises in JSON-LD. E.g. 12345678901234567890 is not an xsd:integer, and small decimals like 12.3 can be treated as float/double (e.g. 1.23e1) at will. So I'm not sure what can be tested here.

Thanks for this. Once we have specified this it would be great to have some review from you and other experts in the community on this @VladimirAlexiev.

DylanVanAssche commented 4 months ago

@bjdmeest Shouldn't this be moved to rml-io-registry?

kg-construct / rml-io

datatype inference #63