kg-construct / rml-io

RML-IO: Input/Output declarations for RML
https://w3id.org/rml/io/spec
Creative Commons Attribution 4.0 International

Separator for CSVs #86

Closed midorna closed 1 month ago

midorna commented 1 month ago

Observation

As far as I understand, we cannot define a field separator for CSV data in RML. Hence, semicolon-separated data fields, for example, would not be processed correctly. Side note: YARRRML has a separator definition, but this information cannot be carried over when the mapping is converted to RML.

Proposal

Define in RML-IO a property rml:separator on rml:LogicalSource.

DylanVanAssche commented 1 month ago

CSVW can specify a delimiter for the CSV records so you can have a semicolon for example as delimiter. Is that what you are looking for?
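For illustration, a minimal sketch of a logical source that declares a semicolon delimiter through CSVW could look like the following. The subject IRI `<#SemicolonSource>` and the file URL are hypothetical placeholders, and whether a given engine honors the dialect depends on its RML-IO support.

```turtle
@prefix rml:  <http://w3id.org/rml/> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .

# Hypothetical logical source; the IRI and the URL below are placeholders.
<#SemicolonSource> a rml:LogicalSource ;
  rml:source [
    a rml:Source, csvw:Table ;
    csvw:url "http://example.org/data.csv" ;
    # The CSVW dialect carries the field separator.
    csvw:dialect [
      a csvw:Dialect ;
      csvw:delimiter ";"
    ]
  ] ;
  rml:referenceFormulation rml:CSV .
```

Here the separator is part of the source access description (CSVW), while the reference formulation stays an RML-IO concern.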

midorna commented 1 month ago

@DylanVanAssche That would be an option, but not necessarily supported by RML-conformant parsers. Hence, I opt for an integration in the standard. And the field separator is an I/O property, which is missing. As I mentioned, parsers may just ignore information on data sources and targets which is not defined in standard RML. I know that yarrrml-parser adds CSVW dialect information "on-the-fly", but other existing YARRRML and RML parsers do not.

DylanVanAssche commented 1 month ago

> That would be an option, but not necessarily supported by RML-conformant parsers. Hence, I opt for an integration in the standard. And the field separator is an I/O property, which is missing.

RMLMapper supports CSVW as Source. The idea behind RML-IO is to reuse external vocabularies such as DCAT, CSVW, etc. as much as possible to handle access to the data source. We included rml:null in the spec, for example, because most vocabularies do not define it even though it is widely used. It is up to the engine to actually support this; if it does not, I suggest you open an issue in the engine's repository. Which engine(s) are you considering?

> As I mentioned, parsers may just ignore information on data sources and targets which is not defined in standard RML. I know that yarrrml-parser adds CSVW dialect information "on-the-fly", but other existing YARRRML and RML parsers do not.

If the engine ignores it, it is probably a bug...

midorna commented 1 month ago

> > That would be an option, but not necessarily supported by RML-conformant parsers. Hence, I opt for an integration in the standard. And the field separator is an I/O property, which is missing.

> RMLMapper supports CSVW as Source. The idea behind RML-IO is to reuse external vocabularies such as DCAT, CSVW, etc. as much as possible to handle access to the data source. We included rml:null in the spec, for example, because most vocabularies do not define it even though it is widely used. It is up to the engine to actually support this; if it does not, I suggest you open an issue in the engine's repository. Which engine(s) are you considering?

I think reusing a vocabulary and requiring a mapping engine to process its properties are different things, but I get the intent. Maybe the vocabularies to be considered could then be added to the specification.

> > As I mentioned, parsers may just ignore information on data sources and targets which is not defined in standard RML. I know that yarrrml-parser adds CSVW dialect information "on-the-fly", but other existing YARRRML and RML parsers do not.

> If the engine ignores it, it is probably a bug...

Regarding engines, I checked YATTER and morph-kgc. I will raise an issue there.

Thanks!

DylanVanAssche commented 1 month ago

> Maybe the vocabularies to be considered could then be added to the specification

We already have some examples in the specification to demonstrate this. The main reason we went for this approach is that we can never include all access descriptions in the specification. New types of sources and targets are created regularly, so the spec provides basic descriptions to align these new sources with RML and leaves access to the specialized vocabularies. Note that this applies only to how the source is accessed; the reference formulation and the iterator are still part of RML-IO.

> Regarding engines, I checked YATTER and morph-kgc. I will raise an issue there.

Ah, yeah Morph-KGC does not follow the latest specification nor does it support RML-IO. It only covers RML-Core. The most compliant engines with the latest spec are, in order of compliance:

You can find the compliance results in the Knowledge Graph Construction Challenge of the KGCW: https://ceur-ws.org/Vol-3718/. Participating engines provided a report of their compliance with the new RML specifications.

midorna commented 1 month ago

@DylanVanAssche Thanks a lot for your support and the links! I will try using the recommended engines.

dachafra commented 1 month ago

The discussion was engine dependent, and it continued in https://github.com/morph-kgc/morph-kgc/issues/265