w3c / csvw

Documents produced by the CSV on the Web Working Group

Feature request: Add support for absolute URIs in CSV to RDF transformation #871

Closed jakubklimek closed 3 years ago

jakubklimek commented 3 years ago

With the current CSV on the Web specification, when I have a CSV column containing absolute URIs that I want to use as resource URIs in the resulting RDF, there is no way for me to specify this. The aboutUrl, propertyUrl and valueUrl properties are defined as relative to the table URL when used with {reference}. Therefore, the absolute URL from my input CSV is always percent-encoded and appended to the URL of the table.

In RDF::Tabular there seems to be a proprietary extension for this using the {+reference} syntax. Could this be adopted in CSV on the Web?

Sample CSV (in Czech):

číselník,číselník_název_cs,číselník_název_en,položka
https://data.mff.cuni.cz/zdroj/číselníky/sekce,Sekce MFF UK,Schools of FMP CUNI,https://data.mff.cuni.cz/zdroj/číselníky/sekce/položky/informatika

Sample CSVW descriptor:

{
    "@id": "https://data.mff.cuni.cz/soubory/číselníky/sekce.csv-metadata.json",
    "@context": [
        "http://www.w3.org/ns/csvw",
        {
            "@language": "cs"
        }
    ],
    "url": "sekce.csv",
    "tableSchema": {
        "columns": [{
            "name": "ciselnik",
            "titles": "číselník",
            "dc:description": "IRI číselníku",
            "aboutUrl": "{+ciselnik}",
            "propertyUrl": "rdf:type",
            "valueUrl": "skos:ConceptScheme",
            "required": true,
            "datatype": "anyURI"
        }, {
            "name": "ciselnik_nazev_cs",
            "titles": "číselník_název_cs",
            "dc:description": "Název číselníku v češtině",
            "aboutUrl": "{+ciselnik}",
            "propertyUrl": "skos:prefLabel",
            "required": true,
            "datatype": "string",
            "lang": "cs"
        }, {
            "name": "ciselnik_nazev_en",
            "titles": "číselník_název_en",
            "dc:description": "Název číselníku v angličtině",
            "aboutUrl": "{+ciselnik}",
            "propertyUrl": "skos:prefLabel",
            "required": true,
            "datatype": "string",
            "lang": "en"
        }, {
            "name": "polozka",
            "titles": "položka",
            "dc:description": "IRI položky",
            "aboutUrl": "{+polozka}",
            "propertyUrl": "rdf:type",
            "valueUrl": "skos:Concept",
            "required": true,
            "datatype": "anyURI"
        }],
        "primaryKey": "polozka"
    }
}

Expected output RDF (from RDF::Tabular):

<https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/sekce> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#ConceptScheme> .
<https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/sekce> <http://www.w3.org/2004/02/skos/core#prefLabel> "Sekce MFF UK"@cs .
<https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/sekce> <http://www.w3.org/2004/02/skos/core#prefLabel> "Schools of FMP CUNI"@en .
<https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/sekce/polo%C5%BEky/informatika> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .

Actual RDF output without {+ref} syntax:

<https://data.mff.cuni.cz/soubory/%C4%8D%C3%ADseln%C3%ADky/https%3A%2F%2Fdata.mff.cuni.cz%2Fzdroj%2F%C4%8D%C3%ADseln%C3%ADky%2Fsekce> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#ConceptScheme> .
<https://data.mff.cuni.cz/soubory/%C4%8D%C3%ADseln%C3%ADky/https%3A%2F%2Fdata.mff.cuni.cz%2Fzdroj%2F%C4%8D%C3%ADseln%C3%ADky%2Fsekce> <http://www.w3.org/2004/02/skos/core#prefLabel> "Schools of FMP CUNI"@en .
<https://data.mff.cuni.cz/soubory/%C4%8D%C3%ADseln%C3%ADky/https%3A%2F%2Fdata.mff.cuni.cz%2Fzdroj%2F%C4%8D%C3%ADseln%C3%ADky%2Fsekce> <http://www.w3.org/2004/02/skos/core#prefLabel> "Sekce MFF UK"@cs .

<https://data.mff.cuni.cz/soubory/%C4%8D%C3%ADseln%C3%ADky/https%3A%2F%2Fdata.mff.cuni.cz%2Fzdroj%2F%C4%8D%C3%ADseln%C3%ADky%2Fsekce%2Fpolo%C5%BEky%2Finformatika> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .

gkellogg commented 3 years ago

These properties are defined as relative to a base URI so that the URI/IRI reference resolution algorithm defined in RFC 3986 applies. In this case, resolving an absolute IRI against a base IRI results in the original absolute IRI, so the spec does what you want.

This is spelled out in the Normalization section:

  1. If the property is a link property the value is turned into an absolute URL using the base URL and normalized as described in URL Normalization [tabular-data-model].

This description is common in specs that need to deal with relative URIs/IRIs.
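
The resolution behavior described above can be illustrated with a minimal sketch. It uses Python's standard urllib.parse.urljoin as a stand-in for RFC 3986 reference resolution; the URLs are shortened ASCII placeholders rather than the exact ones from the issue, and a real CSVW processor may use a different resolver.

```python
from urllib.parse import urljoin

# Hypothetical table URL used as the base for resolution.
base = "https://data.mff.cuni.cz/soubory/sekce.csv"

# An absolute reference already carries its own scheme and authority,
# so resolving it against the base returns it unchanged.
absolute_ref = "https://data.mff.cuni.cz/zdroj/sekce"
resolved_absolute = urljoin(base, absolute_ref)

# A relative reference is merged with the base path instead.
relative_ref = "sekce/informatika"
resolved_relative = urljoin(base, relative_ref)

print(resolved_absolute)
print(resolved_relative)
```

The first result equals the input reference, while the second is placed under the directory of the table URL.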

jakubklimek commented 3 years ago

Thanks for the explanation. After reading RFC 6570 I realized that the {+ref} syntax is actually defined by that RFC, even though it is not mentioned anywhere in the CSV on the Web spec. The RFC also explains the behavior: with {ref}, the : and / characters are percent-encoded first, so the result is treated as a relative URI, while with {+ref} those characters are not percent-encoded, so the result is treated as an absolute URI.

My confusion came from just reading the CSV on the Web spec, and not knowing RFC6570.
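
The difference between {ref} and {+ref} can be sketched roughly as follows. This is not a full RFC 6570 implementation: expand_simple and expand_reserved are hypothetical helpers that approximate simple and reserved expansion with urllib.parse.quote, the table URL is assumed from the outputs quoted above, and the reserved variant passes every % through instead of only valid percent-encoded triplets.

```python
from urllib.parse import quote, urljoin

# Character sets from RFC 3986; quote() always leaves letters and digits alone.
UNRESERVED = "-._~"
RESERVED = ":/?#[]@!$&'()*+,;="

def expand_simple(value: str) -> str:
    """Approximate {var}: percent-encode everything except unreserved characters."""
    return quote(value, safe=UNRESERVED)

def expand_reserved(value: str) -> str:
    """Approximate {+var}: also keep reserved characters (and '%') as they are."""
    return quote(value, safe=UNRESERVED + RESERVED + "%")

# Assumed table URL, percent-encoded as in the N-Triples output above.
base = "https://data.mff.cuni.cz/soubory/%C4%8D%C3%ADseln%C3%ADky/sekce.csv"
cell = "https://data.mff.cuni.cz/zdroj/číselníky/sekce"

# {ciselnik}: ':' and '/' are encoded, so the result has no scheme and
# resolves as a relative reference under the table's directory.
with_simple = urljoin(base, expand_simple(cell))

# {+ciselnik}: the scheme and slashes survive, so the result is absolute
# and resolution leaves it as is.
with_reserved = urljoin(base, expand_reserved(cell))

print(with_simple)
print(with_reserved)
```

Under these assumptions, with_simple reproduces the subject of the "actual" triples and with_reserved the subject of the "expected" triples shown earlier in the thread.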