w3c / csvw

Documents produced by the CSV on the Web Working Group

Feature request: Support for IRIs #872

Closed jakubklimek closed 3 years ago

jakubklimek commented 3 years ago

The current CSVW version and its implementations mostly require ASCII-based URIs. In Czechia (and elsewhere in Europe, Asia, etc.) IRIs (with UTF-8 characters) are quite widespread. IRIs have also been standard in RDF 1.1 (since 2014). It would be nice if we did not have to percent-encode everything (see #871 for examples). In addition, an IRI such as https://data.mff.cuni.cz/zdroj/číselníky/sekce, percent-encoded into the URI https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/sekce, is simply a different IRI for RDF implementations.

When I produce a CSV file containing IRIs (from RDF 1.1 data using IRIs) using SPARQL SELECT, I am currently unable to reconstruct it using CSVW, due to the percent encoding and ASCII-only URI issues.

gkellogg commented 3 years ago

The spec is certainly consistent with the use of IRIs, as is URI Template. Although it doesn't have a normative reference to RFC3987, it does depend on JSON-LD, which is based on RFC3987.

Where are you finding limitations for ASCII-only RFC3986 URIs?

I don't recall why there was no normative reference to RFC3987.

jakubklimek commented 3 years ago

One part of the issue might be coming from this paragraph of RFC 6570:

> Although the URI syntax is used for the result, the template string is allowed to contain the broader set of characters that can be found in Internationalized Resource Identifier (IRI) references [RFC3987]. Therefore, a URI Template is also an IRI template, and the result of template processing can be transformed to an IRI by following the process defined in Section 3.2 of [RFC3987].

This means that when I define reference = https://data.mff.cuni.cz/zdroj/číselníky/sekce and use the {+reference} URI Template, the result is indeed a URI, i.e. https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/sekce. The paragraph says that this URI can be transformed back to an IRI, which is true; however, this is not happening, at least in the RDF::Tabular implementation. Therefore, this would require additional post-processing (percent decoding).
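To make the behaviour concrete, here is a minimal Python sketch of RFC 6570 reserved expansion for a single string value. It is only an illustration, not the code used by RDF::Tabular or the addressable gem; the function name expand_reserved is hypothetical.

```python
import re
from urllib.parse import quote

# RFC 3986 reserved characters, which {+var} copies through unchanged.
# quote() already leaves unreserved characters (letters, digits, "-._~") alone.
RESERVED = ":/?#[]@!$&'()*+,;="

# Existing percent-encoded triplets are also copied through by {+var}.
_PCT_TRIPLET = re.compile(r"(%[0-9A-Fa-f]{2})")


def expand_reserved(value: str) -> str:
    """Hypothetical helper: expand one string value the way {+var} would."""
    parts = _PCT_TRIPLET.split(value)
    # With a capturing group, odd indices hold the matched triplets.
    return "".join(
        part if index % 2 else quote(part, safe=RESERVED)
        for index, part in enumerate(parts)
    )
```

Under these assumptions, expand_reserved("https://data.mff.cuni.cz/zdroj/číselníky/sekce") yields the percent-encoded URI shown above, because every non-ASCII character is UTF-8 encoded and then percent-encoded.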

This then causes problems because in RDF 1.1, IRIs are tested for equality using simple string comparison, and therefore the percent-encoded and percent-decoded forms are treated as different. The RDF 1.1 spec also recommends avoiding percent-encoding in IRIs where the IRI syntax does not require it.
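A small Python illustration of the equality problem, assuming only that IRI equality is plain string comparison as described above. The helper same_rdf_iri is hypothetical and stands in for whatever comparison an RDF library performs.

```python
from urllib.parse import unquote

# .../zdroj/číselníky/sekce, written with escapes to keep the source ASCII-safe
iri = "https://data.mff.cuni.cz/zdroj/\u010d\u00edseln\u00edky/sekce"
encoded = "https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/sekce"


def same_rdf_iri(a: str, b: str) -> bool:
    """Hypothetical helper: compare IRIs character by character, no normalization."""
    return a == b


print(same_rdf_iri(iri, encoded))           # False: two distinct nodes in a graph
print(same_rdf_iri(iri, unquote(encoded)))  # True only after percent-decoding
```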

Another part of the issue is coming from my attempts to use IRIs in url. I suspect the following might actually be a problem with implementation rather than the spec, however, since I cannot find a functional implementation besides RDF::Tabular, it is hard to say.

Error when IRI is passed from command line:

$ rdf serialize --input-format tabular https://data.mff.cuni.cz/soubory/číselníky/mzdové-třídy.csv-metadata.json
Traceback (most recent call last):
        13: from /usr/local/bin/rdf:23:in `<main>'
        12: from /usr/local/bin/rdf:23:in `load'
        11: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/bin/rdf:13:in `<top (required)>'
        10: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:504:in `exec'
         9: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:645:in `parse'
         8: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:645:in `each'
         7: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:646:in `block in parse'
         6: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:212:in `open'
         5: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:221:in `open'
         4: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/file.rb:307:in `open_file'
         3: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/file.rb:124:in `open_url'
         2: from /usr/lib/ruby/2.7.0/uri/common.rb:234:in `parse'
         1: from /usr/lib/ruby/2.7.0/uri/rfc3986_parser.rb:73:in `parse'
/usr/lib/ruby/2.7.0/uri/rfc3986_parser.rb:21:in `split': URI must be ascii only "https://data.mff.cuni.cz/soubory/\u010D\u00EDseln\u00EDky/mzdov\u00E9-t\u0159\u00EDdy.csv-metadata.json" (URI::InvalidURIError)

Error when an IRI is used in url (inside the metadata file):

$ rdf serialize --input-format tabular https://data.mff.cuni.cz/soubory/%C4%8D%C3%ADseln%C3%ADky/mzdov%C3%A9-t%C5%99%C3%ADdy.csv-metadata.json
Traceback (most recent call last):
        38: from /usr/local/bin/rdf:23:in `<main>'
        37: from /usr/local/bin/rdf:23:in `load'
        36: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/bin/rdf:13:in `<top (required)>'
        35: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:504:in `exec'
        34: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:645:in `parse'
        33: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:645:in `each'
        32: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:646:in `block in parse'
        31: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:212:in `open'
        30: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:221:in `open'
        29: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/file.rb:340:in `open_file'
        28: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:244:in `block in open'
        27: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:244:in `new'
        26: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/reader.rb:67:in `initialize'
        25: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:321:in `initialize'
        24: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:321:in `instance_eval'
        23: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/reader.rb:88:in `block in initialize'
        22: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/logger.rb:195:in `log_depth'
        21: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/logger.rb:261:in `log_depth'
        20: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/reader.rb:147:in `block (2 levels) in initialize'
        19: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:648:in `block (2 levels) in parse'
        18: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/cli.rb:505:in `block in exec'
        17: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/mixin/mutable.rb:72:in `<<'
        16: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/mixin/writable.rb:29:in `<<'
        15: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/mixin/writable.rb:86:in `insert_reader'
        14: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/mixin/writable.rb:129:in `insert_statements'
        13: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/reader.rb:167:in `each_statement'
        12: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/logger.rb:195:in `log_depth'
        11: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/logger.rb:261:in `log_depth'
        10: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/reader.rb:198:in `block in each_statement'
         9: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/metadata.rb:1340:in `each_table'
         8: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/metadata.rb:1340:in `each'
         7: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/metadata.rb:1341:in `block in each_table'
         6: from /var/lib/gems/2.7.0/gems/rdf-tabular-3.1.0/lib/rdf/tabular/reader.rb:208:in `block (2 levels) in each_statement'
         5: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/reader.rb:221:in `open'
         4: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/file.rb:307:in `open_file'
         3: from /var/lib/gems/2.7.0/gems/rdf-3.1.4/lib/rdf/util/file.rb:124:in `open_url'
         2: from /usr/lib/ruby/2.7.0/uri/common.rb:234:in `parse'
         1: from /usr/lib/ruby/2.7.0/uri/rfc3986_parser.rb:73:in `parse'
/usr/lib/ruby/2.7.0/uri/rfc3986_parser.rb:21:in `split': URI must be ascii only "https://data.mff.cuni.cz/soubory/%C4%8D%C3%ADseln%C3%ADky/mzdov\u00E9-t\u0159\u00EDdy.csv" (URI::InvalidURIError)

And when I want to use just the URI of the CSV file:

$ rdf serialize --input-format tabular https://data.mff.cuni.cz/soubory/%C4%8D%C3%ADseln%C3%ADky/mzdov%C3%A9-t%C5%99%C3%ADdy.csv

the .csv-metadata.json file is apparently not detected or used, because the result is shaped by the default CSV on the Web algorithm.

By the way, do you know of any other CSVW CSV-to-RDF implementations besides the two passing the tests in the original implementation report?

gkellogg commented 3 years ago

While RDF::Tabular covers the most ground, there is also csvlint, also in Ruby. It's disappointing that the spec hasn't obtained greater adoption, but that may still happen given sufficient community interest.

Regarding the "URI must be ascii only" error, I believe that comes from the Ruby URI standard library, which is invoked when trying to download the referenced URL. RDF.rb supports alternative HTTP adaptors, and it's probably just a matter of using one of those. This should probably be taken up at http://github.com/ruby-rdf/rdf-tabular. But I'll look into it further; this is an implementation issue and not a specification issue, at least so far.

gkellogg commented 3 years ago

Actually, I found a fix entirely within RDF.rb. If you do a gem update rdf, you should get version 3.1.5 of the "rdf" gem (RDF.rb), which allows you to use the 'rdf' command to parse your metadata and related CSV files with full UTF-8 URLs.

jakubklimek commented 3 years ago

@gkellogg Thanks, this indeed helped with the "URI must be ascii only" part. Any thoughts on the percent-decoding part?

gkellogg commented 3 years ago

You shouldn't need to percent encode your URLs with the update I made. If you natively have percent-encoded URLs, then those should be used as is, as they are certainly legal URLs.

If using {+value} on an unencoded URL results in an encoded URL, then that is an issue with the addressable gem, which implements the URI template logic. If that's what you're seeing, then the solution is probably to URI encode before using the template, and then URI decode the result.

If you could give me a small test case, that would help.

jakubklimek commented 3 years ago

@gkellogg I am still working with rdf serialize --input-format tabular https://data.mff.cuni.cz/soubory/číselníky/mzdové-třídy.csv-metadata.json. I am using {+ciselnik}. As you can see in https://data.mff.cuni.cz/soubory/číselníky/mzdové-třídy.csv, there is an unencoded https://data.mff.cuni.cz/zdroj/číselníky/mzdové-třídy in the first column. In the result of the command, I see https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/mzdov%C3%A9-t%C5%99%C3%ADdy, i.e. the percent-encoded URL. I could percent-decode the URLs afterwards, but it seems to me that outputting the IRIs percent-encoded goes against the recommendation to avoid it.

gkellogg commented 3 years ago

So, without rewriting the Addressable::Template library, the best I can do is to add an option to %-decode the result of all template operations. This is not a generic fix, as it is reasonable for URLs to contain percent sequences, which should not be decoded. But, if you know the data you're using, decoding all template URLs should do the job for you.
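To illustrate the caveat about percent sequences that should not be decoded, here is a hedged Python sketch of a more selective decode, loosely following the idea in RFC 3987 Section 3.2. It is a simplification, not what the rdf-tabular option does: it decodes only runs of percent-encoded non-ASCII bytes that form valid UTF-8, and leaves ASCII sequences such as %2F or %20 untouched. The function name uri_to_iri is hypothetical, and the additional character exclusions in RFC 3987 are not handled.

```python
import re

# Runs of percent-encoded bytes >= 0x80, i.e. candidate UTF-8 sequences.
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def uri_to_iri(uri: str) -> str:
    """Hypothetical helper: decode only percent-encoded non-ASCII UTF-8 runs."""

    def decode_run(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: keep the original percent-encoded text.
            return match.group(0)

    return _NON_ASCII_RUN.sub(decode_run, uri)
```

With this sketch, https://data.mff.cuni.cz/zdroj/%C4%8D%C3%ADseln%C3%ADky/sekce becomes https://data.mff.cuni.cz/zdroj/číselníky/sekce, while a value such as a%2Fb%20c is returned unchanged, so an encoded slash does not silently turn into a path separator. One limitation: a run that mixes valid and invalid byte sequences is left entirely encoded.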

I created an issue on Addressable, which may get some attention.

gkellogg commented 3 years ago

Do a gem update of rdf-tabular to 3.1.1 and try the command using the --decode-uri option.

jakubklimek commented 3 years ago

@gkellogg Thanks, that works for me for now.

To provide a little bit of context, we are trying to push CSVW as an alternative to JSON-LD in data specifications for those who are not comfortable publishing or consuming JSON files.

gkellogg commented 3 years ago

> Do a gem update of rdf-tabular to 3.1.1 and try the command using the --decode-uri option.

I actually think that the --decode-uri option should probably not be necessary, and it is safe to simply decode all output from URI Templates, as any existing %-sequences would have been doubly encoded as part of that process.

The test suite should be updated, at some point, with further I18N use cases such as this.