Non ASCII characters are not allowed in the path

asok commented 2 years ago

Hi, I'm getting this error:

irb(main):001:0> require 'uri'
=> false
irb(main):002:0> URI::HTTPS.build(host: 'example.com', path: '/łódź')
Traceback (most recent call last):
       10: from /Users/asokolnicki/.rubies/ruby-2.6.3/bin/irb:23:in `<main>'
        9: from /Users/asokolnicki/.rubies/ruby-2.6.3/bin/irb:23:in `load'
        8: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        7: from (irb):2
        6: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/http.rb:62:in `build'
        5: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:137:in `build'
        4: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:137:in `new'
        3: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:193:in `initialize'
        2: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:807:in `path='
        1: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:761:in `check_path'
URI::InvalidComponentError (bad component(expected absolute path component): /łódź)

I thought that the path component is allowed to contain any UTF-8 character.

noraj commented 1 year ago

cf. https://github.com/ruby/webrick/issues/110, especially this comment https://github.com/ruby/webrick/issues/110#issuecomment-1436135222.

This is because URI doesn't support RFC 3987 (Internationalized Resource Identifier (IRI)).

jeremyevans commented 1 year ago

No, a URI path is not allowed to contain arbitrary UTF-8 characters. Non-ASCII UTF-8 characters must be percent-encoded, and even some ASCII characters must be percent-encoded. It's true that the URI library doesn't support IRIs. That's not a bug; there should probably be a separate library for IRIs.
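For reference, a minimal sketch of the percent-encoding workaround for the original example. The `encode_path` helper is hypothetical (it is not part of the uri library); it relies on `ERB::Util.url_encode` from the standard library and assumes the path is a UTF-8 string whose segments are separated by `/`.

```ruby
require 'uri'
require 'erb'

# Hypothetical helper, not part of the uri library.
# Percent-encodes each path segment while leaving the "/" separators intact.
def encode_path(path)
  path.split('/', -1).map { |segment| ERB::Util.url_encode(segment) }.join('/')
end

uri = URI::HTTPS.build(host: 'example.com', path: encode_path('/łódź'))
puts uri
```

With the path encoded this way, `build` should no longer raise `URI::InvalidComponentError`, because the component contains only ASCII characters.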

noraj commented 1 year ago

IRIs were not folded into the URI specification in order to preserve backward compatibility, but IRI extends URI.

RFC 3987, section 3:

"IRIs are meant to replace URIs in identifying resources for protocols, formats, and software components that use a UCS-based character repertoire."

Ruby has extensive Unicode support (in strings, regexps, etc.), so the lack of Unicode support in the uri module is an exception.

If changing the behavior of the default parse method is not desirable, the uri module could add a :unicode / :iri (or similarly named) option to parse, or an alternative parse_iri method that accepts an IRI, maps it to a URI, and passes the result to the classic parse method, which handles only ASCII URIs. RFC 3987 explains how to map an IRI to a URI and a URI to an IRI.
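A rough sketch of what such a method might look like. `parse_iri` here is hypothetical and does not exist in the uri library. It implements only a simplified form of the RFC 3987 section 3.1 mapping: each non-ASCII character is replaced by the percent-encoded bytes of its UTF-8 form. It assumes the host is already ASCII; converting internationalized host names and Unicode normalization are left out.

```ruby
require 'uri'

# Hypothetical method, not part of the uri library.
# Simplified IRI-to-URI mapping: every non-ASCII character becomes the
# percent-encoded bytes of its UTF-8 representation; ASCII is left untouched.
# Assumption: the host part is already ASCII (no IDNA conversion is done).
def parse_iri(iri)
  ascii = iri.encode('UTF-8').gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
  URI.parse(ascii)
end

uri = parse_iri('https://example.com/łódź?q=żółw')
puts uri.path
puts uri.query
```

Because the mapping only touches non-ASCII characters, a string that is already a valid URI passes through unchanged and is handled by the classic parse method as before.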

Since IRI extends URI and is deeply linked to it, I would rather see IRI support added as new methods in the URI module than as a separate module only for IRIs. But that's just my POV, and I may not be the best suited or most experienced person here.

"That's not a bug"

I agree, it's more of a feature request to support modern usage, where Unicode is widespread.

mkasberg commented 1 year ago

Just ran into this today... noraj's comments above seem spot-on to me.

ruby / uri

Non ASCII characters are not allowed in the path #40