ruby / uri

URI is a module providing classes to handle Uniform Resource Identifiers
https://ruby.github.io/uri/

Feature Request: Support for IDNA #76

Open HoneyryderChuck opened 1 year ago

HoneyryderChuck commented 1 year ago

Currently, the biggest "missing feature" in the stdlib Ruby URI/DNS resolution supply chain is IDNA support. addressable, the open-source alternative to stdlib uri, has some support for it, which is, I believe, the main reason why it is a transitive dependency of many other gems (its other feature, URI templates, is just not as compelling).

This is a proposal for a way to solve this.

punycode

IDNA domains are translated to their punycode representation in order to be used in DNS queries (which require ASCII domains). Ruby's stdlib does not have a punycode converter, so this is where it should start IMO. For that, I propose either a new punycode stdlib gem (bundled?) or making its functionality available as a submodule of URI in the uri stdlib:

# as a bundled gem
require "punycode"
Punycode.encode("l♥️h.ws") #=> "xn--lh-t0xz926h.ws"
Punycode.decode("xn--lh-t0xz926h.ws") #=> "l♥️h.ws"

# as internal functionality
require "uri/punycode"
URI::Punycode.encode(...
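For reference, here is a minimal pure-Ruby sketch of the Punycode encoding step defined in RFC 3492. The `PunycodeSketch` module and its `to_ascii_host` helper are hypothetical names used for illustration only; they are not the proposed `Punycode` / `URI::Punycode` API. The sketch covers only the encoding algorithm and assumes the input is already lowercased and valid: it performs none of the IDNA 2003/2008 mapping or validation discussed in this issue.

```ruby
# Hypothetical illustration of RFC 3492 (Punycode) encoding; not an IDNA implementation.
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128

  module_function

  # Maps a digit value to its character: 0..25 -> "a".."z", 26..35 -> "0".."9".
  def digit(d)
    (d < 26 ? 97 + d : 22 + d).chr
  end

  # Bias adaptation function from RFC 3492, section 6.1.
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + (BASE - TMIN + 1) * delta / (delta + SKEW)
  end

  # Encodes a single label (no dots), without the "xn--" prefix.
  def encode_label(label)
    cps = label.codepoints
    basic = cps.select { |c| c < 128 }
    out = basic.pack("U*")
    h = b = basic.size
    out << "-" if b > 0
    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    while h < cps.size
      # Next smallest code point that has not been handled yet.
      m = cps.select { |c| c >= n }.min
      delta += (m - n) * (h + 1)
      n = m
      cps.each do |c|
        delta += 1 if c < n
        next unless c == n
        # Emit delta as a generalized variable-length integer.
        q = delta
        k = BASE
        loop do
          t = (k - bias).clamp(TMIN, TMAX)
          break if q < t
          out << digit(t + (q - t) % (BASE - t))
          q = (q - t) / (BASE - t)
          k += BASE
        end
        out << digit(q)
        bias = adapt(delta, h + 1, h == b)
        delta = 0
        h += 1
      end
      delta += 1
      n += 1
    end
    out
  end

  # Hypothetical helper: encodes each non-ASCII label and adds the ACE prefix.
  def to_ascii_host(host)
    host.split(".").map { |l| l.ascii_only? ? l : "xn--" + encode_label(l) }.join(".")
  end
end

PunycodeSketch.to_ascii_host("b\u00FCcher.example") # => "xn--bcher-kva.example"
```

A real implementation would also need the decoder, overflow and length checks, and the IDNA mapping and validation rules on top of this.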

implementation

addressable, as well as other (mostly abandoned) gems, support the IDNA 2003 standard. You'll find both libidn-based extensions and pure Ruby ports. IDNA 2003 has since been superseded by the IDNA 2008 standard (which essentially supports all the more recent Unicode versions, plus some edge cases). While I think that a pure Ruby implementation should be entertained at some point, I think that for now Ruby would do best to adopt the most standardized implementation around, and that's libidn2: it's used by most other network libraries, including curl, and distributed as a package for most (all?) OSes supported by Ruby.

Integration of libidn2 can be done by either a C extension, or FFI (I'm the maintainer of idnx, which already FFI's into libidn2 and winnls for windows). The advantage of the latter is that it works OOTB for java. The disadvantage may be performance (?), for which a C extension may be a better fit, but then we'd need to know whether java stdlib contains an equivalent of IDNA conversion supporting IDNA 2008.

This means that libidn2 would become a dependency when building Ruby. It could be dealt with, however, as an optional dependency, like openssl is: when it is available, URI::Punycode is defined, and when it isn't, URI::Punycode is not. Most Ruby installers could then opportunistically install the package as well, just as is already done with openssl.
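As a rough sketch of what this optional-dependency behaviour could mean for calling code: `URI::Punycode` does not exist in today's uri gem (it is only the name proposed above), and `ascii_host` is a hypothetical helper. The idea is simply to feature-detect the constant before using it.

```ruby
require "uri"

# Hypothetical helper: converts a host to ASCII only when the proposed
# (currently non-existent) URI::Punycode constant has been defined.
def ascii_host(host)
  # Plain ASCII hosts never need conversion.
  return host if host.ascii_only?

  unless defined?(URI::Punycode)
    raise NotImplementedError, "IDNA support is not available in this Ruby build"
  end

  URI::Punycode.encode(host)
end

ascii_host("example.com") # => "example.com"
```

On a Ruby built without the library, a non-ASCII host would raise here, much like code relying on openssl fails on builds without it.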

(addressable is aware of its lack of IDNA 2008 support, and is working on it by FFI'ing into libidn2 as well).

API

uri could then transparently handle translation internally. I propose that, beyond the proposal made above, nothing else in the public API changes. Instead URI::Generic would support translation OOTB on building objects:

uri = "https://l♥️h.ws"
uri = URI(uri)
uri.host #=> "l♥️h.ws"
uri.hostname #=> "xn--lh-t0xz926h.ws"

# the example above is inspired by how uri already handles IPv6 addresses
uri = URI("https://[::1]")
uri.host #=> "[::1]", cannot be used in Socket.new(host, port)
uri.hostname #=> "::1", can be used in Socket.new(host, port)

This could then be used internally in the resolv library, before issuing the DNS query.
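For comparison, this is how the stdlib behaves today, as a small runnable check. The IPv6 part mirrors the example above; the second part shows that the default parser currently rejects non-ASCII input outright, which is the gap the proposal targets. The variable names are arbitrary.

```ruby
require "uri"

# Existing behaviour: #host keeps the brackets, #hostname strips them.
uri = URI("https://[::1]:8443/")
uri.host     # => "[::1]"
uri.hostname # => "::1"

# Existing behaviour: a URI string containing non-ASCII characters is
# rejected before any host handling takes place.
rejected = begin
  URI("https://l\u2665h.ws")
  false
rescue URI::InvalidURIError
  true
end
rejected # => true
```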

byroot commented 1 year ago

I think Ruby-core would rather avoid depending on more system libraries. It would be preferable to have a pure ruby implementation.

@nobu @hsbt any opinion here?

hsbt commented 1 year ago

> I think Ruby-core would rather avoid depending on more system libraries. It would be preferable to have a pure ruby implementation.

Agreed, and we don't have ffi gem on ruby/ruby.

HoneyryderChuck commented 1 year ago

> Agreed, and we don't have ffi gem on ruby/ruby.

But there is fiddle, right? Couldn't it be used for the same purpose (minus JRuby support)? Nevertheless, I'd expect it to be a C extension regardless.

> I think Ruby-core would rather avoid depending on more system libraries. It would be preferable to have a pure ruby implementation.

I understand the concern, hence why I'd make this an "optional dependency" a la openssl, i.e. it's either there and we have punycode, or we don't and we don't have it. libidn2 should have the same type of platform availability as openssl. And while there is maintenance effort in carrying this dependency forward and one should avoid it at all costs, other initiatives requiring external package dependency are already being considered, so one should balance the tradeoff between maintenance overhead vs. cost of not having the feature.

That being said, I also agree that having a pure ruby punycode implementation would be the best, but there isn't one yet.

byroot commented 1 year ago

> it's either there and we have punycode, or we don't and we don't have it.

I know there is precedent for this, but this is very much a last resort thing. When you build Ruby without openssl or libyaml, it's not really "Ruby" given that the vast majority of code out there won't work. So I think we'd rather avoid creating more of these situations.

> but there isn't one yet.

https://github.com/knu/ruby-domain_name/blob/c64a59027939aa34e1f5f0efc5cb654d73ccb966/lib/domain_name/punycode.rb could be used as a start.

But before anything else, I think we'd need a 👍 on what the new API should look like. After that, work can be done. The spec isn't trivial, but it's not rocket science either, and it's easy to unit test.

HoneyryderChuck commented 1 year ago

The punycode parser you linked is IDNA 2003 compliant, not 2008. I can't evaluate how much effort "upgrading" it would take. But I agree, let's wait for more input.

byroot commented 1 year ago

> parser you linked is IDNA 2003 compliant, not 2008.

I know. Hence why I said it could be used as a start.

skryukov commented 1 year ago

Hey, I was trying to find an implementation of IDNA 2008 on pure Ruby for my project, and since I couldn't find anything, I wrote a new gem https://github.com/skryukov/uri-idna 🙃

I was inspired by @HoneyryderChuck's idea to put all IDNA-related functionality inside URI. It would be cool to hear feedback from you :heart:

HoneyryderChuck commented 1 year ago

@skryukov massive effort! thank you for this 💪

I haven't done due diligence yet, but can you confirm that your library tests against standard conformance testing examples, like this one: http://www.unicode.org/reports/tr46/#Conformance_Testing ?

Besides that, integration with the URI lib would be just a matter of hooking into URI::IDNA.to_ascii(host, uts46_transitional: true) (I guess we'd want transitional mode enabled by default)?

I'll defer to @hsbt for the details of potential integration in the ruby standard library.

skryukov commented 1 year ago

Hey, @HoneyryderChuck

Yup, it conforms to all tests from UTS46 (the spec file is here). I also manually added some tests for IDNA 2008 rules, so if you know of a full IDNA 2008 testing suite, I would love to give it a spin.

I don't mind changing the API and/or defaults to better suit current needs; almost all rules are configurable (here is a list of options).

Also note that the gem might differ a bit from libidn2. For example, libidn2's toUnicode version doesn't validate the result:

idn2 -d xn--fullstop-rm3g.us
full。stop.us
URI::IDNA.to_unicode("xn--fullstop-rm3g.us")
#<URI::IDNA::InvalidCodepointError: Codepoint U+3002 at position 5 of "full。stop" not allowed>

skryukov commented 1 year ago

@hsbt is there anything I can do to help get this issue going?