ruby / uri

URI is a module providing classes to handle Uniform Resource Identifiers
https://ruby.github.io/uri/
Other
78 stars 42 forks source link

Update URI::Generic.build/build2 to use RFC3986_PARSER #105

Closed gareth closed 1 month ago

gareth commented 2 months ago

[Description below cross-posted from https://bugs.ruby-lang.org/issues/19266]


In June 2014, uri/common was updated to introduce a RFC3986-compliant parser (URI::RFC3986_PARSER) as an alternative to the previous RFC2396 parser, and common methods like URI() were updated to use that new parser by default. The only methods in common not updated were URI.extract and URI.regexp which are marked as obsolete. (The old parser was kept in the DEFAULT_PARSER constant despite it not being the default for those methods, presumably for backward compatibility.)

However, similar methods called on URI::Generic were never updated to use this new parser. This means that methods like URI::Generic.build fail when given input that succeeds normally, and this also affects subclasses like URI::HTTP:

$ pry -r uri -r uri/common -r uri/generic

[1] pry(main)> URI::Generic.build(host: "underscore_host.example")
URI::InvalidComponentError: bad component(expected host component): underscore_host.example
from /Users/gareth/.asdf/installs/ruby/3.1.3/lib/ruby/3.1.0/uri/generic.rb:591:in `check_host'

[2] pry(main)> URI::HTTP.build(host: "underscore_host.example")
URI::InvalidComponentError: bad component(expected host component): underscore_host.example
from /Users/gareth/.asdf/installs/ruby/3.1.3/lib/ruby/3.1.0/uri/generic.rb:591:in `check_host'

[3] pry(main)> URI("http://underscore_host.example")
=> #<URI::HTTP http://underscore_host.example>

URI::Generic.new allows a configurable parser positional argument to override the class' default parser, but other factory methods like .build don't allow this override.

Arguably this doesn't cause problems because at least in the case above, the URI can be built with the polymorphic constructor, but having the option to build URIs from explicit named parts is useful, and leaving the outdated functionality in the Generic class is ambiguous. It's possible that the whole Generic class and its subclasses aren't intended to be used directly how I'm intending here, but there's nothing I could see that suggested this is the case.

I'm not aware of the entire list of differences between RFC2396 and RFC3986. The relevant difference here is that in RFC2396 an individual segment of a host (domainlabels) could only be alphanum | alphanum *( alphanum | "-" ) alphanum, whereas RFC3986 allows hostnames to include any of ALPHA / DIGIT / "-" / "." / "_" / "~". It's possible that other differences might cause issues for developers, but since this has gone over 9 years without anyone else caring about this, this is definitely not especially urgent.

gareth commented 2 months ago

There are other issues in the Ruby bug tracker such as #19756 where the discussion immediately turns to whether underscores should be allowed in hostnames.

For me that's not the issue here. Ruby has supported underscores in URI hostnames since 2014, just not with the generic build methods, and I can't see a way that this specific difference is justified.

jeremyevans commented 2 months ago

Failures are not your fault. I think this repo needs something like https://github.com/ruby/webrick/commit/6f412123e368c5cc1059127993c793ef44518409 and https://github.com/ruby/webrick/commit/a27d7ed45f630d9ee9a7e8cbd0542542e36a3219

hsbt commented 2 months ago

I fixed CI and rebased this PR.