httprb / http

HTTP (The Gem! a.k.a. http.rb) - a fast Ruby HTTP client with a chainable API, streaming support, and timeouts
MIT License
3.01k stars 321 forks source link

Strip brackets from IPv6 addresses before passing them to a socket. #731

Closed jeraki closed 1 year ago

jeraki commented 1 year ago

Fixes #730.

I took my best guess at how you'd want this change implemented and tested, but I'm happy to revise it based on feedback. Thanks!

tarcieri commented 1 year ago

A few thoughts on this:

  1. It'd be good to find the spec for these sorts of bracketed addresses in URIs, so we know we're implementing them correctly. I've never seen them before and I'm not sure what the behavior is supposed to be (e.g. are they IPv6 only or do they also work with IPv4? Do DNS names work?)
  2. The regex should match what the spec says about what's allowed inside the brackets, i.e. the inner capture group should be for e.g. an IPv6 address (or whatever else is allowed), and not just "anything"
  3. It'd be good to do this handling at the time the URI is processed (i.e. as early as possible) so we don't have bracketed host strings flying around that need to be stripped in various places. Perhaps inside of the constructor for lib/http/uri.rb?
tarcieri commented 1 year ago

Apparently the bracketed hosts are defined in RFC3896 § 3.2.2:

A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax.

So only IPv6 addresses should be allowed.

Here is a regex for IPv6 addresses:

https://gist.github.com/syzdek/6086792

(
([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|          # 1:2:3:4:5:6:7:8
([0-9a-fA-F]{1,4}:){1,7}:|                         # 1::                              1:2:3:4:5:6:7::
([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|         # 1::8             1:2:3:4:5:6::8  1:2:3:4:5:6::8
([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|  # 1::7:8           1:2:3:4:5::7:8  1:2:3:4:5::8
([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|  # 1::6:7:8         1:2:3:4::6:7:8  1:2:3:4::8
([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|  # 1::5:6:7:8       1:2:3::5:6:7:8  1:2:3::8
([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|  # 1::4:5:6:7:8     1:2::4:5:6:7:8  1:2::8
[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|       # 1::3:4:5:6:7:8   1::3:4:5:6:7:8  1::8  
:((:[0-9a-fA-F]{1,4}){1,7}|:)|                     # ::2:3:4:5:6:7:8  ::2:3:4:5:6:7:8 ::8       ::     
fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|     # fe80::7:8%eth0   fe80::7:8%1     (link-local IPv6 addresses with zone index)
::(ffff(:0{1,4}){0,1}:){0,1}
((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}
(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|          # ::255.255.255.255   ::ffff:255.255.255.255  ::ffff:0:255.255.255.255  (IPv4-mapped IPv6 addresses and IPv4-translated addresses)
([0-9a-fA-F]{1,4}:){1,4}:
((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}
(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])           # 2001:db8:3:4::192.0.2.33  64:ff9b::192.0.2.33 (IPv4-Embedded IPv6 Address)
)

Alternatively you could speculatively attempt to parse the captured inner address as an IPAddr and verify it's an IPv6 address.

jeraki commented 1 year ago

Thank you for the feedback and the useful references. I like the idea of moving this into HTTP::URI rather than needing to remember to strip the brackets at the point of use.

An open question for me is whether the bracket-stripped form should be part of HTTP::URI itself or whether it should be part of the underlying Addressable::URI. Consider that Addressable::URI expects the brackets for an IPv6 address:

bracketed = Addressable::URI.parse("http://[2606:2800:220:1:248:1893:25c8:1946]/foo?bar=baz")
bracketed.host # => "[2606:2800:220:1:248:1893:25c8:1946]"
bracketed.port # => nil

unbracketed = Addressable::URI.parse("http://2606:2800:220:1:248:1893:25c8:1946/foo?bar=baz")
unbracketed.host # => "2606:2800:220:1:248:1893:25c8"
unbracketed.port # => 1946

Should we maintain the bracketed host in the underlying Addressable::URI and have the host method on HTTP::URI strip the brackets, or should we mutate the Addressable::URI after construction to remove the brackets? I'm not sure if mutating the Addressable::URI to something it didn't parse as would violate some intended invariant.

I notice that currently host-related methods are just delegated to the Addressable::URI with def_delegators :@uri, :host, :normalized_host, :host=, so presumably we'd define real methods for the two getters that do the stripping rather than delegating directly. Unclear if the setter should be changed in any way.

tarcieri commented 1 year ago

@jeraki I would suggest changing HTTP::URI to eagerly strip the brackets from IPv6 addresses and caching it as a @host variable which the accessor methods can be changed to use

jeraki commented 1 year ago

Using IPAddr looks cleaner than using our own regex. A nice property of it is that if you call to_s on an IPv6 IPAddr it strips the brackets for you:

ip_bracketed = IPAddr.new("[2606:2800:220:1:248:1893:25c8:1946]")
ip_bracketed.to_s # => "2606:2800:220:1:248:1893:25c8:1946"

ip_unbracketed = IPAddr.new("2606:2800:220:1:248:1893:25c8:1946")
ip_unbracketed.to_s # => "2606:2800:220:1:248:1893:25c8:1946"

So the logic for host could be:

def host
  begin
    ip = IPAddr.new(uri.host)
    return ip.to_s if ip.ipv6?
  rescue IPAddr::Error
  end
  uri.host
end

Also,Addressable::URI#to_s is used in HTTP::URI's to_s, to_str, and inspect methods, which don't strip brackets. I'm not sure what these are used for in practice, but given they're public methods it'd be a breaking change to do bracket-stripping in these methods. Maybe they don't need to be modified to match the bracket-stripping behavior of a new host method because there isn't a use case for them to change (and it wouldn't be necessary to fix #730), but it's unsatisfying to have inconsistency in how the host is represented between these methods.

Come to think of it, host is also already a public method via delegation, so changing host to strip brackets is also a breaking change. Should the bracket-stripping version be exposed as a new method with a different name?

tarcieri commented 1 year ago

I think it's fine to change the behavior. The current behavior is broken per #730, so changing it is a bugfix.

As to how to approach the setter and whether it should mutate the underlying Addressable::URI is a tricky question. I guess the setter should add brackets to the host being set on the underlying Addressable::URI if host= receives a valid IPv6 address? That is to say, it needs to update both @host and the inner URI, and the inner URI needs to be serialized with brackets.

jeraki commented 1 year ago

Okay, I pushed up a new commit that does the stripping via HTTP::URI as we discussed. Let me know what you think.

RuboCop was reporting a violation because the top-level describe block in the spec has too many lines, but that seems like a useless metric in this case, so I added it to the exclusions along with all the other specs. It's probably worth just turning that rule off for the whole spec directory.

jeraki commented 1 year ago

The only thing I'm still unsure about is whether to add logic to HTTP::URI#host= to make sure IPv6 addresses have brackets in the underlying Addressable::URI. In either case, the brackets will be stripped when accessing calling HTTP::URI#host or HTTP::URI#normalized_host, I'm just a little wary about unexpected effects from changing the internal representation. Addressable::URI.parse will only recognize an IPv6 address correctly if it's bracketed, and calling Addressable::URI#host on a value constructed this way will show the brackets in the string representation. But if you mutate it after the fact with Addressable::URI#host=, it's perfectly happy to accept an IPv6 without brackets, and subsequent calls to Addressable::URI#host will show it without the brackets, but that form couldn't have been instantiated correctly by the parse constructor.

tarcieri commented 1 year ago

@jeraki I mentioned that particular issue earlier.

I suggested the setters need to use the bracketed form of IPv6 addresses with Addressable::URI so the URLs serialize correctly, while internally caching the host with the brackets stripped.

That's another thing I think would be easier with eagerly updating instance variables, rather than trying to do it lazily.

I think if you have one private helper method, called from both the constructor and the setter, which extracts the instance variables out of Addressable::URI, you can have fairly clean logic for this, and also DRY out the logic for checking if an address is an IPv6 address.

jeraki commented 1 year ago

Sorry for the delay. I pushed a commit to process host and normalized_host eagerly and to ensure the underlying Addressable::URI has brackets when host= is used. Let me know if you are okay with my implementation or if there's something you'd like changed. Thank you!

tarcieri commented 1 year ago

Looks good now, thanks!

jeraki commented 1 year ago

Thanks for your guidance! When you get a chance, could you publish a new gem that includes this change?

tarcieri commented 1 year ago

Sure

cben commented 1 year ago

Conceptually, it's weird having brackets in host. They are not part of an IPv6 address by itself; the brackets are merely part of the URL syntax for IPv6 (needed to distinguish the colons inside the address from :port number).

However, the Addressable gem makes the distinction that #hostname is the address by itself vs. #host is the bracketed-if-IPv6 part of the #authority portion of the URL.
With that context, this is consistent :+1:.
(But if you can switch to dealing with hostname (?) it might simplify things)

tarcieri commented 1 year ago

@cben that sounds like a great point. Could one of you open a PR to switch to revert this and switch to #hostname?

Can't believe there are 3 different methods to get the hostname