Broken routing for urlencoded paths

tornadoweb / tornado

Tornado is a Python web framework and asynchronous networking library, originally developed at FriendFeed.

http://www.tornadoweb.org/

Apache License 2.0

21.75k stars 5.5k forks source link

Broken routing for urlencoded paths #2548

Open spaceone opened 5 years ago

spaceone commented 5 years ago

self.request.path is the raw urlencoded path from the HTTP start line (i.e. it doesn't even normalize anything and takes whatever the client gives). The router PathMatches uses this to match the path against this.

Having a route definition like:

application = Application([
(r"/(cn=.*)/", MainHandler),
])

fails for /cn%3Dfoo/ and /cn%3dfoo/ but not for /cn=foo/.

A workaround is:

application = Application([
(r"/(cn(?:=|%3D|%3d).*)/", MainHandler),
])

In general I think it's wrong that request.path contains the urlencoded form of the path. It should be decoded. But it's probably an issue for backwards compatibility? At least the routing component could decode the path before applying it to the routes!

bdarnell commented 5 years ago

This is related to #89. The rules for urlencoding are complicated and it's not correct to simply urldecode the whole path. For example, the urls http://example.com/foo/bar and http://example.com/foo%2Fbarare two distinct urls (a relative URLbaz.pngwould resolve tohttp://example.com/foo/baz.pngin the former andhttp://example.com/baz.png` in the latter).

According to RFC 3987 (IRIs), = is a reserved character and should be left in its urlencoded form. The WHATWG URL spec appears to disagree (= must be percent-encoded in a userinfo context but not in a path context).

I agree it's annoying to see percent-encoded values in the routing layer (especially for non-ASCII characters), but the fix is not straightforward so it may be best to avoid using characters like = whose encoding is complex or ambiguous. Or you could implement your own variant of PathMatches that operates on the decoded form if you don't care about the more esoteric special cases that can arise.

spaceone commented 2 years ago

https://datatracker.ietf.org/doc/html/rfc3987#section-5.3.2.3 says that at least the unreserved (ALPHA / DIGIT / "-" / "." / "_" / "~") characters should be decoded when comparing. I think this is not even done by Tornado.

Hm, I think (not sure yet) I would argue that decoding should also be done for the reserved (":" / "/" / "?" / "#" / "[" / "]" / "@" "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=") characters if the URI components are already splitted (which is the case here, as we look only at the path component?!). But this is not explicitly stated in the RFC.

Otherwise a path foo%7Ebar would equal foo~bar but foo!bar would not equal foo%21bar and the equality offoo>bar and foo%3Ebar is not even defined.

bdarnell commented 2 years ago

says that at least the unreserved (ALPHA / DIGIT / "-" / "." / "_" / "~") characters should be decoded when comparing. I think this is not even done by Tornado.

Correct. But it's never mattered because these characters are never encoded in the first place (with the exception of tilde, which I'm pretty sure was considered reserved in older specs).

if the URI components are already splitted (which is the case here, as we look only at the path component?!)

That's an appealing argument, although I think the way that relative URI processing treats "/" and "%2F" differently is a counterexample.