python-hyper / hyper

HTTP/2 for Python.
http://hyper.rtfd.org/en/latest/
MIT License
1.05k stars 191 forks source link

HTTPHeaderMap naively splits on commas #314

Open vfaronov opened 7 years ago

vfaronov commented 7 years ago

hyper.common.headers.canonical_form:

the header is split on commas unless for any reason it's a super-special snowflake (I'm looking at you Set-Cookie)

This seems to be inspired by text from RFC 7230 § 3.2.2, but that permits joining field-values with commas, not splitting them.

This is most readily obvious with the Date header:

import hyper
conn = hyper.HTTPConnection('nghttp2.org:443')
conn.request('GET', '/')
print(conn.get_response().headers['Date'])
# prints: [b'Sat', b'04 Mar 2017 18:23:26 GMT']

But there’s a wealth of other examples:

WWW-Authenticate: Digest realm="test", qop="auth", nonce="12345"
Cache-Control: private, no-cache="Some-Header,Other-Header"
Warning: 299 - "something may be wrong, be careful"

One could do b','.join(headers[name]) to get back to a value that is mostly correct with regard to commas (modulo Set-Cookie and except for lost whitespace). But maybe the API would be better if HTTPHeaderMap.__getitem__ returned a correct value by default, while splitting on commas were an explicit operation.

See also Werkzeug’s header parsing utilities, though they are far from 100% correct as well.

Lukasa commented 7 years ago

Yeah, so we should definitely aim to do better here. Ideally, we'd adjust the API to only canonicalise headers we know that allow a list of values (which is what that section of the specification allows for). I'd be entirely happy to merge a change that restricts the canonicalisation to only the values we know can safely be canonicalised (which will cover all of the internal use-cases we need).

How does that sound?

vfaronov commented 7 years ago

That sounds OK, but there’s a further tradeoff to be made.

Ideally, we'd adjust the API to only canonicalise headers we know that allow a list of values (which is what that section of the specification allows for).

See, Cache-Control and Warning are lists of values alright (as in RFC 7230 § 7) — problem is that the values themselves can contain commas, mostly by way of the RFC 7230 quoted-string.

canonical_form could be made to take quoted-string into account instead of naively splitting on comma, but that may involve more code and/or a performance hit: Werkzeug does this by invoking an undocumented state machine in urllib.

Or the whitelist could be restricted to those headers that really don’t allow commas beyond the top-level list. But note that even something as harmless looking as Transfer-Encoding can theoretically have a value of my-coding; param="foo, bar".

Or decide that it’s OK to break marginal things (like the above my-coding) and simply document how to get the full value if necessary.

This is also a question of how much public API to break. You could omit Cache-Control from the whitelist (there are servers in the wild that send things like the one I showed), but then it breaks for people who depend on headers['cache-control'] giving them nicely split directives.

Lukasa commented 7 years ago

Yeah, we'd probably have to handle the fact that quoted strings exist. As a halfway house we could probably initially only canonicalise when we don't have a quoted string at all, and when we do just leave it alone. How does that sound?

vfaronov commented 7 years ago

only canonicalise when we don't have a quoted string at all

Do you mean “when the given value does not contain quoted strings” or “when the header’s grammar does not allow quoted strings”?

Lukasa commented 7 years ago

When the given value does not contain quoted strings.

vfaronov commented 7 years ago

I guess that would work (in conjunction with a whitelist that excludes things like Date or WWW-Authenticate), but the API would be kinda wonky. I mean, Cache-Control: public, max-age=86400 gets split, but Cache-Control: public, max-age="86400" doesn’t?..

(Please note that I don’t have a stake here. I only raised this issue because it caught my eye.)

Lukasa commented 7 years ago

@vfaronov Yeah, so it's definitely an annoyance, but that's the simplest approach to it that doesn't involve more aggressive checking. For example, a thing we could do is use a regular expression to try to detect "commas inside double quotes", but at that point we should just write a parser and I think by that time we're over-solving this problem. :grin:

persiaAziz-zz commented 7 years ago

any patch available for this one??? :grin:

Lukasa commented 7 years ago

Nope.

ech0-py commented 7 years ago

@persiaAziz you can always make some monkey-patch; for this case I used dict from requests lib:

from requests.structures import CaseInsensitiveDict

response = self.connection.get_response()
response.headers = CaseInsensitiveDict(response.headers.iter_raw())

CaseInsensitiveDict have the same features, but not splits by comas