Discussion: Capitalize the header names of outgoing headers.

futursolo commented 7 years ago

Although, as per RFC 2616, All the HTTP/1.1 headers should be case-insensitive. However, as a general practice, most of the major web servers have their headers capitalized. Also, some broken HTTP/1.1 implementations do not recognize some specific headers(e.g.: Server header for Apache Benchmark) if they are not properly capitalized. It would be better if the header can be capitalized.

Performance Concerns: Compared to str.lower(), str.title() is slower. However, this can be migrated by adding an LRU Cache into the header capitalization process to solve the problem.

Lukasa commented 7 years ago

I strongly recommend making it possible to emit headers with any capitalisation, while making the current behaviour the default.

The reality is that HTTP implementations that require title cased headers are wrong. Their inability to tolerate them, forcing other implementations to psychically tolerate their preferences, is a bad behaviour and should be punished. However, it should clearly be possible to overrule that behaviour if needed to make an application work.

Note that the "capitalise as a matter of course" servers and clients are on the way out. HTTP/2 forcibly lowercases header names, which means a lot of these bad implementations get fixed as they get support for HTTP/2. So this may be a problem that reduces in scope over time.

A final note: RFC 2616 is dead. Long live RFCs 723[0-3]!

sigmavirus24 commented 7 years ago

ITYM 723[0-5]

njsmith commented 7 years ago

I doubt there's any real performance difference either way between titlecasing and lowercasing, at least at the individual string level:

~$ pyperf timeit 'b"content-length".lower()'                            
.....................
Median +- std dev: 67.3 ns +- 2.3 ns
~$ pyperf timeit 'b"content-length".title()'
.....................
Median +- std dev: 75.2 ns +- 4.0 ns

The difference here is far smaller than the overhead of any caching scheme we could come up with.

In general I'd strongly prefer not to have a configuration knob here if at all possible, due to the costs to devs and users of maintaining, testing, documenting the thing...

From some research, it looks like title-casing does not solve all these problems. Searching finds a number of examples of implementations that require headers cased like dppPersonUUID, Sec-WebSocket-Key (notice the S in WebSocket), WWW-Authenticate (with Www-Authenticate being rejected), PageSpeed, X-Parse-OS-Version, ...

OTOH, it may solve enough of them that we can "get away with it". This appears to be golang's approach. (I'm not 100% sure that title-casing works better than lowercasing on average, but it seems likely. AFAICT nodejs switched from lowercase to title-casing.)

Some options:

Do nothing and wait for someone to run into this for real. That list of issues linked above suggests that someone probably will run into it eventually, but like @Lukasa notes, it's possible that in 2017 the issue is not as severe as it used to be and is getting better. Maybe we can get away with it.
Continue to unconditionally use lower-case internally; switch to unconditionally using title-case when writing out headers to the wire. This is a two-line change, and seems to incur a ~5% slowdown on CPython and ~3% on pypy. (OTOH the header validation stuff incurred a ~20% slowdown and we merged it anyway.) Probably we can get away with it, given golang's example.
Switch to always using title-case as our canonical casing everywhere, internally and externally. This would be relatively unintrusive in a sense, b/c it keeps the same architecture as now, but it would force library users to adapt to dealing with title-case in our public APIs, which is pretty annoying, esp. for users who want to do both HTTP/1.1 and HTTP/2. I can already see @Lukasa gearing up to write a cranky email about this and he's probably correct :-)
Canonicalize incoming headers, but don't touch outgoing headers. This is pretty tricky to get right though because there are a number of places where we need to internally examine and manipulate headers on the outgoing path, and all of those would need to re-do the canonicalization by hand (so e.g. right now we have code like for name, value in headers: if name == b"content-length": ..., and it would become for name, value in headers: if name.lower() == b"content-length": ...). So despite what one might expect this would almost certainly be slower than the current design, and with this approach there's a disturbingly high probability of us becoming one of those case-sensitive implementations because we forgot a .lower somewhere. OTOH this does provide the maximally flexible option for interoperability.

futursolo commented 7 years ago

@Lukasa I totally agree with your point of view. I also don't want to be tolerant to these broken implementations. Maybe, as time goes by, these implementations will die out. However, for now, these implementations still have some market share.

@njsmith Tornado has a cache for capitalized headers. I wrote a simple benchmark.

import time                                                                       
import tornado.httputil                                                           

start_time = time.time()                                                          
for _ in range(0, 100000):                                                        
    "content-length".title()                                                      

print("{:.5f}".format(time.time() - start_time))                                  

normalized_headers = tornado.httputil._normalized_headers                         

start_time = time.time()                                                          

for _ in range(0, 100000):  # Thanks to @njsmith for stating out this error.
    normalized_headers["content-length"]                                          

print("{:.5f}".format(time.time() - start_time))

When running the code above, my computer outputs 0.34973 and 0.00081.

I also suggest to keep using lowercase internally in favour of HTTP/2, however, add a mechanism at the output to justify the header names.

futursolo commented 7 years ago

Also for the special cases of capitalization in http headers(e.g.:ETag), I suggest to include a predefined, capitalized mapping for common http/1.1 headers for both performance purposes and solving these edge cases.

njsmith commented 7 years ago

Your benchmark has a typo -- you're only running the tornado test 2 times, instead of 100,000 times. After fixing that I get 0.02480 and 0.03074. I'm a bit confused at how your system is apparently >10x slower than my 2 year old laptop, but the relative speeds make more sense this way :-)

futursolo commented 7 years ago

@njsmith sorry, that's my fault. After fixing this problem, the result comes to 0.02667 and 0.01709.

In addition, the slowness comes from the console output, cause I first ran it in the Python interactive command line without suppressing the output(HOW LAZY I AM!). The result above is run after I fix this two problems.

However, my results contradicts yours. After I ran it for multiple times, I still cannot get your results.

My platform: CPython 3.6.0 on macOS 10.12.

njsmith commented 7 years ago

My numbers on your benchmark must have been a one-off glitch -- running it again now I get numbers more similar to yours.

However, the reason I didn't try repeatedly is that I'm still right ;-). Which is to say, it made sense to me, because the numbers matched what I was getting with timeit (which generally gives more meaningful results anyway). It turns out the reason is that I was checking bytes.title, while your benchmark uses str.title. bytes.title is more relevant here, and it's ~3x faster than str.title:

~$ pyperf timeit '"content-length".title()'
.....................
Median +- std dev: 223 ns +- 14 ns
~$ pyperf timeit 'b"content-length".title()'
.....................
Median +- std dev: 75.6 ns +- 1.5 ns

It's hard to compare against tornado.httputil directly, because they require str as input. For a simple dict lookup in a single-element dict, I get ~50 ns, regardless of whether the key is bytes or str, and for _normalized_headers I get ~125 ns. (AFAICT _normalized_headers should be just as fast as a dict... not sure where the discrepancy comes from!)

njsmith commented 7 years ago

@futursolo: also just to be clear... is this something you're actually running into now, or more of a theoretical concern?

futursolo commented 7 years ago

Actually, I am writing a toy server using h11 and asyncio.

Just curious about the design of the implementation, no offence.

njsmith commented 7 years ago

No offense taken! Just trying to figure out how urgent this is :-)

Lukasa commented 7 years ago

So, here's an alternative for you @njsmith: write a data structure that is basically a tuple, but allows you to access headers either case sensitively or case insensitively (where case insensitively is the default and is based on a call to .lower()).

The goal here is to allow all of h11's internals to continue to operate with confidence knowing that they'll see headers regardless of their case, but it also allows h11 to preserve any case that is provided to it for the sake of users who want that information.

Requests does a similar thing with its CaseInsensitiveDict, which is indexed by lowercased header names but has values that are a tuple of (original_name_case, value). That allows it to preserve header case when exposing to users, while nonetheless allowing for case-insensitive indexing. A similar thing could be done with a tuple-ish data structure if you were so inclined.

njsmith commented 7 years ago

Here's a really simple (maybe too simple) idea to throw into the mix: use tuples of (header, header_with_original_casing, value)

mhils commented 7 years ago

We are looking at implementing @mitmproxy's proxy core using sans-io and h11 would be a natural fit for that. I did some prototyping to test how h11 would work with our planned architecture. I'm really happy with how that turned out, yet header normalization/capitalization would definitely be blocker for us: for forensic reasons, we need to replay every non-malformed request as closely as possible to the upstream server.

It would be fantastic if h11 could end up with an implementation that follows @Lukasa's last proposal or @njsmith's (header, header_with_original_casing, value) idea. I do like the latter for being fairly low-level. FWIW, we would convert headers into mitmproxy's Header class anyway to have parity with h2.

njsmith commented 7 years ago

For the mitmproxy case, see also #47

njsmith commented 5 years ago

Update: @bradwood is our lucky winner in the non-compliant-header-processing lottery! Well... not so lucky, really. But he's trying to speak websocket to a server that's built-in to his Sky television box, and it's not working. He's using trio-websocket, which uses wsproto, which uses h11. And after some debugging, it turns out that the problem is that the television box's server only recognizes Upgrade: websocket, and not upgrade: websocket.

Interestingly, it does treat the Sec-WebSocket-Key and Sec-WebSocket-Version headers case-insensitively. So unconditionally title-casing outgoing headers would be enough to make this particular server happy. Which leaves the question of whether there are other servers out there that that do insist on the WebSocket casing, because for them, title-casing won't be sufficient...

While tracking this down, I found this interesting code/comment in the Go websocket library that Bradley was using for testing:

https://github.com/gorilla/websocket/blob/76e4896901ef89d253d69b528d3701dd146fe2ae/client.go#L196-L203

It looks like that code+comment was added in https://github.com/gorilla/websocket/commit/5ed2f4547d7b9c036541b27a67ecf54e41756997, which also switched from writing out the request manually to using the net/http library. The old code that wrote the headers was:

https://github.com/gorilla/websocket/blob/423912737d2c26382ab4b2647d796811d4efa00b/client.go#L46-L48

So... it seems possible that they've only encountered server's like @bradwood's, that are picky about Upgrade but not about WebSocket? The only reason they would actually need to use the special no-case-canonicalization API though is to keep the casing on WebSocket – net/http by default uses title-casing, so Upgrade would be capitalized in either case. Crystal ball is unclear.

It looks like the WHATWG fetch spec has switched to only presenting lowercase to JS code, but internally preserves whatever case was passed in and uses it on the wire: https://github.com/whatwg/fetch/pull/476

Cross-references: https://github.com/HyperionGray/trio-websocket/issues/72 , https://gitter.im/python-trio/general?at=5bd767733844923661ba27d9

bradwood commented 5 years ago

Interestingly the websocket RFC itself consistently uses Upgrade: and websocket with that exact case, so a naive server implementation might just assume that following that convention without reference to RFC2616 would work.

njsmith commented 5 years ago

@bradwood While I'm thinking, here's a hacky (and untested) monkeypatch that should hopefully tide you over: https://gist.github.com/njsmith/2bd3e54649379ecadf8a699c7bb8bcb3

So here's what I'm thinking so far.

The simplest approach is to title-case outgoing headers, like Go does (when people don't do hacks like that code in gorilla/websocket). That's easy to do without touching h11's public API at all, and it handles every case that's been reported so far. But it might not handle some cases in the future.

We could provide some way for people to manually request case-preservation for a specific header, e.g., pass in a custom CIBytes object or something. (Again, like gorilla/websocket does to work around Go's normal behavior.) The problem with this though is that the people who need it are probably going to be several abstraction layers away from h11 (like @bradwood is).

So, how hard would it be to do case preservation?

When we ingest headers, we normalize the case (in h11._headers.normalize_and_validate). We could stash the original header cases in some kind of lookaside table, like a dict or something. I was having some trouble figuring out where to keep this dict though (I guess on the event object? we'd have to refactor stuff a bit to make this available in the right places...). Just allocating the dict is also surprisingly expensive...

Another option would be to not normalize case in headers passed to us by the user. If we're doing this, then we'd have to make sure that every time we access the headers, we do the normalization on the fly. This makes me wince because it'd be easy to mess this up, but since h11 is a small library with a bounded scope it's probably possible. Specifically it looks like the places internally where we look at header names are: normalize_and_validate, get_comma_header, set_comma_header, Request._validate, write_headers. Also, if going this way I think we'd also want to lowercase when reading headers off the wire – we'd only want to preserve case for locally-provided headers.

Some timings:

Some timing measurements, not sure how meaningful they are

Just running `PYTHONPATH=$HOME/h11 python bench/benchmarks/benchmarks.py`: Current master: ``` 7136.9 requests/sec 7089.9 requests/sec 7263.6 requests/sec 7196.6 requests/sec 7254.9 requests/sec 7231.7 requests/sec 7156.6 requests/sec ``` Creating a dict mapping lowercase name → original name inside `normalize_and_validate` (and then throwing the dict away, just measuring the cost of creating it): ``` 6978.6 requests/sec 6994.7 requests/sec 7013.2 requests/sec 7039.9 requests/sec 7041.0 requests/sec 7028.8 requests/sec 6814.1 requests/sec ``` With preserving case of user-specified headers, and doing lowercasing on headers read off the wire, and whenever looking at user-specified headers (but I think this was missing a call to `.lower()` in `normalize_and_validate`): ``` 6879.6 requests/sec 6891.2 requests/sec 6979.2 requests/sec 6962.8 requests/sec 6931.5 requests/sec 6972.1 requests/sec 7004.6 requests/sec ``` Unconditional title-casing when writing to the wire: ``` 7006.7 requests/sec 7107.7 requests/sec 7004.1 requests/sec 7163.1 requests/sec 7150.7 requests/sec 7154.1 requests/sec 7160.4 requests/sec ```

Timings aren't the only important thing, but they are important to keep an eye on, since people (wisely or not) often make decisions about HTTP servers based on requests/sec microbenchmarks.

bradwood commented 5 years ago

The latter approach seems like a no brainer then unless my uninitiated mind is missing something. I assume you are just calling .title() before writing out the header.

It looks like the fastest, least intrusive, most isolated and easiest to maintain approach.

A small code change that can easily re refactored away down the line if needed

bradwood commented 5 years ago

Although breaks @mhils use case... In which case, maybe a switch to disable header munging?

njsmith commented 5 years ago

@mhils's use case is already broken by the existing header munging (see #47 for details), and I'm pretty confident that we don't want to let users see the original case on incoming headers regardless of what we do here (because there's literally nothing you can use that information for except to violate the spec, probably accidentally), and that also breaks @mhils's use case.

Also, a lot of what makes this tricky is that it's actually difficult for h11 to disable header munging, because h11 internally relies on the header munging for correct operation. So it's probably possible to add a switch, but it's not really any easier than any of the other solutions :-).

I suspect that eventually we will get pressured to support arbitrary case on outgoing headers in some way or another. But given that I still haven't thought of a simple way to do that, and the title-casing hack is so easy, I guess we should go ahead and do the title-casing thing for now. Tomorrow can worry about tomorrow :-).

The one concern that gives me pause is that if we do title-casing now, and then later switch to case-preserving, then in theory that could cause some compatibility issues, for code that passes in lowercase headers to h11 and is talking to a peer that requires title-casing – that works so long as h11 is fixing up the title-casing, but would break when h11 switched to case-preserving. However, code like this is presumably super rare, given that (a) 2.5 years into h11's lifespan we've only found 1 device that insists on title-casing, (b) the code that interacts with that device is actually passing in correctly-cased headers anyway.

njsmith commented 5 years ago

Oh and I'm currently a bit distracted by other things, but if someone wants to make a PR for title-casing outgoing headers that would be cool.

cmullendore commented 5 years ago

Copied from my own comment at https://github.com/containous/traefik/issues/466#issuecomment-535534743 :

It is not the responsibility of Go or any other language/platform to unilaterally decide how other systems should behave, including whether or not the systems being interacted with follow the case-insensitivity statement of the RFC.
A system/platform should not modify the information it is handling unless modification is the direct purpose of the request.
The code used to modify the casing of field names likely introduces unnecessary functionality that risks additional bugs (regardless of whether such bugs currently exist) and removing that (and in fact any) such code eliminates the risk that such code could introduce unintended behavior.
If the supposition that uses of the platform really are respecting the case-insensitivity nature of the current implementation and therefore the current behavior of case modification theoretically introduces no issues, then removing this behavior should also cause equally zero impact to users.
Removing this functionality provides platform users with the opportunity to pre-conform to HTTP/2 casing standards if they so choose, per https://tools.ietf.org/html/rfc7540:

A malformed request or response is one that is an otherwise valid sequence of HTTP/2 frames but is invalid due to the presence of extraneous frames, prohibited header fields, the absence of mandatory header fields, or the inclusion of uppercase header field names.

There is literally no reason for modification of this behavior to be resisted by this team.

njsmith commented 5 years ago

@cmullendore Hey, we appreciate contributions and feedback, but coming in and making your first post a bunch of shouting isn't cool. Please dial it down and read/listen before responding.

It is not the responsibility of Go or any other language/platform to unilaterally decide how other systems should behave, including whether or not the systems being interacted with follow the case-insensitivity statement of the RFC.

This is the entire point of h11 though. There are infinitely many ways you could potentially deviate from the HTTP spec, and it's impossible for an HTTP library to implement all of them, because it's, y'know, an HTTP library. And if you need some arbitrary non-HTTP functionality, there's nothing stopping you from opening a TCP socket and doing whatever you want.

That said, we do try to strike a balance between RFC conformance and real-world interoperability, and there do seem to be a lot of folks out there who are stuck working with broken servers they can't change, so we would like to provide some solution for them.

The code used to modify the casing of field names likely introduces unnecessary functionality that risks additional bugs (regardless of whether such bugs currently exist) and removing that (and in fact any) such code eliminates the risk that such code could introduce unintended behavior.

There is literally no reason for modification of this behavior to be resisted by this team.

Well, no. If you read the thread you're posted in, you'll see that the reason the decision is tricky is because preserving case is both riskier and slower than normalizing case. If we normalize, we deal with header case once up front; if we don't normalize, then we have to deal with header case repeatedly every time we touch the headers.

cmullendore commented 5 years ago

I appreciate the reply. I might suggest that putting a list of concepts together in numbered bullet points isn't really yelling... but... 🤷‍♂️

My thing is, we're really not talking about arbitrary functionality... we're talking about something that's really as simple as manipulating a dictionary... which is really, fundamentally, all that the headers object is. I think most people would say that calling .Add(obj) on a collection and having obj be modified as a part of that add would in any other situation categorized as at best unexpected and at worst unacceptable behavior... which my impression after reading the thread is pretty much the external viewpoint.

That said, we do try to strike a balance between RFC conformance and real-world interoperability, and there do seem to be a lot of folks out there who are stuck working with broken servers they can't change, so we would like to provide some solution for them.

Providing people with options is totally a great thing... presuming that 1) they are actively choosing to utilize those options and 2) the options meet their expectations. The problem with the above quote is that it presumes that everyone should want #1 so having the choice doesn't matter and the solution implemented (which no one can opt out of because of #1) is always going to be the right one. Neither of these two things are true... and the simple fact that this thread exists here and in many other places is a testament to that.

preserving case is both riskier and slower than normalizing case

I apologize but I don't get this because what people are literally asking for is that you do nothing. Though the implementation may vary (string copy, multiple reference, whatever), one would assume that not manipulating a string is easier than manipulating it.

if we don't normalize, then we have to deal with header case repeatedly every time we touch the headers

If we go back to the idea that everything should be case insensitive, which is the perspective being given... everyone else should be case insensitive and so manipulating the case shouldn't matter... then why is the requirement for case sensitivity... that is... the need to deal with it at all... such a requirement internally? If it should matter so little, why does it matter so much?

I'll admit maybe I was a bit abrupt... dropping in and just throwing down bullet points... but the perspective being given doesn't seem to align with user expectations/needs/requests and for some reason those expectations/needs/requests are being dismissed.

For kicks I glanced through the history of header.go (https://github.com/golang/go/commits/master/src/net/http/header.go) and I don't see anything that mentions why this was a good idea?

futursolo commented 5 years ago

I apologize but I don't get this because what people are literally asking for is that you do nothing. Though the implementation may vary (string copy, multiple reference, whatever), one would assume that not manipulating a string is easier than manipulating it.

Currently, h11 checks Content-Length and Transfer-Encoding so it can know what to do with the payload. If we do not .lower() or .title() all headers, it will be very difficult to look up values for these headers.

If we go back to the idea that everything should be case insensitive, which is the perspective being given... everyone else should be case insensitive and so manipulating the case shouldn't matter... then why is the requirement for case sensitivity... that is... the need to deal with it at all... such a requirement internally? If it should matter so little, why does it matter so much?

Well, should be. The real world is way less than perfect(Working the way it should be). What I would say is that Capitalising headers in HTTP/1.1 is a convention. But some people assume that it's how header names "should be". So they implemented their client/server that way.

So to increase compatibility for those incorrect implementations, we kind of have no choice but to capitalise them, which is the reason why I opened this issue in the first place.

njsmith commented 5 years ago

BTW, here's another idea for how to approach this:

Declare that going forward, we're going to always preserve case in outgoing headers – whatever case you use when passing headers into h11.Request and friends will remain visible in request.headers and be passed unchanged on the wire
Update all our code that looks at the headers to create temporary lowercased copies before doing anything with the header. According to the old benchmarks up above, this will only be about a 3-4% slowdown.
But, how can we be confident that we're catching all the places where we look at the headers and actually calling .lower() before doing any comparisons? It would be very easy for a bug to sneak in! Solution: use mypy to catch these bugs statically. For example, try running mypy --strict-equality on this code:

from typing import List, Tuple, cast

class UnnormalizedHeaderName:
    def lower(self) -> bytes:
        pass

HeadersType = List[Tuple[UnnormalizedHeaderName, bytes]]

def get_header(headers: HeadersType, needle: bytes):
    for name, value in headers:
        # mypy issues an error here:
        if name == needle:
            return value
        # but mypy says this is fine:
        if name.lower() == needle:
            return value

headers = [
    (cast(UnnormalizedHeaderName, b"My-Header"), b"my_header_value")
]

print(get_header(headers, b"my-header"))

Of course, to do this, we'd first have to get mypy running on h11. I know @pgjones wants to do this anyway... from a quick look, it wouldn't be too hard to get minimal stuff working. Currently the errors we get are:

h11/_util.py:96: error: Incompatible redefinition (redefinition with type "Callable[[Any, Any], Any]", original type "Callable[[Pattern[AnyStr], AnyStr, int, int], Optional[Match[AnyStr]]]")
h11/tests/test_util.py:5: error: No library stub file for module 'pytest'
h11/_events.py:28: error: Need type annotation for '_fields' (hint: "_fields: List[<type>] = ...")
h11/_events.py:29: error: Need type annotation for '_defaults' (hint: "_defaults: Dict[<type>, <type>] = ...")
h11/_events.py:92: error: Incompatible types in assignment (expression has type "None", base class "object" defined the type as "Callable[[object], int]")
h11/_events.py:291: error: Need type annotation for '_defaults'
h11/tests/test_state.py:1: error: No library stub file for module 'pytest'
h11/tests/test_headers.py:1: error: No library stub file for module 'pytest'
h11/tests/test_events.py:1: error: No library stub file for module 'pytest'
h11/tests/test_io.py:1: error: No library stub file for module 'pytest'
h11/tests/test_connection.py:1: error: No library stub file for module 'pytest'
h11/tests/test_connection.py:1: note: (Stub files are from https://github.com/python/typeshed)
h11/tests/test_against_stdlib_http.py:12: error: Cannot find module named 'urllib2'
h11/tests/test_against_stdlib_http.py:12: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports
h11/tests/test_against_stdlib_http.py:12: error: Name 'urlopen' already defined on line 22
h11/tests/test_against_stdlib_http.py:17: error: Cannot find module named 'SocketServer'
h11/tests/test_against_stdlib_http.py:17: error: Name 'socketserver' already defined (by an import)
h11/tests/test_against_stdlib_http.py:22: error: Cannot find module named 'SimpleHTTPServer'
h11/tests/test_against_stdlib_http.py:22: error: Name 'SimpleHTTPRequestHandler' already defined on line 58

Almost all of these are due to py2/py3 version differences, which can be solved by replacing our current try/except ImportError (which mypy doesn't understand) with if sys.version_info >= (3, 0): ... (which mypy does understand). This is also what's happening with the _util.py error. We'd also need to figure out what to do with the dynamic junk in _events.py – maybe we could just switch to attrs?

sigmavirus24 commented 5 years ago

@njsmith perhaps you've already thought of this and I'm just providing an option that you've ruled out, but urllib3/Requests use a Case Insensitive Dictionary implementation that allows you to access a key with whatever casing you please. It's probably no more performant than what you propose, but it might make reasoning about all of this a tiny bit simpler.

njsmith commented 5 years ago

@sigmavirus24 yeah, I am aware of that option, but there are a few things that make it an awkward fit for h11. First, h11 is pretty low-level, so it currently represents headers as a simple flat-list-of-tuples. This is part of the public API (!), and also turns out to be pretty natural for the kinds of operations that h11 needs to do. Plus there isn't really a compelling widely-agreed-on case-preserving-case-insensitive-multidict implementation that's an obvious choice to make part of our public API.

And, because h11 is so low-level, pretty much any cleverness like "using custom data structures" ends up adding a ton of overhead, in percentage terms. Which wouldn't be an issue for urllib3/requests, but people like to choose servers based on stupid reasons like 10% differences in requests/second in microbenchmarks. And h11 is competing with libraries that cut corners on correctness or are written in memory-unsafe languages like C. So I think it is worth jumping through some hoops to try to avoid giving people excuses not to switch.

So that's why I don't think that approach is a good fit for h11.

njsmith commented 5 years ago

It looks like someone posted here and then their comment was somehow deleted? Not sure what happened exactly.

Anyway, they said that they needed to preserve case in headers because there are Chinese companies that do signature verification on headers.

Of course the proper way to do signature verification on headers is to normalize the case before computing the signature. That's what AWS, for example, does when making signed requests. But I can believe that there might be folks out there who get this wrong. Though it's also possible that they do normalize case correctly, but this poster missed that detail. (It's an easy detail to miss; I bet the vast majority of AWS users don't realize that the signatures use normalized headers.)

If anyone knows any more about this, I would love to know:

any examples of sites that do this
confirmation that the signatures really do require case preservation
whether the case preservation is only needed on outgoing messages, or whether you also need it on incoming messages

sethmlarson commented 4 years ago

Another data point of a real-world service not treating headers case-insensitively: https://github.com/encode/httpx/issues/538#issuecomment-555120743

tomchristie commented 4 years ago

Another data point: https://github.com/encode/httpx/issues/728

whether the case preservation is only needed on outgoing messages, or whether you also need it on incoming messages

My feeling here would be that even if it's only "needed" on outgoing messages, there's probably? a UX argument in favour of case-preserving in both directions. (Or at least, I think that'd be helpful for HTTP clients, where I'd rather any debugging or technical output showed the literal bytewise headers, as they were sent by the server.)

Presumably an implementation would need to preserve the existing behavior by default.

Would an implementation that added h11.Connection(preserve_header_casing=<bool>) be acceptable here? The h11.Request() and h11.Response() objects themselves would have to always preserve header casing, with the configuration only existing at the connection level.

njsmith commented 4 years ago

Preserving the raw headers on incoming messages is tricky. I hear you on case being nice to have in debug output, but on the other hand... preserving non-meaningful case is also the root cause of all these bugs where people accidentally write case-sensitive programs.

One thing I've wondered is whether we should explicitly save the raw headers separately as a dedicated debugging feature (e.g. as a raw byte string, so you can see all the whitespace, header case, etc.). There's also some discussion in https://github.com/python-hyper/h11/issues/92 of doing more normalization of header values, which would be another case where separate debug output might be useful.

AcckiyGerman commented 4 years ago

I now got the same problem as @bradwood - trying to connect from trio-websocket to a websocket server, which returns HTTP's 400 error, when sees upgrade: websocket instead of Upgrade: websocket in headers. Monkey-patching the h11/_headers.normalize_and_validate method breaks the h11 state machine somehow, so after receiving the upgrage respond from the server, I got

h11._util.RemoteProtocolError: Received server _SWITCH_UPGRADE event without a pending proposal

UPDATE: Due to https://tools.ietf.org/html/rfc6455#section-1.3 , websocket opening handshake is:

GET /chat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
...... < and so on >

so RFC 6455 violates RFC 2616 :(

python-hyper / h11

Discussion: Capitalize the header names of outgoing headers. #31