python-hyper / h11

A pure-Python, bring-your-own-I/O implementation of HTTP/1.1
https://h11.readthedocs.io/
MIT License
490 stars 62 forks source link

Preserve header casing #103

Closed tomchristie closed 4 years ago

tomchristie commented 4 years ago

A proposal for preserving header casing information in both directions, while still exposing lower-case only to users by default, and keeping careful type separation.

Essentially a take on @Lukasa's comment here https://github.com/python-hyper/h11/issues/31#issuecomment-281013856 and @njsmith's follow up...

write a data structure that is basically a tuple, but allows you to access headers either case sensitively or case insensitively (where case insensitively is the default)

Here's a really simple (maybe too simple) idea to throw into the mix: use tuples of (header, header_with_original_casing, value)

Here's how it looks to the end-user...

I've addressed this commit-by-commit, which should help make the approach I've taken here clear...

  1. Headers becomes a Sequence type, rather than a raw mutable list, but continues to store exactly the same information.
  2. Store raw casing information in the Headers type, but don't use it anywhere.
  3. Use title casing anywhere we're using get_comma_header, set_comma_header. Both functions continue to be case insensitive in their effects, but it will matter in the set_comma_header case because it'll give us nice header casing on the over-the-wire bytes. Switching both over within the codebase for consistency.
  4. Switch the writer to use the raw header casing.

Strictly speaking there is an API change here, in that .headers on events are now sequences, rather than plain lists. Any user code that is doing grungy stuff by mutating that data-structure in-place, wouldn't function after this. But that's a bit of a hacky broken thing to be doing anyway, so a version bump that tightened up the API spec into ".headers is an immutable sequence" seems reasonable enough right?

Anyway's putting this out there, so we've got something to discuss. 🤔

Thanks so much for maintaining such a fantastically careful & thoroughly designed library. It's a joy to work with. ✨

codecov[bot] commented 4 years ago

Codecov Report

Merging #103 into master will increase coverage by 0.01%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #103      +/-   ##
==========================================
+ Coverage   99.14%   99.15%   +0.01%     
==========================================
  Files          21       21              
  Lines         937      950      +13     
  Branches      173      173              
==========================================
+ Hits          929      942      +13     
  Misses          7        7              
  Partials        1        1              
Impacted Files Coverage Δ
h11/tests/test_connection.py 100.00% <ø> (ø)
h11/tests/test_io.py 100.00% <ø> (ø)
h11/_connection.py 100.00% <100.00%> (ø)
h11/_headers.py 100.00% <100.00%> (ø)
h11/_writers.py 88.57% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 1540843...a0eaaf8. Read the comment docs.

Kriechi commented 4 years ago

Just as a comment or maybe for additional inspiration: This is how we solved a similar problem in the mitmproxy project: https://github.com/mitmproxy/mitmproxy/blob/master/mitmproxy/net/http/headers.py

tomchristie commented 4 years ago

Okay, based on the feedback here's a slight retake on this, which I think exposes a neater interface and has a pretty minimal change footprint.

After this pull request...

Internally...

njsmith commented 4 years ago

What do you get from PYTHONPATH=. python bench/benchmarks/benchmarks.py before and after this change?

tomchristie commented 4 years ago

Ah thanks I'd just been looking into that, but I'd not seen the benchmark.

Just as a rough guideline, I've taken the following to get ideas of comparative performances of the different approaches here...

import h11
import timeit

def send_request():
    conn = h11.Connection(our_role=h11.CLIENT)
    headers = [
        (b'Accept', b'*/*'),
        (b'Accept-Encoding', b'gzip, deflate'),
        (b'Connection', b'keep-alive'),
        (b'Host', b'www.example.org'),
        (b'User-Agent', b'HTTPie/2.2.0'),
    ]
    request = h11.Request(method="GET", target="/", headers=headers)
    bytes_to_send = conn.send(request)

print(timeit.timeit(send_request, number=100000))

Which comes out with...

I'll take a look at the proper benchmarks now...

tomchristie commented 4 years ago

Existing...

$ PYTHONPATH=. venv/bin/python bench/benchmarks/benchmarks.py
7389.5 requests/sec
7457.9 requests/sec
7451.1 requests/sec
7445.4 requests/sec
7434.2 requests/sec
7428.8 requests/sec
7447.1 requests/sec

Headers as a sequence...

$ PYTHONPATH=. venv/bin/python bench/benchmarks/benchmarks.py
6393.3 requests/sec
6404.6 requests/sec
6369.1 requests/sec
6346.9 requests/sec
6372.4 requests/sec
6388.1 requests/sec
6403.9 requests/sec

Header instances...

$ PYTHONPATH=. venv/bin/python bench/benchmarks/benchmarks.py
5851.9 requests/sec
5897.9 requests/sec
5894.1 requests/sec
5873.7 requests/sec
5875.9 requests/sec
5871.4 requests/sec
5903.8 requests/sec
njsmith commented 4 years ago

The benchmark I cited isn't terribly clever, but it does exercise a full request/response cycle with some realistic headers.

tomchristie commented 4 years ago

Closing this off in favour of #104