python-hyper / h2

HTTP/2 State-Machine based protocol implementation
https://h2.readthedocs.io/en/stable
MIT License
967 stars 155 forks source link

Add support for sendfile() based DATA frames. #236

Open Lukasa opened 8 years ago

Lukasa commented 8 years ago

Right now there is no support for HTTP/2 using sendfile to send DATA frames. This is potentially inefficient for implementations that are able to send really sizeable DATA frames. Given that sendfile allows users to send fixed length data, it would be nice to provide some way to say "send these bytes, then do a sendfile with this fobj and this length".

@njsmith's h11 library has this capacity, but this relies on the fact that h11 has no internal buffer: each "event" call returns the bytes required for that event directly without writing into a buffer. This allows for the subtle changing of return value in comparison to send, which is not so naturally achievable with h2.

Coming up with a good design here is a bit tricky. It may be that this should be an optional switch on the H2Connection class that affects the return value of data_to_send(), changing it to be an iterable of bytes and sentinel objects where each sentinel object is . Alternatively, we could go further and say that not only does the optional switch need to be enabled but there is also a separate "get the data I need" function that conforms to this new API.

Another possible option that leaps out to me is to have a subclass (eww) or some other type that implements this support as a wrapper around the base H2Connection object.

The final possible option is a dramatic API change. That changes the signature of receive_data to return two values: the events iterable and any bytes that may need to be sent. If we do that, we can then remove hyper-h2's internal buffer and then delegate entirely to the calling code by having all do_event() type functions simply return the data they generate rather than storing it internally. That's a very large API break, but it allows supporting this use-case more cleanly by simply emulating what h11 does.

I'd like opinions from @python-hyper/contributors. Any thoughts on API choices? Is this worth supporting at all?

Lukasa commented 8 years ago

Another option is to have data_to_send() be called repeatedly until exhaustion, and each time it will return either bytes or the sentinel object. That allows for keeping the API change relatively small and isn't entirely absurdly surprising. I'd still want to ensure that sendfile support is explicitly opted in to, because otherwise the API represents a potentially surprising footgun. Essentially, you set a flag to say "I know what I'm doing", and then do it.

Lukasa commented 8 years ago

While we're here, we should note the caveats of sendfile in a H2 environment. In particular, we have the following constraints:

This somewhat increases the overhead and decreases some of the utility of sendfile.

jchampio commented 8 years ago

This somewhat increases the overhead and decreases some of the utility of sendfile.

Agreed. But I think the ability to send large resources from disk without buffering them in Python would be a good thing in and of itself, even if there were no speed/performance improvements.

Lukasa commented 8 years ago

The other concern about sendfile, as pointed out by @glyph, is that sendfile does not work when combined with a userspace TLS implementation like OpenSSL (at least at this time). That further limits the utility of this change to HTTP/2 in plaintext.

jchampio commented 8 years ago

Ugh... that is an excellent point. The danger of doing my development work in plaintext is that I sometimes forget that browers don't speak it...

So, for now, such a feature would only generally be useful for people who have non-browser clients and don't want/need TLS. Still a use case that exists, but much more niche than I was originally imagining.

Lukasa commented 8 years ago

Yup, agreed. I think it's still worth doing, especially as there's the possibility of AF_ALG-based sockets being used in the future which would be compatible with sendfile.

njsmith commented 8 years ago

I have to admit that h11's sendfile support is more of a cute trick than anything carefully considered. I am somewhat dubious about whether one can get all the other python overhead low enough for sendfile to actually make any difference (cf Amdahl's law), but would be very interested to hear if anyone tries it. In the mean time h11 mostly just supports sendfile because I realized it would be trivial to do :-)

rbtcollins commented 8 years ago

The main attraction for sendfile is zero-copy overhead (which can be gotten other ways), but yeah any crypto will imply at least one read of the memory. I'm extremely dubious that directly supporting sendfile makes sense: I'd much rather make sure we can deliver a zero-spurious-copies guarantee, that is that we won't do anything daft like take slices of the input data - and instead make sure we use memoryviews and the like right up until its handed off to TLS.

njsmith commented 8 years ago

I'd much rather make sure we can deliver a zero-spurious-copies guarantee, that is that we won't do anything daft like take slices of the input data - and instead make sure we use memoryviews and the like right up until its handed off to TLS.

So in a sense, this is exactly how h11's sendfile support works. Normally for convenience, h11.Connection.send returns a single flat buffer, which necessarily implies some copying to concatenate framing and payload. But there's also a send_with_data_passthrough method which returns a list of bytes-like objects. (Regular send is just a tiny wrapper that calls this method and then returns b"".join(fragments).) And we guarantee that if you pass in a payload buffer, then it will pass through the machinery and exactly the same object will come out the other side. If you pass in a memory view, then you get out that same memory view with no copying.

Then we also make the guarantee that the only thing we do with payload buffers is call len on them, so you can also pass through opaque handles pointing to data on disk if you want. But that's the easy part: once you have an interface that can support zero copying, then you can also support sendfile.

The problem for h2 is that the data_to_send interface can't do either, because it always flattens data into a single buffer, which in python generally requires two copies (once to append to the buffer, and again to pull stuff out).

glyph commented 8 years ago

We used to be very careful about this sort of thing in Twisted, and it was one of the major regressions in Python 3 that memoryview was missing a bunch of necessary features to get the appropriate zero-copy (or, really, one-copy) guarantees that we wanted. I am fairly sure that it's all fixed now, but someone would need to do the investigation to see what version of python 3 introduced all the necessary operations; this may involve carrying a compat shim or dropping <3.4 or something like that.

glyph commented 8 years ago

(Calling it a "regression" because the buffer builtin, now gone in favor of memoryview, did handle these cases correctly)

Lukasa commented 8 years ago

So data_to_send really only implies one spurious copy unless you use the optional amt argument. If you leave that argument as None we don't copy the buffer, we just hand it off to you, no questions asked. That means the only copy is from the byte object originally provided to send_data into the serialized buffer. However, that's also exactly one copy, which all things considered is pretty damn good.

Lukasa commented 7 years ago

I'm postponing this to 4.0.0.