WSGI todos should be applied

rbtcollins commented 10 years ago

https://mail.python.org/pipermail/web-sig/2010-September/004655.html has a list of things that should be made MUSTs in WSGIv2. We should apply them to the spec.

rbtcollins commented 10 years ago

I've applied all except " (In WSGI 2, language should be clarified to allow the input stream length and CONTENT_LENGTH to be out of sync, for reasons explained in Graham's blog post.)"

I don't understand this one well enough to apply it. I think it's suggesting that middleware might transform the body (e.g. doing TE on it, though PEP-3333 prohibits that). I couldn't find the blog post in question, unless it's http://blog.dscpl.com.au/2009_09_01_archive.html, but that doesn't suggest permitting uploaded content to mismatch CONTENT_LENGTH as far as I can tell. It seems like we want anything that is going to manipulate the uploaded content to strip CONTENT_LENGTH: CONTENT_LENGTH is solely framing for detecting disconnections anyway.

rbtcollins commented 10 years ago

Keeping this open until we can talk with Graham

GrahamDumpleton commented 10 years ago

This is related to where a request specifies a transfer encoding or a content encoding.

Now the WSGI specification says some things about CONTENT_LENGTH, but one thing it doesn't say anything about is what happens when CONTENT_LENGTH isn't supplied.

The CGI specification does however say:

4.1.2.  CONTENT_LENGTH

   The CONTENT_LENGTH variable contains the size of the message-body
   attached to the request, if any, in decimal number of octets.  If no
   data is attached, then NULL (or unset).

      CONTENT_LENGTH = "" | 1*digit

   The server MUST set this meta-variable if and only if the request is
   accompanied by a message-body entity.  The CONTENT_LENGTH value must
   reflect the length of the message-body after the server has removed
   any transfer-codings or content-codings.

and given that the WSGI specification itself didn't say anything about how things should be interpreted in the absence of CONTENT_LENGTH, things default back to this.

So what the CGI specification says is that if there is no CONTENT_LENGTH it should be interpreted that there was no data. Although not stated in the WSGI specification, WSGI applications therefore implement that, with CONTENT_LENGTH taken as 0 if it is not set or is an empty string.
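In application code, that CGI-derived reading might look roughly like the following. This is only a sketch: the helper name `cgi_style_body` is made up for illustration and is not taken from any particular framework.

```python
import io


def cgi_style_body(environ):
    """Read the request body the way the CGI rules imply.

    A missing or empty CONTENT_LENGTH is treated as 0, so nothing is read.
    (Hypothetical helper, for illustration only.)
    """
    length = int(environ.get('CONTENT_LENGTH') or 0)
    if length <= 0:
        return b''
    return environ['wsgi.input'].read(length)


# A body delivered without CONTENT_LENGTH is silently ignored.
environ = {'wsgi.input': io.BytesIO(b'hello world')}
body = cgi_style_body(environ)
```

The data is sitting in `wsgi.input`, but an application written this way never looks at it.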

When it came to chunked transfer encoding, though, the CGI specification also said:

   The CONTENT_LENGTH value must
   reflect the length of the message-body after the server has removed
   any transfer-codings or content-codings.

This requirement was never a practical one though, as it would require a web server to buffer up request content of an unknown length indefinitely. Thus web servers such as Apache simply never allowed chunked request encoding for a CGI script and would return an HTTP error of 'Length Required'. Same with mod_fcgid and mod_scgi: neither allows chunked request content.

In the beginning mod_wsgi didn't allow chunked request content either, again because it simply isn't practical to buffer the request content in order to calculate the actual content length so that the header can be set before the request is passed onto the WSGI application.

Because of the recurring requests to support chunked request content, though, support for it was subsequently added to mod_wsgi a number of years ago, back before the changes that eventually made it into PEP 3333 were discussed. In doing so it relied on users ignoring the rules about CONTENT_LENGTH inherited from the CGI specification. This is because mod_wsgi would not buffer the request content and would instead simply pass it through after it was dechunked, with CONTENT_LENGTH simply not being set.

Specifically, it relied on a user ignoring what the CGI specification said about interpreting the CONTENT_LENGTH to be 0 if not set or an empty string.

What it thus relied on was users doing something like:

    block_size = 8192

    data = input.read(block_size)
    while data:
        process(data)
        data = input.read(block_size)

In other words, it relied on simply reading wsgi.input until an empty string was returned and ignoring CONTENT_LENGTH as far as how much data was read.

Worth noting is that PEP 333 didn't actually guarantee that an empty string would be returned on end of valid input. In practice though a WSGI application had to deal with an empty string as meaning end of input. Further, a WSGI server had to guarantee that an empty string was returned on end of input as well.

The latter is because although the WSGI specification said a WSGI application shouldn't read more than CONTENT_LENGTH, you really couldn't trust WSGI applications. As such a WSGI server had to provide a limited stream for wsgi.input which guaranteed it. Some early WSGI servers didn't do this and instead provided the raw socket for wsgi.input.

This all would break for two reasons. The first is that a WSGI application could read more than CONTENT_LENGTH and, if it was an HTTP/1.1 connection with keep alive, it would then block indefinitely. The second is that if the WSGI application didn't read all the request content and left some behind, the WSGI server wouldn't know how much was read and so wasn't in a position to properly discard the remainder of the request content so as to be able to handle a subsequent request on the same HTTP/1.1 keep alive connection.
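A limited stream of the kind described above could be sketched as follows. The class name `LimitedStream` and its `discard` method are assumptions made for illustration, not the code of any actual WSGI server, and `io.BytesIO` stands in for the raw socket.

```python
import io


class LimitedStream:
    """Wrap a raw stream so that reads never go past `limit` bytes.

    Hypothetical sketch of what a server might hand over as wsgi.input.
    """

    def __init__(self, raw, limit):
        self.raw = raw
        self.remaining = limit

    def read(self, size=-1):
        if self.remaining <= 0:
            # End of valid input: return an empty string rather than
            # touching the underlying connection again.
            return b''
        if size is None or size < 0 or size > self.remaining:
            size = self.remaining
        data = self.raw.read(size)
        self.remaining -= len(data)
        return data

    def discard(self, block_size=8192):
        """Consume whatever the application left unread."""
        while self.read(block_size):
            pass


# Two requests back to back on one keep alive connection.
raw = io.BytesIO(b'first-request' + b'NEXT REQUEST')
stream = LimitedStream(raw, len(b'first-request'))
first = stream.read(5)   # the application reads only part of the body
stream.discard()         # the server skips the rest before the next request
leftover = raw.read()    # what remains belongs to the next request
```

Because the wrapper tracks how much has been consumed, it covers both failure modes: an over-reading application gets an empty string instead of blocking, and the server knows exactly how much is left to throw away.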

End result was that PEP 3333 added the new requirement that an empty string had to be returned for end of valid input. Part of the reason for pushing that requirement though was specifically as a companion to the change which was proposed related to CONTENT_LENGTH to support chunked request content.

That is, one wouldn't be able to do the above, where one simply reads to end of input, without that guarantee.

So this is the first part of why it was proposed that CONTENT_LENGTH in effect be advisory only and not definitive. That is, to be able to support chunked transfer encoding for request content.

There is a second reason though, and that is the case of content encoding for a request.

The common case here is users who want to compress request content.

The issue here is that the CONTENT_LENGTH in this case is that of the compressed content. If one had to ensure that the CONTENT_LENGTH is accurate when decompressing request content, then one would again have the problem of needing to buffer the complete request content, which as stated before isn't practical.

You have two alternatives of what you could do here.

The first is that any mutating input filter, as I will call them, can remove the CONTENT_LENGTH and then stream the now decompressed request content. With the new rule that a WSGI application should simply read to end of input, it would just slurp up the data and process it. This would be similar to the chunked request content case where there is no CONTENT_LENGTH.
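As a rough illustration of this first alternative, here is a minimal sketch of a mutating input filter written as WSGI middleware. The names `GunzipStream` and `GunzipMiddleware` are hypothetical, only `read()` is implemented, and it assumes the underlying wsgi.input returns an empty string at end of input.

```python
import gzip
import io
import zlib


class GunzipStream:
    """File-like wrapper that decompresses gzip data as it is read."""

    def __init__(self, raw, block_size=8192):
        self.raw = raw
        self.block_size = block_size
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self.decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self.buffer = b''
        self.eof = False

    def _fill(self):
        chunk = self.raw.read(self.block_size)
        if chunk:
            self.buffer += self.decoder.decompress(chunk)
        else:
            self.buffer += self.decoder.flush()
            self.eof = True

    def read(self, size=-1):
        read_all = size is None or size < 0
        while not self.eof and (read_all or len(self.buffer) < size):
            self._fill()
        if read_all:
            size = len(self.buffer)
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data


class GunzipMiddleware:
    """Decompress gzip request content and drop CONTENT_LENGTH."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ.get('HTTP_CONTENT_ENCODING', '').lower() == 'gzip':
            environ['wsgi.input'] = GunzipStream(environ['wsgi.input'])
            del environ['HTTP_CONTENT_ENCODING']
            # The decompressed size is unknown without buffering it all.
            environ.pop('CONTENT_LENGTH', None)
        return self.app(environ, start_response)


# Stand-in for a compressed upload arriving from a client.
payload = b'some request content ' * 1000
compressed = gzip.compress(payload)
```

The wrapped application only sees the complete content if it reads until an empty string is returned, since there is no longer any length to go by.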

A problem with this though is that in removing CONTENT_LENGTH you are discarding information. One thing it would block is the ability of a WSGI application downstream from the decompression to implement a check on the original size of the request content and potentially return a 413 Request Entity Too Large before consuming the request content.

The second alternative is therefore to leave CONTENT_LENGTH intact and accept that it is advisory only in respect of the original content length before any content encoding has been removed.
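With CONTENT_LENGTH left intact, a downstream size check of the kind mentioned above could be sketched like this. The function name `limit_upload` and the 1 MiB default are assumptions for illustration, and the check only bounds the size as originally sent, not the decompressed size.

```python
MAX_UPLOAD = 1024 * 1024  # assumed limit on the size as sent over the wire


def limit_upload(app, max_bytes=MAX_UPLOAD):
    """Reject oversized uploads using the advisory CONTENT_LENGTH."""

    def middleware(environ, start_response):
        try:
            declared = int(environ.get('CONTENT_LENGTH') or 0)
        except ValueError:
            declared = 0
        if declared > max_bytes:
            # Respond before any of the request content is consumed.
            start_response('413 Request Entity Too Large',
                           [('Content-Length', '0')])
            return [b'']
        return app(environ, start_response)

    return middleware
```

A request with no CONTENT_LENGTH at all passes straight through here, so this is not a complete defence against unbounded input.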

This second approach is what Apache itself does. That is, the mutating input filter need not be restricted to just a WSGI middleware handling decompression in Python.

For Apache one can use mod_deflate to decompress request content and that would be done even before it gets passed to a handler such as mod_wsgi.

Even outside of WSGI one can use mod_deflate with CGI scripts. In this case the CONTENT_LENGTH header is left intact and so is effectively wrong, but the complete decompressed input is still passed across. If a CGI script then believes CONTENT_LENGTH and only reads that much per the CGI specification, it will truncate the request input at the original compressed content length, with a subsequent loss of data. Same issue with mod_fcgid and mod_scgi.

The same problem obviously also occurs with mod_wsgi if mod_deflate is used.
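The truncation can be reproduced in a few lines. This is a simulation only: `io.BytesIO` stands in for the already decompressed wsgi.input, and the payload is invented for the demonstration.

```python
import gzip
import io

original = b'a fairly repetitive request body ' * 200
compressed = gzip.compress(original)

# What the application sees once an input filter has decompressed the
# content but left the original header value in place.
environ = {
    'CONTENT_LENGTH': str(len(compressed)),
    'wsgi.input': io.BytesIO(original),
}

# An application that trusts CONTENT_LENGTH stops reading too early.
trusted = environ['wsgi.input'].read(int(environ['CONTENT_LENGTH']))
lost = len(original) - len(trusted)
```

Here `trusted` is only a prefix of `original`, and `lost` bytes are never seen by the application.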

So we have two problem cases.

The first is chunked transfer encoding where there will be no CONTENT_LENGTH due to buffering of request content to calculate it being impractical.

The second is a content encoding such as gzip, where one doesn't want to discard CONTENT_LENGTH, but where its value can then be wrong as the actual amount of content can be more.

The proposed change therefore was to say that lack of CONTENT_LENGTH, or it being an empty string, doesn't imply that there is no data. Further, that a WSGI application shouldn't read up to CONTENT_LENGTH but should instead read until an empty string is returned.
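Applied to the two problem cases, the proposed rule could be sketched as below. The helper name `read_request_body` is hypothetical, and the sketch assumes the server guarantees an empty string at end of input.

```python
import io


def read_request_body(environ, block_size=8192):
    """Read wsgi.input until it returns an empty string.

    CONTENT_LENGTH is deliberately not consulted.
    """
    stream = environ['wsgi.input']
    chunks = []
    data = stream.read(block_size)
    while data:
        chunks.append(data)
        data = stream.read(block_size)
    return b''.join(chunks)


# Case 1: dechunked content, no CONTENT_LENGTH at all.
chunked_env = {'wsgi.input': io.BytesIO(b'dechunked body')}

# Case 2: decompressed content, CONTENT_LENGTH smaller than the stream.
inflated_env = {'CONTENT_LENGTH': '4',
                'wsgi.input': io.BytesIO(b'decompressed body')}

bodies = [read_request_body(chunked_env), read_request_body(inflated_env)]
```

Both bodies come back complete, whereas a reader bound to CONTENT_LENGTH would return nothing in the first case and four bytes in the second.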

As to the current state of mod_wsgi, the support for chunked transfer encoding was contentious enough at the time, due to it changing how one was to interpret CONTENT_LENGTH, that it was put behind an optional switch. Thus to enable it one had to set:

WSGIChunkedRequest On

This was a bit stupid in hindsight and I probably should have simply made it the default and been done with it. Part of the reason for that is that use of mod_deflate would put you in the same situation of having to ignore CONTENT_LENGTH anyway, even without setting some WSGI specific directive to have enabled something.

But then, it may have been fortuitous anyway that a directive was used for chunked request content support as it turned out that support for it in daemon mode of mod_wsgi was broken. The rejection of the change from PEP 3333, how PEP 3333 shut down all real discussion after and the frustration that resulted, meant I lost all interest in mod_wsgi for a few years and couldn't be bothered fixing up things for daemon mode. I do now have it working for daemon mode though, albeit the changes haven't been released as yet.

In the unreleased changes I was going to make support for chunked request content the default and it couldn't be turned off, but have reverted that again for now and rely on the directive being set.

As to other WSGI servers, the CherryPy WSGI server also supports chunked request content in the same way. That is, stream the data but where CONTENT_LENGTH is not set. For the CherryPy WSGI server this is the default behaviour. As such, if a WSGI application isn't ignoring CONTENT_LENGTH and is assuming that lack of it means 0, any chunked request content will be ignored, with the request seen as empty.

As to WSGI framework support, Armin Ronacher had enough requests that he changed Werkzeug to ignore CONTENT_LENGTH and simply read to end of input. This included support within the Werkzeug builtin WSGI server. I am not sure what version this was added in and whether he relied on it being able to detect in some way whether a WSGI server guaranteed that an empty string would be returned.

This highlights one of the mistakes in PEP 3333. It was originally proposed that wsgi.version be updated to be 1.1. This was also rejected. Had it been accepted, it would have allowed code such as that in Werkzeug to use version 1.1 as a guarantee that an empty string would be returned on end of input, and we could have simply ignored that PEP 3333 didn't actually change the meaning of CONTENT_LENGTH.

Armin and I discussed having some other WSGI environ key to denote that this sort of input, where CONTENT_LENGTH cannot be relied upon, is in play, but I wasn't too keen on the key name. I don't know what he ended up doing. My preference would have been for a generic wsgi.extensions key which would be a set of feature flags related to the underlying WSGI implementation itself, if we can't get it into the specification. This would be distinct from the idea of x-wsgiorg.* like keys, which relate to WSGI middleware riding on top.
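To make the wsgi.extensions idea concrete, here is a purely hypothetical sketch. No such key exists in any specification, and the flag name `read-to-end-of-input` is invented here for illustration only.

```python
# Invented flag name; nothing like it has been standardised.
READ_TO_END = 'read-to-end-of-input'


def server_guarantees_end_of_input(environ):
    """Check a hypothetical wsgi.extensions set of feature flags."""
    return READ_TO_END in environ.get('wsgi.extensions', ())


def choose_read_strategy(environ):
    if server_guarantees_end_of_input(environ):
        # Safe to ignore CONTENT_LENGTH and read until an empty string.
        return 'read-until-empty'
    # Otherwise fall back to the CGI-style reading of CONTENT_LENGTH.
    return 'trust-content-length'
```

A server advertising the guarantee would populate the set itself, so applications and frameworks would not need to guess from the server name or version.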

Finally for now, it should be noted that supporting chunked request content and mutating input filters such as mod_deflate has some important implications for any WSGI server implementation which involves proxying. This mainly relates to mod_wsgi daemon mode, but could conceivably apply to other implementations as well if they use that approach. I will not explain these issues in this update, but will do so in a later update.

Some of that discussion will relate to detection of premature end of connections in general as well though, so does extend beyond proxying arrangements as well.

Also, I am well aware of issues around DOS vectors based on unknown content length. I will also deal with that in a later update.

rbtcollins commented 10 years ago

HTTP/2 adds some more cases into this. TE is entirely banned in HTTP/2 - so for any request (or response for that matter) content-length MAY be set, even if streaming is in play, and if set must be the sum of the bytes sent for the message (or the message must be permitted to have it set with no body - e.g. a HEAD reply).

That is, TE (and thus chunked inputs) is a temporary aberration, we need to support it for years but it shouldn't be the common case we design for.

Would this work? Specifically can one still get at the original Content-Length header when using mod_deflate on requests...

  CONTENT_LENGTH is an optional variable that may be present. 
  If set to 0 there is no body that can be read. When present and 
  not 0 it is the value of the HTTP Content-Length header field.
  Note that the length of the bytes in the input stream may be different
  to (less than or greater than) CONTENT_LENGTH even when it is
  present: use of input filters such as mod_deflate (to decompress
  content-encoded uploads) will transform the input stream and does not
  replace CONTENT_LENGTH because doing so in some cases requires
  fully buffering arbitrarily large message bodies which is detrimental to
  performance and responsiveness, as well as precluding useful features
  such as the ability to error early on bodies that will upload too much data.
  Middleware that alters the input stream must ensure that CONTENT_LENGTH
  continues to obey these definitions.

GrahamDumpleton commented 10 years ago

If in HTTP/2 'content-length MAY be set', and thus is not 'MUST be set', then the lack of content length is going to be exactly the same as when chunked content is used now. Remember that in HTTP/1.1 with chunked content the WSGI application never deals with the actual chunking mechanism and always deals with the original unchunked stream anyway. So this whole general problem is still valid for HTTP/2, and possibly more so as there may well be a move to not using content length so much. That said, it is only relevant if we end up with an adapter between a new HTTP/2 API and a legacy WSGI application running on top. It is not relevant if we end up with two separate worlds and the HTTP/2 API is very different.

As to your suggested text, it doesn't seem to say anything specifically about the situation where CONTENT_LENGTH doesn't exist or it isn't clear enough. This is the case we have now and because of that people fell back to using the CGI definition. So a comment about that needs to be explicit to avoid confusion.

Anyway, we don't have to come up with exact text yet of any change. There are still many side issues related to this to explain first. :-)

python-web-sig / wsgi-ng

WSGI todos should be applied #3