plack / Plack

PSGI toolkit and server adapters
http://plackperl.org/

Can't get chunked requests to work #404

Open ThorbenJ opened 11 years ago

ThorbenJ commented 11 years ago

Hi,

I've been pulling my hair out all day over this... but it seems chunked request bodies are totally broken in various places of Plack.

I'll start at the top of the stack and work my way down:

Plack::Request

Even though HTTP::Body supports chunked bodies, if you give it an undef content length, _parse_request_body won't even give it the chance. Why Content-Type also has to be set, I don't know.

If Content-Length or Transfer-Encoding is set then there is a body.

You might also want to consider Trailer headers appearing inside the body, after the final chunk.
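As a sketch of the guard being described (my paraphrase of the behaviour, not the actual Plack::Request source):

sub _body_looks_parseable {    # hypothetical helper
    my ($env) = @_;
    return unless $env->{CONTENT_TYPE};           # why is this required?
    return unless defined $env->{CONTENT_LENGTH}; # undef for chunked bodies
    return 1;   # only now would HTTP::Body get a chance to parse
}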

Middleware

Given how Plack::Request is coded, I decided to write a quick middleware to de-chunk the body and set the headers it expects:

sub call
{
    my ($self, $env) = @_;

    # Only de-chunk when Transfer-Encoding says chunked and no
    # Content-Length is present (PSGI exposes it as CONTENT_LENGTH,
    # without the HTTP_ prefix).
    goto DECHUNK_EXIT unless (
        ($env->{'HTTP_TRANSFER_ENCODING'} // '') =~ /chunked/i
        && !defined $env->{'CONTENT_LENGTH'}
    );

    use bytes;
    my $buf  = Stream::Buffered->new();
    my $sock = $env->{'psgi.input'}; # also tried $env->{'psgix.io'}
    # TODO: test whether this is a real socket fd

    # The last call to _get_size returns 0 or undef.
    while (my $expect = _get_size($sock)) {
        my $data = _get_data($sock, $expect);

        # Short read: give up and pass the request through untouched.
        goto DECHUNK_EXIT unless (length($data) == $expect);
        $buf->print($data);
    }

    # Rewrite the environment so downstream code sees an ordinary body.
    $env->{'HTTP_TRANSFER_ENCODING'} =~ s/\s*chunked\s*,?//i;
    $env->{'CONTENT_LENGTH'} = $buf->size();
    $env->{'CONTENT_TYPE'} //= 'data/unknown'; # work around the Content-Type check in Plack::Request
    $env->{'psgi.input'} = $buf->rewind();

DECHUNK_EXIT:
    return $self->app->($env);
}

_get_size and _get_data are trivial sysreads with timeouts (I also tried read, i.e. Perl buffered IO).
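For reference, a minimal sketch of what such helpers might look like (the names match the middleware above, but the bodies are my assumption; real code would add the timeouts and handle chunk extensions after the size):

sub _get_size {
    my ($sock) = @_;
    my $line = '';
    # Read byte-by-byte up to the CRLF ending the chunk-size line.
    while (sysread($sock, my $c, 1)) {
        $line .= $c;
        last if $line =~ /\x0D\x0A\z/;
    }
    my ($hex) = $line =~ /^([0-9A-Fa-f]+)/;
    return defined $hex ? hex($hex) : undef; # 0 marks the last chunk
}

sub _get_data {
    my ($sock, $expect) = @_;
    my $data = '';
    # Keep reading until exactly $expect bytes have arrived.
    while (length($data) < $expect) {
        my $n = sysread($sock, $data, $expect - length($data), length($data));
        last unless $n; # EOF or error: the caller detects the short read
    }
    sysread($sock, my $crlf, 2); # consume the CRLF after the chunk data
    return $data;
}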

This did not work for every PUT on the same Plack server, nor consistently with any one server I tried.

HTTP::Server::PSGI

It expects to see the Content-Length header; otherwise it points psgi.input at a buffer containing whatever is left over from the header read (minus any part that was considered to be a header). It makes no attempt to deal with Transfer-Encoding, which could be OK, but that also makes it impossible for any middleware to deal with it.

It needs either to do non-greedy reads when reading the header, or to deal with the body; as it is, it does neither.

Twiggy::Server

Thought I'd try this as my test server as an alternative to HTTP::Server::PSGI, but it also makes the same wrong assumptions. My favourite being that it sets psgi.input to a null IO if there is no Content-Length header. Reads for the header are also greedy. I don't see any chance for middleware or apps to recover.

In common

psgix.io is unlikely to be at the correct position, and without knowing more, such as the header size, a seek is impossible (if the handle is even seekable).

Both implementations mix sysread and Perl buffered reads or <> (the same thing).

I don't think it is exotic to expect these servers (mostly there, as I see it, for test & development) to properly support HTTP 1.1, or at least to allow an app running under them to support it; at the very least chunked encoding. I've made no mention so far of connection keep-alive also being broken.

If I need to support HTTP 1.1, and most importantly chunked encoding:

  1. Should I give up on these servers?
  2. Are there others I could try?
  3. Will I have the same problems when moving to FCGI in a production env?

In case anyone is wondering, I've written my own WebDAV framework for Plack and it pretty much passes all litmus tests, but it is not playing well with Mac OS X, due to OS X's expectation that it can use chunked encoding for PUT. (WebDAV requires HTTP 1.1, and HTTP 1.1 requires chunked encoding, so OS X can't be faulted.)

Regards, Thorben

miyagawa commented 11 years ago

This is rather a long issue to reply to individually, but in general, dechunking an HTTP request should be done at the HTTP server level (be it a frontend server like nginx, or a PSGI server such as Starman), and at the PSGI level psgi.input should return the decoded body (and ideally the Content-Length header should point to the right value), i.e. transparent to the hosted PSGI application.

Your middleware sounds like the right approach for servers that do not support dechunking; however, I'm puzzled why you're using psgix.io instead of psgi.input. Maybe you can elaborate on how that doesn't work, in a separate issue or on the mailing list/IRC channel, specifically on that point.

ThorbenJ commented 11 years ago

As I said in my post, neither HTTP::Server::PSGI nor Twiggy::Server leaves psgi.input in a usable state.

For example, if Content-Length is missing (as it would be for a chunked request), Twiggy points psgi.input at $null_io, which itself is: open $null_io, '<', \'';

My middleware will get nothing from that. HTTP::Server::PSGI leaves psgi.input similarly broken.

miyagawa commented 11 years ago

I see. That can probably be remedied separately, but making a dechunked psgi.input out of psgix.io doesn't sound bad then.

ThorbenJ commented 11 years ago

Unfortunately psgix.io can't be reliably used either, because neither server is careful when reading the headers. They mix in buffered IO or read too far on their own. After a day of trying I never got it to work 100% with psgix.io, although that's better than the 0% with psgi.input.

ThorbenJ commented 11 years ago

Thanks for replying. From what you've said, my answers are: 1 & 2) yes, give up and use Starman or a full httpd; 3) no, with an HTTP server in front, chunked bodies should get reassembled without the app noticing.

I think for testing I can live with my middleware operating on psgix.io with Twiggy as the server. Not sure why it doesn't work for every file, but it'll do.

billmoseley commented 11 years ago

I ran into this today, too. In my case a Node.js server using "SuperAgent" was POSTing form-data to a Catalyst application with Transfer-Encoding: chunked. Catalyst::Request does not support chunked requests -- it assumes there's a content-length.

What I'm not clear on from this thread is whether Catalyst::Request should handle chunked or not -- or if it's the responsibility of the web server to "de-chunk". Is it true that since Catalyst reads from psgi.input, it's currently impossible for Catalyst to handle chunked?

I can run the Catalyst app with Starman and the chunked requests are converted to non-chunked requests. That works fine.

But Apache (what we use) passes the request through as-is, so the Catalyst app cannot process the request. Should Apache be de-chunking? Or is this the responsibility of the middleware?

What do you suggest I do to allow Apache to handle these chunked requests? And likewise for the Catalyst dev server?

ThorbenJ commented 11 years ago

As far as I understood, the server needs to de-chunk and pass a suitable psgi.input with a content-length.

You will find code implemented according to this assumption all over the place.

I wrote a middleware that mostly works with Twiggy, but not so well with HTTP::Server::PSGI. I have not tried it with Apache.

It simply takes psgix.io, which should be the original upstream socket, and creates a new psgi.input from it. This works as long as nothing has read psgix.io further than the headers; if something has, the first chunk header would be missed.
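In outline, the swap looks like this (dechunk_into_buffer is a hypothetical stand-in for the chunk-reading loop in the middleware earlier in this thread):

my $sock = $env->{'psgix.io'};         # the original upstream socket
my $buf  = dechunk_into_buffer($sock); # hypothetical; returns a Stream::Buffered
$env->{'psgi.input'}     = $buf->rewind;
$env->{'CONTENT_LENGTH'} = $buf->size;
delete $env->{'HTTP_TRANSFER_ENCODING'};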

jjn1056 commented 11 years ago

As far as support for this in Catalyst goes: if someone gets a broken test case to me, I would take on fixing it.

billmoseley commented 11 years ago


Is it even possible to fix in Catalyst? I took a stab at it last night (before finding this thread) and realized that ->read_chunk (psgi.input) didn't have anything to return when there's a chunked request.


jjn1056 commented 11 years ago

I'm not sure what it would take, but I was able to straightforwardly add a request cycle jailbreak to support raw psgix.io:

https://github.com/perl-catalyst/catalyst-runtime/commit/eb1f4b491f735a82e8f1279e96c5ba4f6bf5365c#L1L1795

and that let me build a websockets chat server in Catalyst, so I think it's quite possible. I just need a clearer idea of what is now busted.

billmoseley commented 11 years ago


OK, I wasn't clear about psgix.io from this thread -- I hadn't looked at that yet.

We should take this off the list, but basically Catalyst::Request always assumes a content-length header -- look at prepare_body. If there's no length it just sets $self->_body(0);
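Roughly, the logic looks like this (a paraphrase of the behaviour described above, not the actual Catalyst source):

# in prepare_body: no Content-Length means the body is never read
my $length = $self->content_length;
unless ($length) {
    $self->_body(0);   # chunked requests end up here
    return;
}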

If it helps, I was attempting to use the example (in php) in this post for parsing the stream. http://stackoverflow.com/questions/11125463/if-connection-is-keep-alive-how-to-read-until-end-of-stream-php

BTW -- I ran out of time last night looking at this but was curious how keep-alives and also compression might fit into this. I wanted to look at both compression in Apache config and also on our load balancer.


miyagawa commented 11 years ago

To clarify, my take on this is:

Having said that, middleware can act as a server (from the app's viewpoint) and as an app (from the server's viewpoint) - so if you can do it in middleware you can do it in an app too, but it's better if frameworks don't need to care about it.

It's unfortunate that middleware needs to fiddle with psgix.io - I'm happy to take a patch or failing test case to see if we can improve the non-supporting servers (HTTP::Server::PSGI and Twiggy) to leave psgi.input untouched.

billmoseley commented 11 years ago

@miyagawa, you say servers may dechunk, and Apache/mod_perl2 does. Calling $env->{'psgi.input'}->read does return the dechunked body. But Transfer-Encoding: chunked is still in the environment, which is rather confusing. (How would middleware know whether ->read returns chunked or dechunked data?)

It's not clear from the RFC what should happen, but from this example: http://tools.ietf.org/html/rfc2616#section-19.4.6 it appears that if Apache/mod_perl dechunks, then it should also remove the Transfer-Encoding: chunked header and add a Content-Length. Would you agree?

Where should this be handled? Should Plack::Handler::Apache2 read the request data (flushing to disk) and reset the headers? Or is this something that should happen in middleware?

BTW, when using the Catalyst development server with a chunked request, I also have to call ->read on psgix.io. If I use psgi.input I can read the header of the first chunk, and then the chunk data and trailing CRLF, but the next time I call ->read the method reports zero bytes read. Any idea what is going on there?

miyagawa commented 11 years ago

you say servers may dechunk, and Apache/mod_perl2 does. Calling $env->{'psgi.input'}->read does return the dechunked body. But Transfer-Encoding: chunked is still in the environment, which is rather confusing. (How would middleware know whether ->read returns chunked or dechunked data?)

Right, it's confusing because Apache's API auto-decodes them but there's no way for app developers to know that. It was already confusing before the PSGI environment, too. Look at HTTP::Body, which was used by Catalyst's request handling processor. It supports chunked encoding, but it does the decoding when Content-Length does not exist, instead of looking at Transfer-Encoding. It is a weird heuristic to determine whether the input was chunked or not.
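Schematically, the heuristic being described (a sketch; $content_type and $content_length stand for values taken from the request headers):

use HTTP::Body;

my $body = HTTP::Body->new($content_type, $content_length);

# Feed raw input; per the description above, it is the undefined
# length (not Transfer-Encoding) that switches on chunked parsing.
$body->add($raw_input);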

miyagawa commented 11 years ago

http://tools.ietf.org/html/rfc2616#section-19.4.6 It appears that if Apache/mod_perl dechunks then it should also remove the Transfer-Encoding: chunked header and add a Content-Length. Would you agree?

I wasn't aware of that pseudo code in the RFC; thanks for pointing it out. Yes, I agree that is much better and more transparent for the PSGI environment.
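In middleware terms, the normalization being agreed on here could look something like this (a sketch, assuming the server has already dechunked psgi.input and only the headers need repairing per RFC 2616 19.4.6; the package name is invented):

package Plack::Middleware::NormalizeDechunked;  # hypothetical name
use parent 'Plack::Middleware';
use Stream::Buffered;

sub call {
    my ($self, $env) = @_;
    if (($env->{HTTP_TRANSFER_ENCODING} // '') =~ /chunked/i) {
        # Spool the already-dechunked input to learn its length.
        my $buf = Stream::Buffered->new;
        while ($env->{'psgi.input'}->read(my $chunk, 8192)) {
            $buf->print($chunk);
        }
        delete $env->{HTTP_TRANSFER_ENCODING};
        $env->{CONTENT_LENGTH} = $buf->size;
        $env->{'psgi.input'}   = $buf->rewind;
    }
    $self->app->($env);
}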

jjn1056 commented 11 years ago

Right, HTTP::Body does spool some things out to disk and the rest into memory, which means sometimes we might create more than one temp file plus memory buffers for reading stuff from input. I do think that at the framework level (like Catalyst) we should not care whether the incoming request is chunked or not; however, the current way of reading all the input into a buffer can be inefficient for certain types of operations. One thing we'd like to add to Catalyst in an upcoming release is a way to access the body filehandle for POST/PUT and the like in a way that is raw (like with psgix.io, should it be supported) but not so raw that we need to handle the introspection and possible dechunking ourselves.

Ideally (or so I might be thinking, naively) there would be a way to 'tap' into the process that creates psgi.input, such that instead of everything getting written out to a buffer, we could provide a callback or file handle from the framework. Then we'd get the best of both worlds, by letting the server deal with the transport and all that (including detecting whether the incoming request is chunked or not) while avoiding being forced to buffer everything, which is really onerous if the incoming request is very large.
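Purely as a sketch of the idea (nothing like this exists in PSGI today; the psgix.input.tap key and $framework_parser are invented for illustration):

# The framework hands the server a sink before the body is read:
$env->{'psgix.input.tap'} = sub {       # hypothetical extension key
    my ($chunk) = @_;                   # dechunked data, as it arrives
    $framework_parser->add($chunk);     # e.g. an HTTP::Body instance
};
# A cooperating server would call the tap as data comes in, instead
# of spooling everything into psgi.input first.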

Although you can roll your own with psgix.io, that is a very heavy hammer to use, since it means you need to completely redo a lot of stuff. It's handy for jailbreaking when you need to do long-lived connection stuff like websockets, etc., but if you just want to parse a very large upload, it's very unhappy.

There's some complexity here regarding multipart, of course, but I think it's still doable...

miyagawa commented 11 years ago

When psgi.input is buffered, most servers use Stream::Buffered (or Plack::TempBuffer), which saves the request body to a temp file rather than memory when the size becomes large.
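For reference, the Stream::Buffered interface in question (a usage sketch):

use Stream::Buffered;

my $buf = Stream::Buffered->new;   # decides memory vs. temp file itself
$buf->print("some body data");     # call repeatedly as body data arrives
my $size = $buf->size;             # total bytes written so far
my $fh   = $buf->rewind;           # seekable handle, usable as psgi.input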

I know that chunked requests are often used when sending a large file from memory-restricted clients, but whether to buffer the incoming request or not is quite orthogonal to this thread (i.e. you could handle a chunked request while saving the input into a buffer, or stream the input as it comes).

miyagawa commented 11 years ago

There was a big thread on the psgi-plack mailing list a while ago where we discussed adding streaming input to Starman - the tl;dr is that it's quite complicated to make it transparent when it comes to detecting the end of the input.

jjn1056 commented 11 years ago

@miyagawa yes, I see what you mean, and some of my thoughts are admittedly not totally on topic; I'm just interested because of the semi-related Catalyst work I am trying to plan, and because @billmoseley first mentioned it to me since he was using Catalyst and trying to solve this type of problem. Now, in the latest version of Catalyst we expose psgix.io if it exists, and if you request that, we 'jailbreak' Catalyst and expect the caller to do everything top to bottom. However, I think in the case of Starman at least, if for example there was a huge chunked POST, even if we accessed psgix.io we'd still have read all the POST data into the stream buffer. Maybe as a baby step I should open a new ticket to request that psgi.input be deferrable in some way, so that someone who wanted to carry that burden could do so... Seems the only other option is to have a standalone web application running in a server directly, and just skip PSGI.

miyagawa commented 11 years ago

However, I think in the case of Starman at least, if for example there was a huge chunked POST, even if we accessed psgix.io we'd still have read all the POST data into the stream buffer.

Yes, you can detect that by looking at psgix.input.buffered.

Seems the only other option is to have a standalone web application running in a server directly, and just skip PSGI.

Or extend/patch/fork an existing server to support non-buffered input and access to it via a special key in $env, while the rest stays in PSGI and you can still benefit from all that - heck, that was the original intention of psgix.io.
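A minimal check of the psgix.input.buffered flag mentioned above (a sketch, assuming a PSGI $env):

if ($env->{'psgix.input.buffered'}) {
    # the server already spooled the whole body; psgi.input is
    # rewindable and reading it costs no extra network I/O
}
else {
    # psgi.input may be a live stream; read it incrementally
}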

jjn1056 commented 11 years ago


I do think there is a use case here for something in between psgix.io's "totally raw/metal, you have to parse it all" and fully buffered input: unbuffered input normalized to something like an unread filehandle, which would deal with the underlying complexities of chunked versus unchunked and whatever else might come down the pipe at some point. I guess it would be a bit complex around multipart as well. But I think this is off topic, and I will start a new thread after I go find that streaming discussion you mentioned and perhaps poke around the current art in Perl and other languages. I think writing up a real use case would be valuable - something that is not just speculation but an actual problem. After that I can ponder offering some sort of patch, although I doubt this is going to be as simple as 'starman --buffer.input=0'...

john