short reads - Githubissues

rbtcollins commented 9 years ago

So one common use with websockets is to exchange JSON docs back and forth. They are self delimiting, but doing byte-at-a-time parsing can be excruciatingly slow.

json.load does this:

return loads(fp.read(),
    cls=cls, object_hook=object_hook,
    parse_float=parse_float, parse_int=parse_int,
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

that is it reads the entire stream.

with the current stream definition folk need to do a loop on read(1), repeatedly trying to decode until they get a valid object. If we permitted the normal language read has in the io - 'up to N bytes' - then it would be safe to call read(65536) to read in a decent amount at once, and yet not block.

Lukasa commented 9 years ago

Does this fully eliminate the problem?

What is the JSON body is large such that it gets fragmented into many TCP segments, and the read is issued between the arrival of one segment and another? This doesn't really remove the need to confirm that a whole JSON body is present.

There are also problems the other way: if JSON bodies are very small and sent close together you may get multiple frames at once.

Really, what's required is a streaming JSON parser. This should be able to emit JSON tokens as they arrive. That, combined with a good non-blocking read, might do the trick.

rbtcollins commented 9 years ago

So I think it depends where we draw the parsing line - since websockets defines a framing layer, we can make it a server problem. I think we'll want short read support anyway, and short read support means folk can use the io buffer management library rather than having to invent something new.

Lukasa commented 9 years ago

Fair enough, I can agree with that.

GrahamDumpleton commented 9 years ago

The WSGI specification says about wsgi.input:

An input stream (file-like object) from which the HTTP request body bytes can be read.

People have potentially come to rely on the 'file-like object' status of it, including that read(n) will only return less than n bytes if the end of the stream has been reached. Changing the definition of it is probably ill advised.

I have investigated and implemented as an experiment the idea in mod_wsgi of a new readchunk() method on wsgi.input which had the partial read characteristics, but which would still block if no data was available in an attempt to come up with a way of having web sockets ride on top of a modified WSGI interface, after having upgraded the connection (remove the HTTP handling which in Apache consists of the HTTP_IN input filter), but came to the conclusion that trying to expand the ways data could be read from wsgi.input was just stupid and didn't provide a good solution.

Rather than trying to create a even more frankenstein version of WSGI, I believe web sockets should be handled through a separate interface altogether. That might be a low level interface where a raw socket is made available and async mechanisms are used, or a high level structured API just for web sockets. Either way, trying to merge functionality in WSGI itself to allow it is likely just to create a mess. If one accepts that, then the question of short reads for wsgi.input is moot as no change would be done to WSGI.

gvanrossum commented 9 years ago

In asyncio, we have read(n) which may return a partial read whenever it feels like (but still blocks if no data is available) and readexactly(n), which reads n bytes or raises an exception (IncompleteReadError, a subclass of EOFError).

OTOH in Python I/O streams we have read(n) which returns short reads only near EOF, and read1(n) which blocks only if no data is immediately available and then gives whatever it can from its buffer.

My personal preference is for the asyncio version, but I understand that WSGI has its own tradition. But it wouldn't be horrible to add the read1(n) interface.

On Wed, Oct 15, 2014 at 4:41 AM, Graham Dumpleton notifications@github.com wrote:

The WSGI specification says about wsgi.input:

An input stream (file-like object) from which the HTTP request body bytes can be read.

People have potentially come to rely on the 'file-like object' status of it, including that read(n) will only return less than n bytes if the end of the stream has been reached. Changing the definition of it is probably ill advised.

I have investigated and implemented as an experiment the idea in mod_wsgi of a new readchunk() method on wsgi.input which had the partial read characteristics, but which would still block if no data was available in an attempt to come up with a way of having web sockets ride on top of a modified WSGI interface, after having upgraded the connection (remove the HTTP handling which in Apache consists of the HTTP_IN input filter), but came to the conclusion that trying to expand the ways data could be read from wsgi.input was just stupid and didn't provide a good solution.

Rather than trying to create a even more frankenstein version of WSGI, I believe web sockets should be handled through a separate interface altogether. That might be a low level interface where a raw socket is made available and async mechanisms are used, or a high level structured API just for web sockets. Either way, trying to merge functionality in WSGI itself to allow it is likely just to create a mess. If one accepts that, then the question of short reads for wsgi.input is moot as no change would be done to WSGI.

— Reply to this email directly or view it on GitHub https://github.com/python-web-sig/wsgi-ng/issues/5#issuecomment-59193017 .

--Guido van Rossum (python.org/~guido)

python-web-sig / wsgi-ng

short reads #5