pgjones / hypercorn

Hypercorn is an ASGI and WSGI Server based on Hyper libraries and inspired by Gunicorn.
MIT License

start_next_cycle() results in h11._util.LocalProtocolError: not in a reusable state #18

Closed alanbchristie closed 5 years ago

alanbchristie commented 5 years ago

A LocalProtocolError("not in a reusable state") exception is regularly raised from the start_next_cycle() call in hypercorn.asyncio.H11Server.recycle_or_close().

Could this be a state problem in hypercorn? A similar issue affected some example code of the h11 project (https://github.com/python-hyper/h11/issues/70). I'm seeing a lot of these exceptions (10 in a 22-hour period), but I'm not sure what the impact is on my code, as I've not finished investigating the side effects. At the moment I don't think I've lost any data, but what's causing the state exception?

asyncio ERROR # Exception in callback H11Server.recycle_or_close(<Task finishe...> result=None>)
handle: <Handle H11Server.recycle_or_close(<Task finishe...> result=None>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.7/site-packages/hypercorn/asyncio/h11.py", line 103, in recycle_or_close
    self.connection.start_next_cycle()
  File "/usr/local/lib/python3.7/site-packages/h11/_connection.py", line 204, in start_next_cycle
    self._cstate.start_next_cycle()
  File "/usr/local/lib/python3.7/site-packages/h11/_state.py", line 298, in start_next_cycle
    raise LocalProtocolError("not in a reusable state")
h11._util.LocalProtocolError: not in a reusable state

I'm using hypercorn 0.6.0, which pulls in h11 0.8.1.

pgjones commented 5 years ago

This is likely to happen when a connection is kept alive and re-used; I think it would result in the connection closing (potentially during usage). It is hard to know what might cause this, though. Are there any clues in the requests arriving before these errors?

njsmith commented 5 years ago

It would be helpful to know the value of self.connection.states when the error happens.

alanbchristie commented 5 years ago

I don't think I'm losing any data but the log has a significant number of these exceptions. Is there any existing log I can enable that might help?

alanbchristie commented 5 years ago

I've made a temporary (local) patch to the underlying h11 module and the exception's happened twice in the last few minutes. The client/server states (at the time of failure) are: -

The CLIENT state of SEND_BODY is the one upsetting h11.
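
For context, a hypothetical debug shim of the kind described above (this is not the actual local patch; it simply logs h11's state map whenever recycling fails):

import logging

import h11

log = logging.getLogger("h11-debug")

_original_start_next_cycle = h11.Connection.start_next_cycle

def _logged_start_next_cycle(self):
    try:
        return _original_start_next_cycle(self)
    except h11.LocalProtocolError:
        # Record which side is not DONE before re-raising.
        log.error("start_next_cycle failed, states=%r", self.states)
        raise

h11.Connection.start_next_cycle = _logged_start_next_cycle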

pgjones commented 5 years ago

Ah, the issue here is that Hypercorn will happily send the response if instructed to do so before the request body has been fully received. This is allowed in RFC 7230 (section 6.5) if the response is an error code, but it is rarely implemented.

I'll take a look at other servers; I'm uncertain whether the server or the application should do the waiting for the body. It may be something that should be specified in ASGI.
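
For intuition, a minimal h11-only sketch of the sequence described above (illustrative only, not Hypercorn's actual code path): the server answers before the request body has arrived, the client side never leaves SEND_BODY, and start_next_cycle then refuses to recycle:

import h11

conn = h11.Connection(our_role=h11.SERVER)

# The client starts a request that promises a body it never finishes sending.
conn.receive_data(
    b"POST /inbound HTTP/1.1\r\n"
    b"Host: example\r\n"
    b"Content-Length: 10\r\n"
    b"\r\n"
)
assert type(conn.next_event()) is h11.Request
assert conn.next_event() is h11.NEED_DATA  # the body has not arrived yet

# The server responds early, before reading the body.
conn.send(h11.Response(status_code=400, headers=[("Content-Length", "0")]))
conn.send(h11.EndOfMessage())

print(conn.states)  # the CLIENT side is still in SEND_BODY

try:
    conn.start_next_cycle()
except h11.LocalProtocolError as exc:
    print(exc)  # "not in a reusable state"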

njsmith commented 5 years ago

Definitely do look at other servers to see what the state of the art is. But I think the way this is supposed to work is:

njsmith commented 5 years ago

Maybe useful for intuition: the classic case of an early close is something like:

PUT /blah/blah HTTP/1.1
Content-Length: 999999999999
HTTP/1.1 403 Quota exceeded

You gotta cut this person off before they send you a terabyte of data. No way do you want to read all that in before telling them that it's too big. (Or like, if they need to give their password first, you don't want to transfer a terabyte and then tell them they need to log in and then upload it again.)

This is also the case that motivated the invention of 100 Continue – it lets a well-behaved client ask whether it should even try to upload the request body before it starts sending stuff. But not all clients support 100 Continue and a server still needs to be able to cope with this situation.
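
For reference, a hedged sketch of what honouring Expect: 100-continue can look like at the h11 level (the helper name is invented; this is not Hypercorn's code):

import h11

def maybe_send_100_continue(conn: h11.Connection) -> bytes:
    # h11 tracks whether the client sent "Expect: 100-continue" and is
    # waiting; if so, emit the interim response before reading the body.
    if conn.they_are_waiting_for_100_continue:
        return conn.send(h11.InformationalResponse(status_code=100, headers=[]))
    return b""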

alanbchristie commented 5 years ago

In my particular case the client sends nothing to the server once connected, it just sits on a recv() call collecting messages sent to it.

The server on the other hand passes small messages (of the order hundreds of bytes) to the client every 10 minutes. It's bursty, with up to 1,000 small messages being sent to each connected client at 10-minute intervals.

njsmith commented 5 years ago

In my particular case the client sends nothing to the server once connected,

Hmm. If there is no request body and the server doesn't read it, then h11 may have all the request body events buffered up internally, and just be complaining that you're trying to start the next cycle without having consumed them.

hypercorn might want to call next_event until exhausted, before calling start_next_cycle, just to flush out anything like this. (Unless it does already.)
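
A sketch of that suggestion, assuming direct use of the h11 API (this is not Hypercorn's actual implementation):

import h11

def drain_and_recycle(conn: h11.Connection) -> bool:
    # Consume any buffered Data/EndOfMessage events left over from the request.
    while True:
        event = conn.next_event()
        if event is h11.NEED_DATA or event is h11.PAUSED:
            break
        if type(event) is h11.ConnectionClosed:
            return False
    # Only recycle if both sides have actually finished their messages.
    if conn.states == {h11.CLIENT: h11.DONE, h11.SERVER: h11.DONE}:
        conn.start_next_cycle()
        return True
    return False  # not reusable yet; wait for more data or close the transport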

pgjones commented 5 years ago

@njsmith ASGI doesn't have a concept of 100-continue and Hypercorn calls next_event until exhausted whenever data is received.

@alanbchristie What framework are you using? Any chance it is over-counting the content-length?

I've pushed 4901eedeeb9673cfcc13068beaf479ba57a8b93a, which will close the connection early, as a workaround.

alanbchristie commented 5 years ago

The server (a mock of a more complex Java app I use) can be found at https://gitlab.com/matilda.peak/chronicler-stub.git. It's one file (chronicler.py) and exposes the web-socket via the URI ws://<blah>/channel/<N>. It's a Quart app. Once a connection is made it asserts that the connecting header contains a special key and then runs a loop that tries to read from a queue, either sending a message from that queue or sleeping for a short period (both using await). It essentially transmits data to the connected client and does not expect anything in return.

The Python client (currently a private repo) uses websockets 5.0 and essentially connects to the URI with: -

async with websockets.connect(self._ws_uri, extra_headers=connect_hdr) as web_socket:

And then enters a loop receiving from the websocket: -

message = await web_socket.recv()

The client/server appear to work as expected. The original data is written to chronicler.py through REST calls to its /inbound route API, placed into a channel-specific queue and passed to the client via the call to send().
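
Putting those two fragments together, a minimal sketch of the client loop as described (the URI, header name and value are placeholders, not the real ones from the private repo):

import asyncio

import websockets

async def listen(ws_uri: str) -> None:
    connect_hdr = {"X-Special-Key": "placeholder"}  # hypothetical header
    async with websockets.connect(ws_uri, extra_headers=connect_hdr) as web_socket:
        while True:
            # The server pushes messages; the client only ever receives.
            message = await web_socket.recv()
            print(f"received {len(message)} bytes")

# asyncio.get_event_loop().run_until_complete(listen("ws://localhost:8000/channel/1"))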

njsmith commented 5 years ago

ASGI doesn't have a concept of 100-continue

Interesting! When you get a request with Expect: 100-continue, do you (1) automatically send 100 Continue immediately, (2) wait until the user calls await receive(), (3) ignore it entirely, (4) something else?

In principle (2) is probably optimal, but it has some user-visible effect. In particular, it means that if you send an early response, you might not be able to read the request body at all. But I guess that's not something that makes sense to allow anyway (the client may well disappear after your early response). So probably ASGI should formally forbid calling receive after calling send.

ASGI should have a way for the app to send arbitrary 1xx responses, but IIRC it doesn't. Adding that would also complicate 100 Continue support, at least in that you'd have to specify how they interact...

and Hypercorn calls next_event until exhausted whenever data is received.

Interesting! That's not how I was imagining it would work, but if you're not relying on checking the h11 state attributes then I guess it works fine.
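
For clarity, the pattern being discussed looks like this at the ASGI level: an application that answers before ever awaiting the request body (a minimal sketch, not taken from any particular framework):

async def app(scope, receive, send):
    assert scope["type"] == "http"
    # Reject immediately (e.g. on a missing auth header) without reading the body.
    await send({
        "type": "http.response.start",
        "status": 403,
        "headers": [(b"content-length", b"0")],
    })
    await send({"type": "http.response.body", "body": b""})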

njsmith commented 5 years ago

@alanbchristie these logs would be coming from REST calls where your app doesn't consume all the data before sending a response. Maybe requests with failed authentication or something?

alanbchristie commented 5 years ago

The exceptions are in the chronicler-stub container, which receives REST data on its /inbound handler; the data's shunted to the connected web-sockets, through connection-specific queues, to the /channel/<N> handler. Each web-socket, when established, sits in the while True loop getting an event from its queue, then sending or sleeping: -

if event_message:
    await websocket.send(event_message.SerializeToString())
else:
    await asyncio.sleep(4)

Once connected, the process on the other end of the web-socket is expected to be permanent, i.e. the other end of the socket is not expected to close the connection or stop calling recv().
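
A self-contained sketch of that handler shape (the route, queue wiring and header check are illustrative; this is not the actual chronicler.py):

import asyncio

from quart import Quart, websocket

app = Quart(__name__)
queues = {}  # channel number -> asyncio.Queue

@app.websocket("/channel/<int:channel>")
async def channel_handler(channel):
    queue = queues.setdefault(channel, asyncio.Queue())
    # The real app also asserts a special key in websocket.headers here.
    while True:
        try:
            event_message = queue.get_nowait()
        except asyncio.QueueEmpty:
            event_message = None
        if event_message:
            await websocket.send(event_message)  # the real app sends event_message.SerializeToString()
        else:
            await asyncio.sleep(4)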

Here's a fragment of the chronicler-stub log. There's a burst of inbound messages at 07:50 (I don't log all the messages, just every 100th). You can see the exception occurs while 100 messages are being transmitted, between 07:50:38 and 07:50:47: -

It's important to understand that there are two active web-sockets: two recipients have connected, one to /channel/1 and one to /channel/2. So there are two independent queues, and the 100 messages (100 in each queue) are sent to their respective web-sockets during this time.

2019-04-16T07:50:38.781Z chronicler INFO # ---- inbound
2019-04-16T07:50:38.782Z chronicler INFO # websockets: 2
2019-04-16T07:50:38.784Z chronicler INFO # 1100 in, 2200 out, 0 dropped
2019-04-16T07:50:38.786Z chronicler INFO # Current message: {'axis': '178', 'name': 'Warwick Square, Pimlico', 'bikes': '2', 'spaces': '17', 'docks': '19', 'working docks': '19', 'broken docks': '0', 'time': '2019-04-16T07:46:44Z'}
2019-04-16T07:50:46.831Z asyncio ERROR # Exception in callback H11Server.recycle_or_close(<Task finishe...> result=None>)
handle: <Handle H11Server.recycle_or_close(<Task finishe...> result=None>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.7/site-packages/hypercorn/asyncio/h11.py", line 103, in recycle_or_close
    self.connection.start_next_cycle()
  File "/usr/local/lib/python3.7/site-packages/h11/_connection.py", line 204, in start_next_cycle
    self._cstate.start_next_cycle()
  File "/usr/local/lib/python3.7/site-packages/h11/_state.py", line 298, in start_next_cycle
    raise LocalProtocolError("not in a reusable state")
h11._util.LocalProtocolError: not in a reusable state
2019-04-16T07:50:47.083Z chronicler INFO # ---- inbound
2019-04-16T07:50:47.084Z chronicler INFO # websockets: 2
2019-04-16T07:50:47.086Z chronicler INFO # 1200 in, 2400 out, 0 dropped
2019-04-16T07:50:47.087Z chronicler INFO # Current message: {'axis': '352', 'name': 'Vauxhall Street, Vauxhall', 'bikes': '15', 'spaces': '4', 'docks'
pgjones commented 5 years ago

@alanbchristie could you take a look at this with the latest master? I think it is more robust now (I've finished a fairly large refactor).

alanbchristie commented 5 years ago

Will do. I should have some time this Fri/Sat.

alanbchristie commented 5 years ago

I've embedded the latest master into my app, which has been running for about 90 minutes. It's processed 4,200 messages, passing 1,987 through to the connected websocket. That's all as expected so everything's fine from that perspective.

And...

There are no examples of the original fault so far. :-)

But ... the following has occurred twice: -

2019-06-29T12:00:47.142Z asyncio ERROR # Task exception was never retrieved
future: <Task finished coro=<worker_serve.<locals>._server_callback() done, defined at /usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0-py3.7.egg/hypercorn/asyncio/run.py:90> exception=AttributeError("'NoneType' object has no attribute 'handle'")>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0-py3.7.egg/hypercorn/asyncio/run.py", line 91, in _server_callback
    await Server(app, loop, config, reader, writer)
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0-py3.7.egg/hypercorn/asyncio/server.py", line 98, in run
    await self._read_data()
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0-py3.7.egg/hypercorn/asyncio/server.py", line 118, in _read_data
    await self.protocol.handle(RawData(data))
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0-py3.7.egg/hypercorn/protocol/__init__.py", line 60, in handle
    return await self.protocol.handle(event)
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0-py3.7.egg/hypercorn/protocol/h11.py", line 106, in handle
    await self._handle_events()
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0-py3.7.egg/hypercorn/protocol/h11.py", line 154, in _handle_events
    await self.stream.handle(Body(stream_id=STREAM_ID, data=event.data))
AttributeError: 'NoneType' object has no attribute 'handle'

Annoyingly, I discovered that I didn't update the __version__ = "0.6.0" value in your __about__.py, so it appears as though I've been using the official 0.6.0 (which I haven't). Is it worth getting into the habit of changing the version number on your release branch (master) immediately after each official release? i.e. update master to 0.7.0-alpha or some such now, to avoid this confusion.

pgjones commented 5 years ago

Thank you! I think I've fixed this new error now, with f6d6595b6f7246f9aa26597870028e96f42718d6. Could you check with that (I've bumped to 0.6.0+dev)? It also raises this issue, which I'll look into fixing next.

alanbchristie commented 5 years ago

Rebuilt with Hypercorn==0.6.0+dev but still seeing...

2019-06-30T16:40:34.096Z asyncio ERROR # Task exception was never retrieved
future: <Task finished coro=<worker_serve.<locals>._server_callback() done, defined at /usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0+dev-py3.7.egg/hypercorn/asyncio/run.py:91> exception=AttributeError("'NoneType' object has no attribute 'handle'")>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0+dev-py3.7.egg/hypercorn/asyncio/run.py", line 92, in _server_callback
    await Server(app, loop, config, reader, writer)
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0+dev-py3.7.egg/hypercorn/asyncio/server.py", line 97, in run
    await self._read_data()
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0+dev-py3.7.egg/hypercorn/asyncio/server.py", line 117, in _read_data
    await self.protocol.handle(RawData(data))
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0+dev-py3.7.egg/hypercorn/protocol/__init__.py", line 60, in handle
    return await self.protocol.handle(event)
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0+dev-py3.7.egg/hypercorn/protocol/h11.py", line 106, in handle
    await self._handle_events()
  File "/usr/local/lib/python3.7/site-packages/Hypercorn-0.6.0+dev-py3.7.egg/hypercorn/protocol/h11.py", line 156, in _handle_events
    await self.stream.handle(Body(stream_id=STREAM_ID, data=event.data))
AttributeError: 'NoneType' object has no attribute 'handle'

Just started the app, seen one so far.

alanbchristie commented 5 years ago

After a 5-hour run with 15,420 messages passed to 2 web-sockets, I've seen three of the AttributeError exceptions. That's with the app running in a python:3.7.3-alpine3.10 container.

pgjones commented 5 years ago

Thanks again for testing; it is very helpful. That said, I'm not sure what is causing the AttributeErrors; I'll have to think again.

alanbchristie commented 5 years ago

No problem. I don't think I'm losing any messages, but if there's more debugging you need to add I can re-run it at some point. I'm not an asyncio expert (yet), though.

pgjones commented 5 years ago

As the error occurs when trying to send a Body event rather than a Data event, I think this is related to HTTP requests. Does your setup also handle HTTP requests (aside from the WebSocket ones)? If so, I think it might be related to recycling the connection, and I've just fixed a bug which may solve this problem.

pgjones commented 5 years ago

Actually the above would require the client to be using pipelining, which seems unlikely. Must be something else.

pgjones commented 5 years ago

I think I've got this in 13bbb7faeebed7462449d751b0e198079c120755; it is a question of Hypercorn responding before the client has finished (which isn't a problem, as long as Hypercorn doesn't error as the client finishes). Could you try with the latest master?

alanbchristie commented 5 years ago

I will try again tonight...

alanbchristie commented 5 years ago

Sorry - I have not been able to do this tonight. All sorts of trouble accessing the GitLab registry. This will have to wait until the weekend, when I have time to understand what's going on with my registry tokens.

Are the changes in 0.7.0?

pgjones commented 5 years ago

They are. Thanks again for trying.

alanbchristie commented 5 years ago

OK - so the new build's been running from the latest master for about 90 minutes now, with 2,700 messages processed and no exceptions. I think we can declare this one fixed?

I'll rebuild with the official release over the weekend.

Neat.

alanbchristie commented 5 years ago

...and have been running with 0.7.0 for most of the day: 28,776 messages and no exceptions. Yep - we can close this defect now. Thanks.

pgjones commented 5 years ago

Excellent, thank you for the continual testing.