kylemacfarlane commented 6 years ago

If you put a uWSGI websocket server using websocket_recv_nb as shown in tests/websockets_chat.py under heavy load then a couple errors start appearing. On the other hand using websocket_recv inside gevent.spawn and applying gevent monkeypatching as shown in tests/websockets_chat_2.py appears to work fine.

First of all I don't think it's caused by load per se. I think the problem is due to messages arriving at the same time which just happens more frequently under heavy load. To recreate the issue I hit the server with two messages at the same time every 100ms.

First Symptom

You will encounter "delayed" messages as reported in #1241. To solve this you can simply iterate over websocket_recv_nb until it returns None.

Second Symptom

You will start getting this error when reading from the websocket:

no PONG received in 3 seconds !!!

This was reported in #1533. How can there be no PONG response when the client and server are talking at 20 messages per second? Even aggressively reading from the websocket as much as possible doesn't solve the problem.

Third Symptom

You will start getting this error when reading from Redis:

OSError: unable to fd 12 to the event queue
epoll_ctl(): Bad file descriptor [core/event.c line 638]

This was reported on StackOverflow and the uWSGI maillist:

Other Errors

If you start playing around there are other errors that can happen. For example if you suppress the OSError from Redis then next time you try to read the websocket you will get an fd error from that as well. But the above are the three errors you will most commonly encounter if you copy the examples in the tests directory: https://github.com/unbit/uwsgi/tree/master/tests

Testing

I experimented with various settings and the following are the outcomes. I didn't try asyncio because the docs say it's experimental.

uWSGI 2.0.15
uwsgi.wait_fd_read based on tests/websockets_chat_async.py
Run with --async 100 --ugreen
Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
❌Connection usually fails within 0-2 minutes

.

uWSGI 2.0.15
uwsgi.async_sleep instead of waiting for any fds
Run with --async 100 --ugreen
Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
❌Connection usually fails within 0-2 minutes

.

uWSGI 2.0.15
gevent.select based on tests/websockets_chat.py
Run with --gevent 100 --gevent-monkey-patch
Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
❌Connection usually fails within 5-15 minutes

.

uWSGI master (f74553dbf56dcac168455cb0705a3533be94cf01)
uwsgi.wait_fd_read based on tests/websockets_chat_async.py
Run with --async 100 --ugreen
Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
❌Connection usually fails within 0-2 minutes

.

uWSGI master (f74553dbf56dcac168455cb0705a3533be94cf01)
gevent.select based on tests/websockets_chat.py
Run with --gevent 100 --gevent-monkey-patch
Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
❌Connection usually fails within 0-2 minutes
Worse than gevent on 2.0.15 because it seems to have other bugs and will encounter other random errors such as:
- KeyError: 'env'
- SystemError: <built-in function uwsgi_gevent_request> returned a result with an error set

.

uWSGI master (f74553dbf56dcac168455cb0705a3533be94cf01)
gevent.spawn and request_context loosely based on tests/websockets_chat_2.py
Use non-blocking methods uwsgi.websocket_recv_nb and redis.get_message combined with gevent.sleep
Run with --gevent 100 --gevent-monkey-patch
Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
❌Connection usually fails within 0-2 minutes

.

uWSGI master (f74553dbf56dcac168455cb0705a3533be94cf01)
gevent.spawn and request_context loosely based on tests/websockets_chat_2.py
Use blocking methods uwsgi.websocket_recv and redis.listen and hope that gevent monkeypatching prevents any actual blocks
Run with --gevent 100 --gevent-monkey-patch
Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
✔ Connection seems to survive but is it truly async?

.

Combine the two previous tests that used gevent.spawn and request_context
"Blocking" uwsgi.websocket_recv + gevent monkeypatch with non-blocking redis.get_message + gevent.sleep
✔ Connection seems to survive but is it truly async?
Non-blocking uwsgi.websocket_recv_nb + gevent.sleep with "blocking" redis.listen + gevent monkeypatch
❌Connection usually fails within 5-15 minutes
Something seems to be wrong with websocket_recv_nb.

kylemacfarlane commented 6 years ago

Bonus: this seems to have also fixed #1602 (which may be the same as #1623).

1602 is about sending single large responses over `websocket_send`. At first glance that seems like the exact opposite problem of receiving multiple small requests but it's easy to see how it can become the same.

A large response that takes a long time to send is more likely to overlap with other requests or responses. Thus again it's probably not heavy load per se but multiple messages being received or sent at almost the same time which causes all these bugs.

kylemacfarlane commented 6 years ago

I noticed in #1602 I said it only happens behind nginx. That's not true for this bug (#1716) which happens with and without nginx.

marc1n commented 5 years ago

In above (failed) tests did you run uWSGI with --http, --http-socket or --socket?

kylemacfarlane commented 5 years ago

In the tests I was hitting uWSGI directly so it would most likely have been --http.

In production it's behind nginx and uses --socket.

The problem appears in both places when using websocket_recv_nb.

marc1n commented 5 years ago

How did you run the tests? Could you share a testing script?

kylemacfarlane commented 5 years ago

The problem is at least partially repeatable with the demos in the tests folder.

Using this demo: https://github.com/unbit/uwsgi/blob/master/tests/websockets_chat_async.py

First you need to update it for current versions of uwsgi (change websocket_handshake) and then add a setInterval to repeatedly send the message: https://gist.github.com/kylemacfarlane/841efeec92d63d2396dc0cab0d2a4e85

Note that for this demo I had to set the interval to 5ms / 200rps whereas in my real project 50ms / 20rps is enough. Probably something to do with message sizes and/or network latency.

Run it with:

uwsgi --shared-socket 0.0.0.0:443, --https =0,cert,key --http-websockets --wsgi-file websockets_chat_async.py --async 100 --ugreen

Access the page and type something in the message box so that it starts sending. Within a few minutes (perhaps even seconds) the websocket connection will drop with Bad file descriptor either from wait_fd_read or uwsgi_response_write_body_do.

marc1n commented 5 years ago

I am interested in gevent version so I run server app: uwsgi --http :12345 --http-websockets --wsgi-file websockets_chat.py --gevent-monkey-patch --gevent 100 and in Chrome browser in console I run: setInterval(invia, 5)

I don't observe any errors after some minutes of running this test (over 10000 websocket messages so far)...

marc1n commented 5 years ago

I run this version of uWSGI:

*** Starting uWSGI 2.0.17.1 (64bit) on [Mon Nov 19 10:16:02 2018] ***
compiled with version: 5.4.0 20160609 on 06 September 2018 13:40:59
os: Linux-4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018

kylemacfarlane commented 5 years ago

gevent.select on stable is the hardest method to make fail. In the tests above I noted that it takes 5-15 minutes at 20rps over the real network so in a simple test on the same computer you'd probably have to leave it running for ages.

unbit / uwsgi

websocket_recv_nb fails under heavy load but websocket_recv is fine #1716

First Symptom

Second Symptom

Third Symptom

Other Errors

Testing

1602 is about sending single large responses over `websocket_send`. At first glance that seems like the exact opposite problem of receiving multiple small requests but it's easy to see how it can become the same.

unbit / uwsgi

websocket_recv_nb fails under heavy load but websocket_recv is fine #1716

First Symptom

Second Symptom

Third Symptom

Other Errors

Testing

1602 is about sending single large responses over websocket_send. At first glance that seems like the exact opposite problem of receiving multiple small requests but it's easy to see how it can become the same.

1602 is about sending single large responses over `websocket_send`. At first glance that seems like the exact opposite problem of receiving multiple small requests but it's easy to see how it can become the same.