nicolasff / webdis

A Redis HTTP interface with JSON output
https://webd.is
BSD 2-Clause "Simplified" License
2.82k stars 307 forks source link

WebSocket improvements #199

Closed jessie-murray closed 2 years ago

jessie-murray commented 2 years ago

As discussed earlier, here are some tests for the WebSocket feature and several changes that I had to make to get them to pass.

I started with the websockets library as suggested, but it was highly dependent on asyncio and had some expectations that I didn't think followed the WS RFCs. For example, it expects the Sec-WebSocket-Accept header to come immediately after Upgrade and Connection, even though I could not find any requirement saying this had to be the case. I still made this change in case some Webdis clients use this library, but for the tests themselves I found it easier to use websocket-client instead.

The changes:

  1. Change order of headers
  2. Make Origin and Sec-WebSocket-Origin optional, as this is not required for non-browser clients.
  3. I noticed that calling ws.close() in the test would take about 3 seconds to return, so called close() on the server side when we detect it.
  4. These disconnections were logged at a high level so I changed that too.
  5. There were a few bugs in the reporting thread added at the last minute to tests/websocket.c, these are fixed now.
  6. I added a new WS-only test that uses JSON and "raw" commands.

I started adding those to build.yml for the GitHub Action, but I didn't know if it would be ok since the feature is disabled by default so I left them out for now.

jessie-murray commented 2 years ago

I added a pub-sub test as discussed in the previous pull request, and unfortunately it fails often due to the order of received messages not matching the order in which they were sent. There is one WS subscriber listening to 2 channels and one WS publisher writing 8 messages to each channel, and the messages are often received out of order.

I left the test disabled by default but it can be turned on with PUBSUB=1:

PUBSUB=1 ./ws-tests.py
..received: {'SUBSCRIBE': ['message', '94e58cf3-e45c-4c9f-8478-3aeb818532cc', 'message-0']}
received: {'SUBSCRIBE': ['message', '40f5db94-72ea-445a-9823-3eeb6f629a24', 'message-0']}
received: {'SUBSCRIBE': ['message', '94e58cf3-e45c-4c9f-8478-3aeb818532cc', 'message-1']}
received: {'SUBSCRIBE': ['message', '40f5db94-72ea-445a-9823-3eeb6f629a24', 'message-1']}
received: {'SUBSCRIBE': ['message', '94e58cf3-e45c-4c9f-8478-3aeb818532cc', 'message-2']}
received: {'SUBSCRIBE': ['message', '40f5db94-72ea-445a-9823-3eeb6f629a24', 'message-2']}
received: {'SUBSCRIBE': ['message', '94e58cf3-e45c-4c9f-8478-3aeb818532cc', 'message-3']}
received: {'SUBSCRIBE': ['message', '40f5db94-72ea-445a-9823-3eeb6f629a24', 'message-3']}
received: {'SUBSCRIBE': ['message', '94e58cf3-e45c-4c9f-8478-3aeb818532cc', 'message-4']}
received: {'SUBSCRIBE': ['message', '40f5db94-72ea-445a-9823-3eeb6f629a24', 'message-5']}
received: {'SUBSCRIBE': ['message', '94e58cf3-e45c-4c9f-8478-3aeb818532cc', 'message-5']}
received: {'SUBSCRIBE': ['message', '40f5db94-72ea-445a-9823-3eeb6f629a24', 'message-4']}
received: {'SUBSCRIBE': ['message', '94e58cf3-e45c-4c9f-8478-3aeb818532cc', 'message-7']}
received: {'SUBSCRIBE': ['message', '40f5db94-72ea-445a-9823-3eeb6f629a24', 'message-6']}
received: {'SUBSCRIBE': ['message', '94e58cf3-e45c-4c9f-8478-3aeb818532cc', 'message-6']}
received: {'SUBSCRIBE': ['message', '40f5db94-72ea-445a-9823-3eeb6f629a24', 'message-7']}
F..
======================================================================
FAIL: test_publish_subscribe (__main__.TestPubSub)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./ws-tests.py", line 161, in test_publish_subscribe
    self.assertEqual(received_per_channel[channel][i], expected,
AssertionError: 'message-5' != 'message-4'
- message-5
?         ^
+ message-4
?         ^
 : In 40f5db94-72ea-445a-9823-3eeb6f629a24: expected at offset 4 was "message-4", actual was: "message-5"

----------------------------------------------------------------------
Ran 5 tests in 0.023s

FAILED (failures=1)

The test also passes often, I don't know exactly how often it fails but it's a lot, maybe 20-50% of the time in my attempts. I've always seen it propagate all messages tho, so at least they are not dropped.

Both WebSockets and pub-sub have custom code that's different than regular command processing in Webdis, so tracking this down could take some time. I haven't looked at the pub-sub code yet.

nicolasff commented 2 years ago

hmm yes this test makes the issue pretty obvious. I don't remember enough about this code to have even a hunch of what might be happening, although I suspect this has to do with the way events are queued separately using libevent (one for message 4, then one for message 5, but then libevent processes the second one first). If that's the case it could "just" be solved by keeping a list of frames to send and appending to it. I find it surprising that messages would be re-ordered and not just dropped by an overwrite (e.g. message-4 comes from redis, is wrapped into a frame and added to the http_client, then message-5 overwrites it). This would be a more likely bug, but clearly this isn't the case here.

This shows that the WS feature is not quite ready to be enabled by default yet.

jessie-murray commented 2 years ago

The main problem with WebSockets in Webdis is the use of events in client.c (using worker_monitor_input etc) which makes it difficult to separate WebSocket handling from regular HTTP. These new commits introduce a new ws_client object which is attached to the http_client, transferring all the processing of client events to websocket.c. One major issue I ran into was the duplicate registering of events in libevents, which can make them trigger in an infinite loop which uses a lot of CPU. Now ws_client uses event_base_once to register events and two fields to keep track of whether an event is already registered or not (scheduled_read and scheduled_write). Each ws_client also has its own persistent redisAsyncContext pointer instead of blocking a pool connection. During testing I also noticed the Python websocket-client library sending "connection close" frames so I added support for those, and also for "ping" frames.

For testing I ran ws-tests.py with PUBSUB=1 and used we websocket client in tests. Valgrind does not report any leaks.

There are still at least two more changes left to add before I would consider this stable: the "new" Redis connection needs to have called back before the WS messages can be processed, and the HTML demo should have a pub/sub section.

jessie-murray commented 2 years ago

Last commits for today: Webdis is now logging commands coming from WebSocket clients, a bug in pool_on_disconnect is fixed, Redis errors are sent to WebSocket clients, and the HTML demo now has a pub/sub example. I think that's all for this pull request, which is already too long. @nicolasff let me know if you feel like doing a walkthrough on Tuple again.

jessie-murray commented 2 years ago

I changed the calls to ntohll and htonll in the last commit, these are not portable. I've replaced them with two calls to ntohl/htonl which is more readable than the previous macros and should work everywhere.

Test Plan:

All of these succeeded with a simple make clean all && ./webdis.

I also ran them under Valgrind, and although they all pass, there was still a part of the Valgrind output that I found a bit concerning (the "still reachable" section):

==10== LEAK SUMMARY:
==10==    definitely lost: 0 bytes in 0 blocks
==10==    indirectly lost: 0 bytes in 0 blocks
==10==      possibly lost: 1,232 bytes in 20 blocks
==10==    still reachable: 19,174 bytes in 115 blocks
==10==         suppressed: 0 bytes in 0 blocks

It looks like the memory reported as still reachable is due to the Redis connections created by redisAsyncConnect for each new WebSocket client. I first tracked all calls to ws_client_new and ws_client_free and these were called in the expected order. I also added some logging at the main allocation/free points in hiredis and saw all of the memory being freed, so my best guess is that these are due to some internal data structures of libevent that may have grown at runtime but are not actually still usable or still pointing to memory that has clearly been freed.

As I was exploring the hiredis code I noticed that the version embedded in Webdis is pretty old, and the recent hiredis code looks significantly different. It might be worth looking into upgrading at some point.

nicolasff commented 2 years ago

Thanks for the details! I ran through a similar test plan and more, also testing it on ARM with no issues (ARM Cortex-A57, little endian like all the machines I have access too but often different enough to surface portability issues).

I had actually tried upgrading hiredis a few months back and gave up due to incompatibility issues I never managed to figure out. I might give it a shot again soon.

I've been through these changes several times now and no longer have any suggestions (thanks for the updates!). With websockets still disabled by default I think this is safe to merge. I'll try coming up with a checklist of what remains before this feature can be turned on by default, I think that means at least:

  1. Upgrading hiredis
  2. Consolidating the tests to run all of them with a single command
  3. Running the tests while webdis runs under valgrind
  4. Adding docs

And probably more.

Thanks again for this PR, this looks great.