miguelgrinberg / simple-websocket

Simple WebSocket server and client for Python.
MIT License
78 stars 17 forks source link

next_ping time can become invalid for high volume data #16

Closed Mitch-Martinez closed 2 years ago

Mitch-Martinez commented 2 years ago

I have recently had an issue where my client connection closes after finishing sending a very high volume of data to a simple-websockets server.

I have identified that the ping packets do not get sent during this period of high volume data, this is expected and I think that this just helped me identify where the issue is. For example, say I have my ping interval set to 5 seconds. Then a client connects and then immediately starts sending data continuously for 30 seconds. You would expect no ping packets sent in that entire 30 seconds. After the client finished sending, the server disconnected it because it did not send the pong back fast enough. Usually the client would have the time set by ping_interval to pong back to the server, but because no pings were sent for 30 seconds the next_ping time is actually 25-30s behind and it causes the sel.select(next_ping - now) call to return None almost immediately if the client is not fast enough to respond to the ping.

The next_ping value usually never gets backed up like this but in the above situation, lower latency networks do not have the chance to pong back.

next_ping should not be incremented by the ping_interval but should instead be assigned to now + self.ping_interval to always ensure that the next_ping is always expected to be in the future.

miguelgrinberg commented 2 years ago

I have identified that the ping packets do not get sent during this period of high volume data, this is expected

Why is this expected? Pings are sent from a background thread, so they should happen regardless of the traffic. Aren't you confusing pings with pongs? The server sends pings, the client responds with pongs. Maybe you are saying that your client is unable to send pongs because it is busy sending regular data to the server?

But in any case, I do agree that the next_ping calculation would be more robust if it is always based on the current time. Would you like to send a PR with this change, or do you prefer that I make it myself?

Mitch-Martinez commented 2 years ago

Hmm, well I may just have completely misunderstood when pings should be sent from the server in the _thread(self) method, I assumed with the way it was written that pings would be sent on only two conditions because of this call if not sel.select(next_ping - now)

  1. The ping_interval had been reached and
  2. There is not data to be read from the socket.

I was planning on doing the PR but my computer with the repo on it is acting up so yes I would appreciate if you could make this change yourself, I apologize for not having a PR ready to review.

miguelgrinberg commented 2 years ago

Okay, I'll review the logic in the thread. The intention was that pings should be inserted at the correct time or as closely as possible, regardless of traffic.

Mitch-Martinez commented 2 years ago

I put in some print statements every time a ping was sent and saw that they were pretty much getting passed over because there was always something to be read from the socket even if the timeout had been reached and select returned immediately.

When can I expect a new release of simple-websockets? I'm using flask-sock for a server and it would be beneficial to me to be able to get an updated version of simple-websockets to avoid patching the code. This really is an edge case but I would expect it to come up every once in a while.

Thanks for your help and knowledge on all of this.

EDIT: Adding some info to this. After I started actually monitoring the pings I have noticed that sometimes the pings happen when they should and sometimes they are "passed over" like I explained earlier. I guess it mostly just depends on how active that socket is and it can get into a case where the pings are not sent like they are expected.

miguelgrinberg commented 2 years ago

@Mitch-Martinez you were right, there was a small logic bug, pings could be delayed until the input buffer was empty. I have now corrected it so that pings are emitted even if the other side is sending constantly.

Can I ask you to install the main branch of this repo and test on your side to confirm everything works correctly now? Thanks.

Mitch-Martinez commented 2 years ago

Just finished testing, seems to work as expected. Pings are sent at the expected interval and the next_interval has been corrected. Thank you for taking a closer look at all of this, I appreciate your time on it.

Mitch-Martinez commented 2 years ago

Hmm seems I didn't do thorough enough testing, sometimes I'm seeing it close the connection after sending a ping. Let me make sure it's not an issue with something I am doing.

Mitch-Martinez commented 2 years ago

Unfortunately there is still an issue brought on by my first suggestion of using next_ping = now + self.ping_interval. This can cause unwanted issues if sel.select(next_ping - now) does timeout. it should instead probably be next_ping = time() + self.ping_interval

So in the case of sending no data it has the opportunity to still race the client and close if it does not receive a pong in enough time.

For example I put in some print statements to keep track of the next_ping, now, and time() values. These were placed before checking if the ping was received on line 144 of ws.py

if next_ping <= now or not sel.select(next_ping - now):
    print(f"next_ping {next_ping}")
    print(f"now       {next_ping}")
    print(f"time()    {next_ping}")

What I noticed was below ... (ping is sent after timeout since no traffic has been sent)

next_ping 1658253777.4230623
now       1658253752.4308243
time()    1658253777.4500685

... (The ping pong logic is triggered again almost immediately)

next_ping 1658253777.4308243
now       1658253777.4509404
time()    1658253777.4510117

... (Connection closed because no pong was received)

So there's still a way to get invalid next_ping values when the call to select blocks. I apologize for suggesting an incorrect solution.

miguelgrinberg commented 2 years ago

Okay, yeah, the current solution actually works well when the system is busy sending lots of messages, but using now breaks when there is not much activity. I was trying to avoid calling the OS again to get the time, but maybe that's the correct solution anyway.

Maybe this?

next_ping = max(now, next_ping) + self.ping_interval
miguelgrinberg commented 2 years ago

@Mitch-Martinez Please retest with current main branch code. Thanks!

miguelgrinberg commented 2 years ago

This is now released in v0.7.0. If there are any remaining issues please open a new ticket.