gorilla / websocket

Package gorilla/websocket is a fast, well-tested and widely used WebSocket implementation for Go.
https://gorilla.github.io
BSD 2-Clause "Simplified" License
22.4k stars 3.48k forks source link

[BUG] Clients get disconnected due to SetWriteDeadline and large workload #885

Closed d-Rickyy-b closed 10 months ago

d-Rickyy-b commented 10 months ago

Is there an existing issue for this?

Current Behavior

Hi there and thank you for creating this library. I think this is not a regular bug report but rather a question.

Introduction

I am currently working on a project that sends out data to an arbitrary number of clients via websockets (https://github.com/d-Rickyy-b/certstream-server-go). I thought it is a good idea to put up a writeDeadline (SetWriteDeadline) for websocket writes, because I wouldn't want a client to block my server by spawning lots of connections without consuming the data. A WriteDeadline makes sure that writes don't take extremely long and hence don't take up system resources.

Problem description

I noticed that some clients get disconnected if they can't keep up with the pace of the server. My tool processes about 250/300 messages every second which it sends out to each client via their websocket connection. This seems to be because Writes are buffered and the deadline measures the time between a call to Write and the actual sending of the message on the network. With a large enough buffer and a slow enough client, this could happen very often for me. But I am randomly guessing here.

Internally I am working with channels so I added a buffer for each client to compensate for short network issues. In one part of my app I iterate over all my clients and push the data to their dedicated channel.

for _, c := range bm.clients {
    select {
        case c.broadcastChan <- data:
        default:
            // Default case is executed if the client's broadcast channel is full.
            c.skippedCerts++
        }
    }
}

more code

The client itself has a go routine running in the background that handles all incoming messages of a (buffered) channel.

func (c *client) broadcastHandler() {
    for message := range c.broadcastChan {
        c.conn.SetWriteDeadline(time.Now().Add(5 * time.Second))
        w, err := c.conn.NextWriter(websocket.TextMessage)
        // ...
        _, writeErr := w.Write(message)
        // ...
    }
}

more code

I limited the rate at wich my client processes messages to simulate a bad network connection or a slow programming language (like python). Then I noticed that my broadcastHandler processed hundreds of messages per second (1636 messages) while the actual websocket client only received 107 messages in the same time. This seems wrong but might be explained by the writeBuffer used in the websocket library.

I documented the issue here.

Questions

I appreciate your help and thoughts.

Best regards, Rico

d-Rickyy-b commented 10 months ago

The websocket connection code is "synchronous" in that writes to the websocket connection block until the message data is written to the operating system network connection.

Thanks that helps me understand the problem better. So I guess the OS happily fills its buffers with hundreds of messages until that buffer is full. Yet the deadline starts "counting" and after 5 seconds (in my case) where the application can't pass data to the OS, it "fails". I am still wondering why it takes 5 seconds (or longer) to pass the message to the OS. Maybe some weird OS logic in how buffers are handled?

Without the write deadline, a client can keep the connection alive without ever reading data.

Yeah, that's what I thought. I might want to change the behavior to actively send out pings from the server side.

Then I noticed that my broadcastHandler processed hundreds of messages per second (1636 messages) while the actual websocket client only received 107 messages in the same time. This seems wrong but might be explained by the writeBuffer used in the websocket library.

In this code:

select {
    case c.broadcastChan <- data:
    default:
        // Default case is executed if the client's broadcast channel is full.
        c.skippedCerts++
    }
}

data is dropped when the client's broadcast channel is full.

Maybe I did not explain that part well enough. Yes, that is expected. data should get dropped. To further clarify: The "skippedCerts" is not the same metric I used for measuring the "1636 messages" I mentioned in my original message. After playing around a little more I can confirm that it's actually a buffer of the websocket (or operating system as you said) that's holding all these messages. When I set a breakpoint at the right position I can see that messages are still being sent, even though there are no new conn.Write() calls.

If the client is slow reading the data, but not so slow as to trip the per message deadline set in broadcastHandler, then messages will be discarded without any errors reported.

That's my goal. I want them to be discarded in my code without tripping the deadline. Do you know the best approach for that?

d-Rickyy-b commented 10 months ago

If the goal is to discard messages sent faster than the client can receive the messages, then the broadcaster's select with default accomplishes that goal.

Well yes, but with the exception that still the writeDeadline is tripping. Each message is about 1500 bytes in size. I thought that within the 5 seconds margin there will be enough space in the buffers for the OS to accept new ones.

The five second deadline may be too early

Yeah, that's probably the case. I tried with longer deadlines. For 60 seconds I did not have any issues so far, but with 10 seconds I also ran into problems with the deadline tripping.

Do you have evidence that the write deadline is tripped? The broadcast handler ignores errors.

Yes. Locally I have error checking in place (actually I just pushed the changes).

2023/12/16 21:04:24 client.go:62: Error while closing: write tcp 127.0.0.1:8080->127.0.0.1:56117: i/o timeout
2023/12/16 21:04:24 client.go:40: Closing broadcast handler for client: 127.0.0.1:56117
2023/12/16 21:04:24 client.go:116: Disconnecting client 127.0.0.1:56117!

the broadcast handler should handle errors by closing the network connection

I noticed that when I dived down the rabbit hole of weird issues with clients not receiving messages any more and changed that already, but thanks for noticing :)

d-Rickyy-b commented 10 months ago

Even with a 30 seconds deadline the deadline is tripped after about 10 minutes or so. When using a 60 second timeout, i did still did not come across any issues so far.

d-Rickyy-b commented 10 months ago

Other than the issue with the ping and the read deadline, the code looks good. Is there anything else to resolve?

No thank you very much for your time and the details. It really helped. I will set the writeDeadline to something like 60, which seemed to work flawlessly.