keegancsmith closed this 8 years ago
I think the performance regression is due to spinning up a new goroutine per broadcast per client. Goroutines are cheap, but not free.
There are a few idiomatic Go approaches.
One would be to use the producer/consumer model. We could spin up N worker goroutines that read off a channel of outbound messages.
Another would be to have one persistent goroutine per connection for the outbound send. This would be connected via channel.
We could also use aggressive write timeouts to alleviate the slow client problem, though that doesn't parallelize the sending of a broadcast.
> I think the performance regression is due to spinning up a new goroutine per broadcast per client. Goroutines are cheap, but not free.
Yes, I agree; this is almost certainly the reason. The cost of the goroutines might not be as pronounced if we ran this test over a real network instead of localhost, which would likely make this perform much better than the older implementation.
> One would be to use the producer/consumer model. We could spin up N worker goroutines that read off a channel of outbound messages.
This is probably the easiest to implement and should perform well. I'd be interested to see whether it improves localhost performance, given that we introduce some extra synchronisation (consuming the channel).
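A minimal sketch of that producer/consumer shape, with the names (`broadcast`, `outbound`) and the integer stand-in for a connection being hypothetical; a real worker would call `websocket.Message.Send` where the comment indicates:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// broadcast fans one payload out to nConns connections using
// nWorkers consumer goroutines reading off a shared channel,
// and returns how many sends completed.
func broadcast(nWorkers, nConns int, payload []byte) int {
	type outbound struct {
		conn int // stands in for a real websocket connection
		data []byte
	}
	msgs := make(chan outbound)
	var sent int64
	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range msgs {
				// real code: websocket.Message.Send(m.conn, m.data)
				_ = m
				atomic.AddInt64(&sent, 1)
			}
		}()
	}
	for c := 0; c < nConns; c++ {
		msgs <- outbound{conn: c, data: payload}
	}
	close(msgs)
	wg.Wait()
	return int(sent)
}

func main() {
	fmt.Println(broadcast(4, 10, []byte("payload"))) // 10
}
```

The worker count caps goroutines per broadcast at N instead of one per client, at the cost of the channel synchronisation mentioned above.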
> Another would be to have one persistent goroutine per connection for the outbound send. This would be connected via channel.
There are some things to consider for that:
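As a sketch of that per-connection pattern (the names `conn`, `newConn`, and `broadcast` are hypothetical; a real sender would call `websocket.Message.Send` in the drain loop):

```go
package main

import "fmt"

// conn owns a buffered outbound channel drained by a single
// persistent goroutine, decoupling broadcast from the send.
type conn struct {
	send chan []byte
	done chan int // reports how many messages were written
}

func newConn() *conn {
	c := &conn{send: make(chan []byte, 16), done: make(chan int)}
	go func() {
		n := 0
		for msg := range c.send {
			// real code: websocket.Message.Send(ws, msg)
			_ = msg
			n++
		}
		c.done <- n
	}()
	return c
}

// broadcast hands the payload to every connection without
// blocking; a full buffer means the client is too slow and
// the message is dropped for that client.
func broadcast(conns []*conn, payload []byte) {
	for _, c := range conns {
		select {
		case c.send <- payload:
		default:
		}
	}
}

func main() {
	conns := []*conn{newConn(), newConn(), newConn()}
	broadcast(conns, []byte("hello"))
	total := 0
	for _, c := range conns {
		close(c.send)
		total += <-c.done
	}
	fmt.Println(total) // 3
}
```

Among the things to consider here: the buffer size bounds memory per client, and dropping on a full buffer trades delivery guarantees for broadcast latency.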
You can also avoid repeatedly marshaling the JSON data:
msg := &WsMsg{Type: "broadcast", Payload: payload}
data, err := json.Marshal(msg) // marshal once, not per connection
if err != nil {
    return
}
h.mutex.RLock()
for c := range h.conns {
    _ = websocket.Message.Send(c, data) // a failed send could drop the client
}
h.mutex.RUnlock()
@jboelter will you make a PR with this change?
Btw, copying all connection pointers from the map into a slice (under the lock) and then sending the data after unlocking might also speed things up.
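That copy-then-send idea could look roughly like this; the `hub` type and integer connection stand-in are hypothetical, with `websocket.Message.Send` going where the comment marks:

```go
package main

import (
	"fmt"
	"sync"
)

type hub struct {
	mutex sync.RWMutex
	conns map[int]bool // int stands in for a websocket connection
}

// broadcast snapshots the connection set under the read lock,
// then sends after unlocking, so a slow send never holds the
// lock and blocks connects/disconnects. Returns the send count.
func (h *hub) broadcast(data []byte) int {
	h.mutex.RLock()
	targets := make([]int, 0, len(h.conns))
	for c := range h.conns {
		targets = append(targets, c)
	}
	h.mutex.RUnlock()

	for _, c := range targets {
		_ = c // real code: websocket.Message.Send(c, data)
	}
	return len(targets)
}

func main() {
	h := &hub{conns: map[int]bool{1: true, 2: true, 3: true}}
	fmt.Println(h.broadcast([]byte("payload"))) // 3
}
```

The lock is held only for the map copy, so sends to a connection removed after the snapshot must tolerate failure.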
Closing out since there is nothing to merge here.
Note: this actually seems to make it slower! Do not merge; submitting it as another data point.
This should help improve Go's performance in the benchmark. The only "ugly" pattern it introduces is an atomic integer increment, but that ends up simpler and more performant than the alternatives.
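The comment does not show where the counter is used, so this only illustrates the atomic-increment pattern itself, which stays correct under concurrent goroutines without a mutex:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// count increments a shared counter from n goroutines using
// sync/atomic instead of a mutex; every increment is retained.
func count(n int) int64 {
	var total int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&total, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&total)
}

func main() {
	fmt.Println(count(100)) // 100
}
```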
Benchmark on localhost: