keegancsmith closed this 8 years ago
I think the performance regression is due to spinning up a new goroutine per broadcast per client. Goroutines are cheap, but not free.
There are a few idiomatic Go approaches.
One would be to use the producer/consumer model. We could spin up N worker goroutines that read off a channel of outbound messages.
Another would be to have one persistent goroutine per connection for the outbound send. This would be connected via channel.
We could also use aggressive write timeouts to alleviate the slow client problem, though that doesn't parallelize the sending of a broadcast.
> I think the performance regression is due to spinning up a new goroutine per broadcast per client. Goroutines are cheap, but not free.
Yes, I agree; this is almost certainly the reason. The cost of the goroutines might not be as pronounced if we ran this test over a real network instead of localhost, which would likely make this perform much better than the older implementation.
> One would be to use the producer/consumer model. We could spin up N worker goroutines that read off a channel of outbound messages.
This is probably the easiest to implement and should perform well. I'd be interested to see whether it improves localhost performance, given that we introduce some extra synchronisation (consuming the channel).
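A minimal sketch of that producer/consumer shape, with the names (`broadcast`, `outbound`) and the integer stand-in for a connection being hypothetical; a real worker would call `websocket.Message.Send` where the comment indicates:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// broadcast fans one payload out to nConns connections using
// nWorkers consumer goroutines reading off a shared channel,
// and returns how many sends completed.
func broadcast(nWorkers, nConns int, payload []byte) int {
	type outbound struct {
		conn int // stands in for a real websocket connection
		data []byte
	}
	msgs := make(chan outbound)
	var sent int64
	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range msgs {
				// real code: websocket.Message.Send(m.conn, m.data)
				_ = m
				atomic.AddInt64(&sent, 1)
			}
		}()
	}
	for c := 0; c < nConns; c++ {
		msgs <- outbound{conn: c, data: payload}
	}
	close(msgs)
	wg.Wait()
	return int(sent)
}

func main() {
	fmt.Println(broadcast(4, 10, []byte("payload"))) // 10
}
```

The worker count caps goroutines per broadcast at N instead of one per client, at the cost of the channel synchronisation mentioned above.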
> Another would be to have one persistent goroutine per connection for the outbound send. This would be connected via channel.
There are some things to consider for that:
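As a sketch of that per-connection pattern (the names `conn`, `newConn`, and `broadcast` are hypothetical; a real sender would call `websocket.Message.Send` in the drain loop):

```go
package main

import "fmt"

// conn owns a buffered outbound channel drained by a single
// persistent goroutine, decoupling broadcast from the send.
type conn struct {
	send chan []byte
	done chan int // reports how many messages were written
}

func newConn() *conn {
	c := &conn{send: make(chan []byte, 16), done: make(chan int)}
	go func() {
		n := 0
		for msg := range c.send {
			// real code: websocket.Message.Send(ws, msg)
			_ = msg
			n++
		}
		c.done <- n
	}()
	return c
}

// broadcast hands the payload to every connection without
// blocking; a full buffer means the client is too slow and
// the message is dropped for that client.
func broadcast(conns []*conn, payload []byte) {
	for _, c := range conns {
		select {
		case c.send <- payload:
		default:
		}
	}
}

func main() {
	conns := []*conn{newConn(), newConn(), newConn()}
	broadcast(conns, []byte("hello"))
	total := 0
	for _, c := range conns {
		close(c.send)
		total += <-c.done
	}
	fmt.Println(total) // 3
}
```

Among the things to consider here: the buffer size bounds memory per client, and dropping on a full buffer trades delivery guarantees for broadcast latency.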
You can also avoid repeatedly marshaling the JSON data:
msg := &WsMsg{Type: "broadcast", Payload: payload}
data, err := json.Marshal(msg) // marshal once, not per connection
if err != nil {
    return
}
h.mutex.RLock()
for c := range h.conns {
    _ = websocket.Message.Send(c, data) // a failed send could drop the client
}
h.mutex.RUnlock()
@jboelter will you make a PR with this change?
Btw, copying all connection pointers from the map into a slice (under the lock) and then sending the data after unlocking might also speed things up.
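That copy-then-send idea could look roughly like this; the `hub` type and integer connection stand-in are hypothetical, with `websocket.Message.Send` going where the comment marks:

```go
package main

import (
	"fmt"
	"sync"
)

type hub struct {
	mutex sync.RWMutex
	conns map[int]bool // int stands in for a websocket connection
}

// broadcast snapshots the connection set under the read lock,
// then sends after unlocking, so a slow send never holds the
// lock and blocks connects/disconnects. Returns the send count.
func (h *hub) broadcast(data []byte) int {
	h.mutex.RLock()
	targets := make([]int, 0, len(h.conns))
	for c := range h.conns {
		targets = append(targets, c)
	}
	h.mutex.RUnlock()

	for _, c := range targets {
		_ = c // real code: websocket.Message.Send(c, data)
	}
	return len(targets)
}

func main() {
	h := &hub{conns: map[int]bool{1: true, 2: true, 3: true}}
	fmt.Println(h.broadcast([]byte("payload"))) // 3
}
```

The lock is held only for the map copy, so sends to a connection removed after the snapshot must tolerate failure.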
Closing out since there is nothing to merge here.
Note: this actually seems to make it slower! Do not merge; submitting it as another data point.
This should help improve Go's performance in the benchmark. The only "ugly" pattern it introduces is an atomic integer increment, but that ends up simpler and more performant than the alternatives.
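The comment does not show where the counter is used, so this only illustrates the atomic-increment pattern itself, which stays correct under concurrent goroutines without a mutex:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// count increments a shared counter from n goroutines using
// sync/atomic instead of a mutex; every increment is retained.
func count(n int) int64 {
	var total int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&total, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&total)
}

func main() {
	fmt.Println(count(100)) // 100
}
```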
Benchmark on localhost: