lesismal / nbio

Pure Go 1000k+ connections solution; supports tls/http1.x/websocket and is basically compatible with net/http. High performance, low memory cost, non-blocking, event-driven, easy to use.
MIT License

perf: improve the performance of sending data #434

Closed · limpo1989 closed this 1 month ago

limpo1989 commented 1 month ago

When data is read, there is a high probability that a write-back is generated inside OnData. In that case there is no need to wait for the registered write-event callback; performing the write immediately can improve the performance of sending data.
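
A minimal sketch of the idea, assuming a simplified non-blocking connection (hypothetical names, not nbio's actual internals): when OnData produces a response, try the write right away, and queue only what the kernel refuses to take for the registered write-event callback.

```go
// Hypothetical fast path: write immediately from OnData instead of waiting
// for the next write-event callback; queue only the part the kernel rejects.
package main

import "syscall"

type conn struct {
	fd        int
	writeList [][]byte // data still waiting for a writable event
}

// flushFast is a hypothetical helper, not nbio's API.
func (c *conn) flushFast(b []byte) error {
	n, err := syscall.Write(c.fd, b)
	if n < 0 {
		n = 0
	}
	switch {
	case err == nil && n == len(b):
		return nil // fully written, no write event needed
	case err == nil || err == syscall.EAGAIN:
		// Partial write or EAGAIN: queue the remainder and register the
		// write event (registration itself is omitted in this sketch).
		c.writeList = append(c.writeList, b[n:])
		return nil
	default:
		return err
	}
}
```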

The benchmark uses https://github.com/lesismal/go-websocket-benchmark

BEFORE

----------------------------------------------------------------------------------------------------
20240523 22:34.41.956 [BenchEcho] Report

|    Framework     |  TPS   |  EER   |   Min   |   Avg   |   Max    |  TP50   |  TP75   |  TP90   |  TP95   |  TP99   | Used  |  Total  | Success | Failed | Conns | Concurrency | Payload | CPU Min | CPU Avg | CPU Max | MEM Min | MEM Avg | MEM Max |
|     ---          |  ---   |  ---   |   ---   |   ---   |   ---    |   ---   |   ---   |   ---   |   ---   |   ---   |  ---  |   ---   |   ---   |  ---   |  ---  |     ---     |   ---   |   ---   |   ---   |   ---   |   ---   |   ---   |   ---   |
| nbio_nonblocking | 687201 | 909.76 | 29.69us | 14.51ms | 183.09ms | 12.32ms | 14.93ms | 20.63ms | 22.74ms | 63.07ms | 2.91s | 2000000 | 2000000 |   0    | 10000 |    10000    |  1024   |  0.00   | 755.36  | 1189.81 | 76.83M  | 80.48M  | 84.12M  |
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
20240523 22:34.41.962 [BenchRate] Report

|    Framework     | Duration | EchoEER | Packet Sent | Bytes Sent | Packet Recv | Bytes Recv | Conns | SendRate | Payload | CPU Min | CPU Avg | CPU Max | MEM Min | MEM Avg | MEM Max |
|     ---          |   ---    |   ---   |     ---     |    ---     |     ---     |    ---     |  ---  |   ---    |   ---   |   ---   |   ---   |   ---   |   ---   |   ---   |   ---   |
| nbio_nonblocking |  10.00s  | 1575.93 |  18275930   |   17.43G   |  18275930   |   17.43G   | 10000 |   200    |  1024   |  0.00   | 1159.69 | 1327.80 | 124.71M | 137.38M | 152.60M |
----------------------------------------------------------------------------------------------------

AFTER

----------------------------------------------------------------------------------------------------
20240523 22:39.11.089 [BenchEcho] Report

|    Framework     |  TPS   |  EER   |   Min   |   Avg   |   Max    |  TP50   |  TP75   |  TP90   |  TP95   |  TP99   | Used  |  Total  | Success | Failed | Conns | Concurrency | Payload | CPU Min | CPU Avg | CPU Max | MEM Min | MEM Avg | MEM Max |
|     ---          |  ---   |  ---   |   ---   |   ---   |   ---    |   ---   |   ---   |   ---   |   ---   |   ---   |  ---  |   ---   |   ---   |  ---   |  ---  |     ---     |   ---   |   ---   |   ---   |   ---   |   ---   |   ---   |   ---   |
| nbio_nonblocking | 707134 | 968.17 | 23.30us | 14.09ms | 101.21ms | 12.52ms | 15.67ms | 20.39ms | 21.32ms | 23.33ms | 2.83s | 2000000 | 2000000 |   0    | 10000 |    10000    |  1024   |  0.00   | 730.39  | 1209.78 | 91.67M  | 94.35M  | 97.02M  |
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
20240523 22:39.11.096 [BenchRate] Report

|    Framework     | Duration | EchoEER | Packet Sent | Bytes Sent | Packet Recv | Bytes Recv | Conns | SendRate | Payload | CPU Min | CPU Avg | CPU Max | MEM Min | MEM Avg | MEM Max |
|     ---          |   ---    |   ---   |     ---     |    ---     |     ---     |    ---     |  ---  |   ---    |   ---   |   ---   |   ---   |   ---   |   ---   |   ---   |   ---   |
| nbio_nonblocking |  10.00s  | 1419.24 |  18467060   |   17.61G   |  18375638   |   17.52G   | 10000 |   200    |  1024   | 1279.64 | 1294.75 | 1308.90 | 136.39M | 145.51M | 151.48M |
----------------------------------------------------------------------------------------------------
lesismal commented 1 month ago

OnData usually handles parsing. When we get a frame/message in OnData, we handle it asynchronously in another goroutine, which may not have finished writing by the time OnData for the next fd is called in the event loop.
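
For illustration, a rough shape of that pattern using nbio's public OnData/Write API (the handler body and handleMessage are assumptions, not code from this repo):

```go
package main

import "github.com/lesismal/nbio"

// handleMessage stands in for application-level message handling.
func handleMessage(msg []byte) []byte { return msg }

func setup(engine *nbio.Engine) {
	engine.OnData(func(c *nbio.Conn, data []byte) {
		msg := append([]byte(nil), data...) // copy before handing off; the loop may reuse data
		go func() {
			// This Write may still be in progress when the event loop has
			// already moved on to OnData for the next fd.
			c.Write(handleMessage(msg))
		}()
	})
}
```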

Consider the two cases (see the sketch after this list):

  1. len(writeList) == 0: this fast flush doesn't cost much;
  2. len(writeList) > 0: because the new write may not have finished, as explained above, the queued data may be old data from a previous write. The write event was already checked at the beginning of this loop iteration; since there is still data waiting to be sent, the fd is still not writable. If we flush anyway, we only waste more syscalls.
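
A rough sketch of those two cases, continuing the hypothetical conn/flushFast from the earlier sketch (again, not nbio's actual code): the direct write is attempted only when nothing is queued; otherwise the data is appended and left to the write-event callback, which drains the queue once the kernel reports the fd writable again.

```go
// Hypothetical decision in the data path (case 1 vs case 2).
func (c *conn) queueOrFlush(b []byte) {
	if len(c.writeList) > 0 {
		// Case 2: older data is still queued, so the fd was not writable
		// when the write event was checked at the start of this loop
		// iteration; a direct write here would just be a wasted syscall.
		c.writeList = append(c.writeList, b)
		return
	}
	// Case 1: nothing queued, the fast flush is cheap.
	_ = c.flushFast(b)
}

// Hypothetical write-event callback: the fd became writable, drain the queue.
func (c *conn) onWritable() {
	for len(c.writeList) > 0 {
		b := c.writeList[0]
		n, err := syscall.Write(c.fd, b)
		if n > 0 {
			c.writeList[0] = b[n:]
		}
		if err != nil || n < len(b) {
			return // not fully writable yet (or a real error); try again later
		}
		c.writeList = c.writeList[1:]
	}
	// Queue drained; deregistering the write event is omitted in this sketch.
}
```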

The benchmark results do not differ by much and are not strong enough evidence of an improvement. Also, in real scenarios over the public internet, TCP window-size negotiation is slower than in the local test environment.

limpo1989 commented 1 month ago

In LT mode, the absence of a writable event right now does not mean the socket cannot be written to, because the write event registered this time will not actually be triggered until the next round of epoll_wait.
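
A simplified illustration of that timing with golang.org/x/sys/unix (a sketch of a generic level-triggered loop, not nbio's event loop): after EPOLL_CTL_MOD adds EPOLLOUT to the interest set, the writable event can only be observed on the next epoll_wait call, even if the socket is already writable at the moment of registration.

```go
package main

import "golang.org/x/sys/unix"

// registerWrite adds EPOLLOUT to the fd's interest set (level-triggered).
// The socket may already be writable right now, but this registration is
// only reported when the event loop calls epoll_wait again.
func registerWrite(epfd, fd int) error {
	ev := unix.EpollEvent{
		Events: unix.EPOLLIN | unix.EPOLLOUT,
		Fd:     int32(fd),
	}
	return unix.EpollCtl(epfd, unix.EPOLL_CTL_MOD, fd, &ev)
}

// loopOnce is one round of the event loop: only here, on the next round,
// does the kernel report EPOLLOUT for the fd registered above.
func loopOnce(epfd int, events []unix.EpollEvent) (int, error) {
	return unix.EpollWait(epfd, events, -1)
}
```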

Perhaps this change will improve intranet services, such as an RPC framework serving an intranet. 🤣

lesismal commented 1 month ago

> In LT mode, if there is no writeable event at present, it does not mean that the socket cannot be written

No matter which epoll mode is used, the fd may become writable in the very next nanosecond, after epoll_wait returns and before we check the event type. But most of the time, if there is no write event, the fd is not writable at this moment. If we didn't flush according to the write event, then we wouldn't need to check the write event at all, and the kernel wouldn't need to report it anymore, because the fd might become writable at any moment. :joy: