socketry / async

An awesome asynchronous event-driven reactor for Ruby.
MIT License

Performance issues with Fiber scheduler when overloading. #105

Closed · jsaak closed this 3 years ago

jsaak commented 3 years ago

I tested some Ruby TCP server solutions and benchmarked the results:

https://github.com/jsaak/ruby3-tcp-server-mini-benchmark

Using Async directly, the latency was below 1ms, which is good; however, when using the fiber scheduler it became much slower:

Latency (wrk -t3 -c6)      Avg     Stdev     Max   +/- Stdev
async-scheduler.rb:     3.03ms   14.63ms 217.58ms   94.82%
async.rb:               0.20ms    0.04ms   0.86ms   81.46%
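For context, every server variant in the benchmark implements the same tiny request/response cycle. Here is a hypothetical stdlib-only sketch of that cycle (thread-based purely for illustration; the benchmarked variants use Async or a fiber scheduler instead, and the port number is arbitrary):

```ruby
require 'socket'

server = TCPServer.new('localhost', 9091)

# Serve exactly one request in a background thread:
# read (up to 1 KiB of) the request, answer 204 No Content, close.
handler = Thread.new do
  client = server.accept
  client.recv(1024)
  client.send("HTTP/1.1 204 No Content\r\nConnection: close\r\n\r\n", 0)
  client.close
end

# Exercise it once with a plain blocking client.
socket = TCPSocket.new('localhost', 9091)
socket.write("GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
response = socket.read
socket.close
handler.join
server.close

puts response.split("\r\n").first # => HTTP/1.1 204 No Content
```

wrk then hammers this loop over keep-alive-less connections, so the measured latency is dominated by how quickly the event loop can accept, read, and write.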
ioquatix commented 3 years ago

The reason is probably overhead from the way Ruby's I/O works in 3.0, which we hope to improve for 3.1. However, if you want to confirm this, can you please benchmark it using perf diff?

jsaak commented 3 years ago

It is specific to your implementation; the libev implementation is not affected:

libev-scheduler.rb      0.07ms   39.43us   0.89ms   88.08%
ioquatix commented 3 years ago

Okay, I will take a look, thanks for the details.

ioquatix commented 3 years ago

Thanks for creating this benchmark, it's quite a useful one.

I have to normalise the numbers against the simple.c implementation.

On my desktop, simple.c gives me:

> wrk -t1 -c1 http://localhost:9090
Running 10s test @ http://localhost:9090
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.07us   28.08us   1.96ms   98.81%
    Req/Sec    23.27k   779.55    24.63k    66.00%
  231547 requests in 10.00s, 10.16MB read
Requests/sec:  23154.44
Transfer/sec:      1.02MB

Currently, the new implementation of async gives me:

> wrk -t1 -c1 http://localhost:9090
Running 10s test @ http://localhost:9090
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    50.32us   93.07us   2.32ms   97.60%
    Req/Sec    13.31k   415.33    13.93k    74.26%
  133767 requests in 10.10s, 5.87MB read
Requests/sec:  13245.11
Transfer/sec:    595.00KB

However, this is using the IO wrappers. I want to try with native IO.

ioquatix commented 3 years ago

I need to check this more, but here is the initial result for using native IO within Async:

> wrk -t1 -c1 http://localhost:9090
Running 10s test @ http://localhost:9090
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    34.12us  119.76us   5.81ms   97.67%
    Req/Sec    17.73k   654.55    18.59k    82.18%
  178210 requests in 10.10s, 7.82MB read
Requests/sec:  17645.03
Transfer/sec:    792.65KB

It's about 80% of the C implementation's throughput.

#!/usr/bin/env ruby

require_relative 'lib/async'
require 'socket'

Async do |task|
  server = TCPServer.new('localhost', 9090)

  loop do
    client, _address = server.accept

    # Handle each connection in its own task so `accept` can continue.
    task.async do
      client.recv(1024)
      client.send("HTTP/1.1 204 No Content\r\nConnection: close\r\n\r\n", 0)
      client.close
    end
  end
end
ioquatix commented 3 years ago

Increasing the concurrency gives us an improved throughput.

Here is simple.c:

> wrk -t3 -c6 http://localhost:9090
Running 10s test @ http://localhost:9090
  3 threads and 6 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    78.90us  197.55us   5.73ms   97.13%
    Req/Sec    19.79k     1.28k   22.89k    69.31%
  596464 requests in 10.10s, 26.17MB read
Requests/sec:  59055.53
Transfer/sec:      2.59MB

versus Async:

> wrk -t3 -c6 http://localhost:9090
Running 10s test @ http://localhost:9090
  3 threads and 6 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   189.14us  138.12us   3.59ms   93.74%
    Req/Sec     9.04k   602.45    10.55k    68.65%
  272489 requests in 10.10s, 11.95MB read
Requests/sec:  26979.59
Transfer/sec:      1.18MB

I think we need to do some more work on this. I'll check what perf says. However, even being 2x slower than C is pretty good.
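Normalising the requests/sec figures quoted above against simple.c is simple arithmetic:

```ruby
# Requests/sec reported by wrk in the runs above.
c_single         = 23154.44 # simple.c,          -t1 -c1
async_single     = 17645.03 # Async + native IO, -t1 -c1
c_concurrent     = 59055.53 # simple.c,          -t3 -c6
async_concurrent = 26979.59 # Async,             -t3 -c6

single_ratio     = async_single / c_single
concurrent_ratio = async_concurrent / c_concurrent

puts format('single connection: %.0f%% of C', single_ratio * 100)     # => 76% of C
puts format('six connections:   %.0f%% of C', concurrent_ratio * 100) # => 46% of C
```

So the single-connection case is roughly the "80% of C" quoted above, while under concurrency Async currently sustains a bit under half of C's throughput, i.e. the "2x slower" figure.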

jsaak commented 3 years ago

These numbers seem much better than what I measured. Good work. I recommend taking a look at libev_scheduler; with sub-millisecond max latency it is quite impressive. But 3-5ms is a sane number; I think most applications can work with that.

ioquatix commented 3 years ago

For this small benchmark, max latency should be < 1ms.

Actually, our implementation of the event scheduler should be slightly more efficient than libev. However, my computer is quite old (Intel 4770).

Also, I have some issues with how wrk measures latency. I think we need to see a histogram to understand latency better.
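The point about histograms can be illustrated with a small sketch: wrk's avg/stdev/max summary collapses the distribution, while even crude power-of-two bucketing keeps the tail visible. The sample data below is made up; only the bucketing technique is the point.

```ruby
# Bucket latency samples (microseconds) into power-of-two bins,
# so slow outliers show up instead of being hidden in avg/stdev.
def latency_histogram(samples_us)
  histogram = Hash.new(0)
  samples_us.each do |us|
    # Bucket upper bound: the smallest power of two >= the sample.
    bucket = 2**Math.log2(us).ceil
    histogram[bucket] += 1
  end
  histogram.sort.to_h
end

# Made-up sample: mostly fast requests with one slow outlier.
samples = [15, 18, 22, 30, 35, 40, 55, 70, 90, 1960]
histogram = latency_histogram(samples)

histogram.each do |upper, count|
  puts format('<= %6dus: %s', upper, '*' * count)
end
```

With a view like this, one 2ms outlier among sub-100us requests is immediately obvious, whereas in the wrk summary it only inflates stdev and max.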

ioquatix commented 3 years ago

By the way, the above numbers are generated by the new io_uring implementation. It currently performs operations synchronously; I think we can find some improvements through better management of the submission queue (SQ).

ioquatix commented 3 years ago

Okay, I have taken simple.c and made a direct comparison with our new event handling library:

https://github.com/socketry/event/runs/2538704259?check_suite_focus=true#step:6:23

We could consider adding the others too. This runs as part of GitHub Actions, so we can keep track of it over time.

I hope that, going forward, we will attain performance close to that of the C implementation.
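Tracking numbers in CI like this could be turned into an automated gate. A hypothetical sketch (the function name, baseline figure, and 10% tolerance are all assumptions, not anything the project actually uses):

```ruby
# Flag a performance regression: true when the current requests/sec
# falls more than `tolerance` below the recorded baseline.
def regression?(baseline_rps, current_rps, tolerance: 0.10)
  current_rps < baseline_rps * (1.0 - tolerance)
end

baseline = 17645.03 # e.g. the Async + native IO figure above

puts regression?(baseline, 17200.0) # => false (within 10% of baseline)
puts regression?(baseline, 12000.0) # => true  (a real drop)
```

A CI job could then fail the build whenever the check returns true, so throughput drops are caught at the commit that introduced them.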