wrk2 produces incorrect results

berwynhoyt commented 1 year ago

As you can see in my write-up here, wrk2 can produce bad results under certain conditions. For example:

wrk2/wrk -d5 -c1000 -t250 -R 10000000 "http://localhost:8085/multiply?a=2&b=3"
Running 5s test @ http://localhost:8085/multiply?a=2&b=3
  250 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.97s     1.28s    4.93s    60.75%
    Req/Sec       -nan      -nan   0.00      0.00%
  833071 requests in 1.25ms, 223.25MB read
  Socket errors: connect 0, read 0, write 0, timeout 29
Requests/sec: 663801593.63
Transfer/sec:    173.72GB

See how I specified 250 threads, and a 5s test? Well, it did create 833071 requests in 5s, but as you see, it thinks it did it in 1.25ms, producing a ridiculous figure of 663 million requests/sec.
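A minimal sketch of that sanity check: it parses the summary line from the output above and recomputes the rate twice, once from the elapsed time wrk2 reports and once from the requested `-d5` duration. The names `parse_summary`, `rate` and `UNIT_SECONDS` are hypothetical helpers, not part of wrk2, and the list of unit suffixes is an assumption about what wrk2 prints.

```python
import re

# Summary line copied verbatim from the wrk2 output above.
SUMMARY_LINE = "  833071 requests in 1.25ms, 223.25MB read"

# Seconds per unit suffix. Assumption: these are the only suffixes wrk2 prints.
UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_summary(line):
    """Return (request_count, reported_elapsed_seconds) from a summary line."""
    match = re.search(r"(\d+) requests in ([\d.]+)(us|ms|s|m|h),", line)
    if match is None:
        raise ValueError(f"not a wrk2 summary line: {line!r}")
    count = int(match.group(1))
    elapsed = float(match.group(2)) * UNIT_SECONDS[match.group(3)]
    return count, elapsed


def rate(count, elapsed_seconds):
    """Requests per second for a given request count and elapsed time."""
    return count / elapsed_seconds


requests, reported_elapsed = parse_summary(SUMMARY_LINE)
requested_duration = 5.0  # seconds, from the -d5 argument

print(f"rate from reported elapsed time: {rate(requests, reported_elapsed):,.0f} req/s")
print(f"rate from requested duration:    {rate(requests, requested_duration):,.0f} req/s")
```

Assuming the run really lasted the requested 5s, the same request count works out to roughly 167 thousand requests/sec, which suggests the request count is plausible and only the elapsed time is wrong.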

It doesn't always think it's finished in milliseconds. Sometimes it is more like 1s and other times closer to 5.

You can check out my repository that uses wrk2 here if you want to reproduce the bug.

kchgoh commented 12 months ago

I'm also evaluating which load tool to use, so I'm glad to have come across your write-up and finding. Just wondering about 2 points:

1. Is there any reason to use a high number of threads? Unless I'm misunderstanding something, the load on the server is determined by the total number of connections (1000 in the example) rather than by the number of threads; and I think ideally the number of threads should not exceed the number of CPU cores by too much, to minimise context switching. I would be interested to know how it behaves if you use, say, 4 threads while keeping 1000 connections.
2. Is there any reason to run just a short 5s test? I don't know if it's relevant to this, but the README mentions:

It's important to note that wrk2 extends the initial calibration period to 10 seconds (from wrk's 0.5 second), so runs shorter than 10-20 seconds may not present useful information

I checked both wrk2 and wrk's documentation and couldn't seem to find what the calibration is for though.

berwynhoyt commented 12 months ago

Good questions.

Re (1), there is no reason to use so many connections except that the bogus results became most apparent when I did. Note that I found the most reliable results when I set #threads == #connections. My own project found that between 10 and 40 produced the maximum number of requests.

Re (2), I did not try that same test with a 10s period. I will do so now, on your prompting:

wrk2/wrk -d10 -c10 -t10 -R 10000000 "http://localhost:8085/multiply?a=2&b=3"

I get much more reasonable results, though they still range between 200,000 and 500,000 requests, which is 2 to 4 times what I get with any other tool, so I think they're still not correct.

berwynhoyt commented 12 months ago

In that last 10-second test, the problem still seems to be the time it thinks it took to finish, which ranges from 3 to 10s (when it actually took 10s).

kchgoh commented 10 months ago

Not sure if this topic is still of interest... Recently I had some time to read the source code, and I think the following might be an explanation:

There is a "calibration" period every time a test is started. The period is hard-coded as 10 seconds + number of connections * 5 millis (`uint64_t calibrate_delay = CALIBRATE_DELAY_MS + (thread->connections * 5);`). The 90th percentile latency received during that period is used to determine the sampling interval used to collect data for the summary stats (`long double interval = MAX(latency * 2, 10)`).
Once the calibration period ends and the sampling interval has been determined, it clears the latency values collected during the calibration period and starts collecting from scratch.

So in the case of using 1000 connections, the test duration should ideally be > 15 sec, otherwise it would still be in the middle of the calibration period. I found it's more reliable to test with a duration of 60s.
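A minimal sketch of the arithmetic described above, mirroring the two quoted C expressions. `CALIBRATE_DELAY_MS = 10_000` is assumed from the "10 seconds" stated above, and the function names are hypothetical. Whether `thread->connections` holds the total or a per-thread connection count is not settled in this thread, so the sketch simply takes the count as a parameter.

```python
# Assumption: taken from the "10 seconds" described above, in milliseconds.
CALIBRATE_DELAY_MS = 10_000


def calibrate_delay_ms(connections):
    """Mirror of: CALIBRATE_DELAY_MS + (thread->connections * 5)."""
    return CALIBRATE_DELAY_MS + connections * 5


def sampling_interval_ms(p90_latency_ms):
    """Mirror of: MAX(latency * 2, 10), with latency in milliseconds."""
    return max(p90_latency_ms * 2, 10)


def duration_covers_calibration(duration_s, connections):
    """True if the test runs longer than the calibration period."""
    return duration_s * 1000 > calibrate_delay_ms(connections)


for duration_s in (5, 10, 60):
    covered = duration_covers_calibration(duration_s, connections=1000)
    print(f"-d{duration_s} with 1000 connections: outlasts calibration = {covered}")
```

Under these assumptions, 1000 connections give a 15,000 ms calibration period, so a 5s or 10s run ends before calibration does, while a 60s run leaves about 45s of measured data.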

Searching wrk's issue discussions, it seems wrk used to have this calibration period too, but it was removed around 2018 (https://github.com/wg/wrk/issues/280#issuecomment-359228266). If I find some time, I'll try removing it in my local build to see whether that allows running with a short test duration.

berwynhoyt commented 10 months ago

If you are able to improve this, that would be FAB!

giltene / wrk2

wrk2 produces incorrect results #138