All of these values are very useful in different situations, so optimally a section for each sort, perhaps with a header linking to each?
Sure, I think that information is useful. However, it SHOULD be displayed somewhere, not necessarily on the main table ...
In order to keep it simple, I suggest choosing 3 or 4 metrics to display on the main table:
Hmm, for active display, perhaps:
Honestly, errors probably don't need to be shown; if the count is ever anything but 0 then the test is broken and it shouldn't be posted anyway.
@tbrand To simplify reading, I think we should display only ONE table of results on README.md (sorted by rank).
@tbrand @OvermindDL1 I have pushed new results => https://github.com/waghanza/which_is_the_fastest/blob/wrk/README.md#result, computed using our 3 rules.
It's interesting to see how these results can change. Last time this file was updated, `rayo` was doing ~87K reqs/s and `polka` around ~86K reqs/s. Now they have both dropped by 10K+ reqs/s each and the order has shifted.
What kind of consistency are you providing/considering for these tests?
Consistently, in all my tests `rayo` is still faster.
Hi @aichholzer, we are adjusting test parameters (duration, request count, ...). I hope to release stable results soon.
@waghanza I'm not sure that adjusting such things should cause such shifting results, that implies to me the server it is running on is either running other code, the network is chatty, it's a cloud machine or VM, etc... etc... My results were fairly stable, less than 1% variance between runs.
On the last results it shows this:

```
OS: Linux (version: 4.16.11-100.fc26.x86_64, arch: x86_64)
CPU Cores: 8
threads: 9, requests: 100000.0
```
Is that the `wrk` machine or the machine the servers are running on? Are they running on the same machine (in which case that thread count is super WRONG)? Why is the request count so very low (many of the servers will not even last a second with such a small amount), and why is it not time-based instead of request-based? A time-based run ensures the servers warm up properly (especially important for JIT'd languages) and lets the siege spool up fully (which, with only 100000 requests, is too short for many). Also, why is it reporting `100000.0` requests when you can only send an integral number of requests? Is the `CPU Cores` count the number of physical cores or the number of hyperthreads?
@OvermindDL1 sure, I know ...
However, I think we SHOULD NOT stop running tests ... I know that your 16-core server is waiting ... I'm trying to find a way to automate all of this (my idea is to compute results and display them per hardware context, like tfb does: physical, cloud, but in a way that COULD be reproduced by others).
What's the easiest way to get wrk installed on Ubuntu 18.04 with dependencies and such?
@btegs

```sh
sudo apt-get install build-essential libssl-dev git -y
git clone https://github.com/wg/wrk.git wrk
cd wrk
sudo make
# move the executable to somewhere in your PATH, ex:
sudo cp wrk /usr/local/bin
```
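Once built, a quick sanity check is to run it against a local server. A time-based run (rather than a fixed request count) looks roughly like this; the thread/connection/duration values and the URL are only illustrative, not this project's actual settings:

```sh
# 8 wrk threads, 256 open connections, run for 30 seconds;
# --latency prints the latency distribution (50/75/90/99th percentiles)
wrk -t8 -c256 -d30s --latency http://127.0.0.1:3000/
```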
@tbrand Thinking about it, I am not sure that ranking frameworks has to be done.
Actually, we display ranking information only from the `req/s` perspective.
However, each framework has (at least) 3 kinds of information:
- `req/s`
- `latency`
- `throughput` (especially relevant in wireless environments)
@OvermindDL1 what do you think ?
`req/s` and `latency` are definitely the big ones to show. As for `throughput` in relation to wireless and such, I think it is less important; rather, the size of each request (explicitly noting that this is with each framework's default header set) could be useful to list if you want to go that way (easily tested by doing a request against a server and just seeing the total size returned). For throughput to be a useful bit of information overall, each server would need to return the exact same information, which is doable though.
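For example, the total size returned by a framework's default response (body plus headers) can be checked with a single request; the URL here is just a placeholder:

```sh
# %{size_download} = body bytes received, %{size_header} = header bytes received
curl -s -o /dev/null -w 'body: %{size_download} B, headers: %{size_header} B\n' http://127.0.0.1:3000/
```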
Sure, we could optimize all these frameworks (for `throughput`), but is that really what developers do?
As a developer, I prefer (mostly) to use things out-of-the-box (no customising ...), and would it be fair to show results after tweaking rather than the default ones?
Heh, exactly my point, thus `throughput` is probably not useful to rank on then, as they are not testing the same thing.
@OvermindDL1 that's why I propose to remove the ranking from the README, or to have a stronger ranking (e.g. using at least `req/s` and `latency`).
For example, `akkahttp` is 16/40 but has a huge latency, much higher than `perfect` (which is 17/40):
| Language (Runtime) | Framework (Middleware) | Requests / s | Latency | 99th percentile | Throughput |
|---|---|---|---|---|---|
| scala | akkahttp | 59853.33 | 210753.00 | 4777700.33 | 47.02 MB |
| swift | perfect | 56951.00 | 17329.67 | 24624.33 | 16.87 MB |
The actor model tends to have high latency in exchange for throughput; the BEAM languages (erlang/elixir) tend to do the same. But yeah, a strong ranking like that is good, graphs especially to give more visual details.
I was surprised that akka was that high when I saw it though; I wondered why but haven't hooked up a profiler as of yet... ^.^;
@OvermindDL1 I think we can remove this ranking (from the current README), in order to open a clean PR for this.
@tbrand what do you think?
@waghanza Ranking is the most valuable metric for this repository. I guess that people who give us stars want to know the simple results, not complex profilings.
Of course we can show the detailed results (expanded markdown would be ideal), but I want to show a simple ranking at first view, as I said somewhere.
@tbrand sure, I'm totally up for this idea.
However, the current ranking is not accurate (it is based only on `req/s`).
For example, take the battle between the frameworks in:
- node
- ruby
- php
- python

The current ranking shows:
1. node
2. php
3. python
4. ruby

but, when taking `req/s` + `latency`, we have:
1. node
2. ruby
3. python
4. php
My point is NOT to remove the ranking definitively, but to avoid showing misleading results, either by removing it (temporarily) from the README or by using `latency`.
I personally do not have enough time to deal with this.
I think we SHOULD remove the ranking and have a banner on the README that explains why there is no ranking info.
What do you think?
What does `req/s` + `latency` mean? Just sum them up?
For users, req/sec is the most important metric, not latency and throughput.
@tbrand Yes, this example was just a poor / basic operation:
rank => (+) `req/s` (-) `latency`
`latency` is important since it represents the time between request and response.
But it's duplicated, since `req/sec` already includes the latency metric.
As you say, it's too poor since their units are different (`req/sec` is [scalar/sec], `latency` is [sec]).
@tbrand `req/s` is only https://github.com/wg/wrk/blob/master/SCRIPTING#L108
From what I have understood over the last weeks (working on this project):
- `req/s` refers to the number of requests sent (to the server)
- `latency` refers to the number of requests received (of those that were sent)

@OvermindDL1 Am I right?
PS: For me, `ranking` SHOULD reflect what the consumer feels when using an app.
What? So the unit of `req/sec` is a scalar? Then the name `req/sec` is completely wrong!
Incorrect:
- `req/sec`: the number of completed requests per second, as in open the connection, set up the TCP state, send headers, receive data back (traditionally closing the connection is not counted, and I don't think it is here either).
- `latency`: the amount of time it took to complete a request on average, from connecting to receiving the completed data (which is a single packet each way after setup in these simplified tests). This is why it has things like the 99th percentile and so forth: the 99th percentile shows what the slowest 1% of requests take, and the 99.99th percentile shows what the slowest 0.01% take, which are highly useful metrics for showing worst-case timings (hence traditionally you have 50, 75, 90, 99, 99.99, 99.9999 or so for the percentiles).
Subtracting latency from req/sec is not a useful metric as they have different magnitudes.
Thus a library that can do 500k req/sec at an average latency of 300ms and a 99th percentile latency of 1200ms, and another library that can do 200k req/sec at an average latency of 20ms and a 99th percentile latency of 80ms have super different results depending on what you are optimizing for. In general the one with higher latency but higher req/sec is better when you want to optimize for massive connection counts on a single server, and the one with lower req/sec but super low latency is fantastic for low-rate servers that need to return data super-fast (think of an advert server). (And of course some languages/frameworks are great at both.)
@OvermindDL1 is right, sorry for the misleading wording.
@OvermindDL1 @tbrand So the question is: which metric(s) SHOULD be used to represent the rank (and in which context)?
For me, but perhaps because of my background, I think we want to rank in a web context (obviously `back`-end). So, I think, taking only `req/s` is NOT ideal / accurate.
@tbrand If we refer to the initial goal in the README:

> Measuring response times

I think ranking by `latency` is more relevant than by `req/s`.
Ah, I missed that. If "Measuring response times" is paramount, then latency is what is wanted (at the very least average latency and, say, 99th or 99.99th percentiles, or just average and deviation).
@tbrand We can also display two tables on the README:
- `latency`
- `req/s`
The fact is we are comparing web tools. For me, response time (`latency`) is the important aspect from a so-called back-end perspective (API), and request rate (`req/s`) is the important aspect from a so-called front-end perspective.
@OvermindDL1 That's why I think displaying two tables (+ explanation) would be accurate.
@OvermindDL1 just for information:
With `req/s`, I found:
With `latency`, I found:
Makes sense, the C++ library I was using uses a deferred networking system behind it (libevent), so it tries to maximize throughput to handle the maximum number of connections possible via batching calls. I'm guessing the average Rust library does the opposite, but it would be interesting to see individual projects. Tradeoffs everywhere, and it's the kernel calls and such that determine how it will be tuned (with language overhead on top, of course). :-)
`Actix` (`Rust`) uses the same system as `evhtp`, but the `Rust` framework with the best rank on `latency` is `nickel` ;-)
It is also astonishing that `Ruby` (`roda`) is better than `C#` or `java`.
Ruby is very much built for low latency because of how long its code can take to run, but that means it can't overlap calls easily, hence low latency but horrible req/sec.
Sure, but it makes me wonder whether low `latency` is the only metric we should use for `back-end` stuff.
Some `back-end` stuff, like APIs in `python`, is faster than the `ruby` ones.
However, in the results `japronto` has a `latency` of 21209 but `roda` has a `latency` of 3166.
@OvermindDL1 @tbrand I was thinking of a display, like this
Looks good, though I'd recommend stating what unit the latency values are in; I'm guessing it is microseconds (µs)?
@OvermindDL1 good catch, but I prefer to display in ms, I find it more human-readable.
Likewise, but the values that are there now did not seem to be in ms, thus a conversion is necessary. :-)
TBH, I didn't follow the full thread, but this seems like the appropriate place to leave this concern:
The frameworks should all be deployed on a DigitalOcean droplet with 1 CPU only. A lot of these frameworks (or the languages themselves) naturally take advantage of all available cores, while most others require manual setup / a flag to enable such behavior.
For example, Crystal & Node.js require manual process forking. Rust frameworks are variable, but most will automatically spread across cores. And I believe `go/fasthttp` fans out across cores as well.
The point is that deploying to a single-core server normalizes all test subjects without any code changes. And when interpreting results, it's much easier to reason about since everything is forced to be single-core. You get a nice unit of comparison.
The `wrk` runner can & should run on a separate machine.
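Since the servers in this project already run in Docker containers (as mentioned further down), one way to apply the single-core constraint without touching any framework code would be to pin each container to one CPU. This is only a sketch; the image name and port are placeholders:

```sh
# Restrict the container to a single core (core 0), whatever the framework would otherwise use
docker run --cpus=1 --cpuset-cpus=0 -p 3000:3000 some-framework-image
```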
Hi @lukeed,
You have 2 suggestions.

> 1 CPU only droplets

Honestly, I didn't get the point. It's a language advantage to use all cores / CPUs. I do not understand why we should not take this into account.
And, for example, `pm2` is in charge of spreading the `node` process here :stuck_out_tongue:
> `wrk` on a separate machine

Sure, the metrics shown on README.md are currently computed from the docker host. The metrics will be computed on the cloud, of course with a remote sieger.
… `pm2` to fork for you, if available; (C) or load balance a cluster of 8 x 1CPU machines.
My point is that it's unfair to report requests/sec & latency for single-core vs multi-core languages without normalizing the behaviors. As mentioned, some frameworks within the same language have different approaches to multi-threaded processing. Running everything on a single core is a good unit of measure & is the easiest way to normalize across all behaviors.
At the end of the day, a multi-core server is a convenience. Any server built on any language/framework can utilize 8 cores one way or another. It's just a matter of whether it's built-in and done-for-you or whether it is not. That's why it's unfair to compare 8 against 1.
@lukeed We use `pm2`.
Here is the list of app servers in use:
All app servers are spread across all cores; it's not a framework feature, but an environment behavior.
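For reference, this is roughly how `pm2` is typically launched in cluster mode to use every core; the entry file name is a placeholder and the repository's exact invocation may differ:

```sh
# -i max starts one worker per available CPU core (pm2 cluster mode)
pm2 start app.js -i max
```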
The point you raise is right: it is not fair to compare single-thread / single-core languages with multi-core ones ... However, it's not fair either to downgrade the multi-core / multi-threaded ones.
I will say that metrics are important, but they are only metrics. We must rank depending on the use case; that's why I propose to split the ranks: `latency` for back-end stuff and `req/s` for front-end stuff.
I think it's not fair either to compare `node` and `haskell`; the only comparison is about performance, not language, and performance depends on the use case.
Even if some languages cannot utilize all cores, it's a language design choice (it COULD be either a `pro` or a `con`, depending on the use case).
I fully understand. But as a quick retort: your shortlist is missing Crystal & Rust. And, as a final point, you also introduce overhead which (of course) detracts some % from the overall performance. For example, since long-running uptime doesn't matter for a 5 minute benchmark, you could use this instead of `pm2` & the memory usage and latency would improve:
```js
const { cpus } = require('os');
const cluster = require('cluster');
const app = require('./app');

if (cluster.isMaster) {
  // Master: fork one worker per CPU core
  let i = 0, cores = cpus().length;
  for (; i < cores; i++) {
    cluster.fork();
  }
} else {
  // Worker: each one listens on the same port; the cluster module balances incoming connections
  app.listen(3000);
}
```
The same thing (barring different syntax) can be done in Rust & Crystal.
This is the other approach to "downgrading" the multi-core languages. The trouble is it's more work & requires more knowledge to do it correctly per language (and per framework).
FWIW, I keep saying "per framework" because there are Node.js & Rust frameworks that do the `cluster.fork`ing snippet above.
Of course, you're more than welcome to disregard all of this & keep reporting how you'd like 😅
If that's the case, my final request would be that you report the "per-thread" result from `wrk` and not the totals/averages.
> Your shortlist is missing Crystal & Rust

`Rust` is actually the first (`actix`), and `crystal` is further down the list (due to dockerization).
If you want to add another language, feel free to PR.

> And, as a final point, you also introduce overhead which (of course) detracts some % from the overall performance

Sure, I will take inspiration from tfb to produce more stable results (a warm-up phase before running the several phases).
> The same thing (barring different syntax) can be done in Rust & Crystal

You mean spreading across all cores?

> This is the other approach to "downgrading" the multi-core languages

How?
Yes, `actix` is a perfect example. It has multi-threading built into the framework. But if you used `tokio-minihttp`, another popular Rust server, it is single-threaded only & you (the developer) are supposed to handle the multi-threading yourself.
Effectively, `actix` includes my snippet above & `tokio-minihttp` doesn't. These are just two names, but this happens all the time across Node & Rust -- and I'm starting to see it in Crystal too.
So now you, the benchmark maintainer, need to know of these differences and actively maintain them, or request that PRs that introduce them as a test subject do their homework before merging. That's a lot more work!
This is why I initially suggested running all languages / frameworks on a single-core machine. It means you don't have to worry about all the details of each framework. All single-core languages can be clustered to operate as a multi-core cluster. Similarly, all multi-core languages can operate on a single core just fine. (Most multi-core-oriented languages deal with threads/coroutines anyway, and so are not limited to the number of physical cores on the machine.)
As a final note, here are some Polka (Node.js) numbers on my shitty laptop:

Plain, single-core

```
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.90ms    1.00ms   18.86ms   69.44%
    Req/Sec     3.64k    334.65     4.73k    71.53%
  146229 requests in 10.10s, 13.81MB read
Requests/sec:  14475.59
Transfer/sec:      1.37MB
```

PM2, 6 threads

```
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.99ms    2.30ms   73.22ms   94.71%
    Req/Sec    14.04k     1.44k    17.22k    86.00%
  558666 requests in 10.00s, 52.75MB read
Requests/sec:  55848.44
Transfer/sec:      5.27MB
```

Native Cluster, 6 threads

```
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.95ms    1.93ms   44.03ms   92.54%
    Req/Sec    14.80k     1.39k    20.93k    83.87%
  593223 requests in 10.10s, 56.01MB read
Requests/sec:  58727.34
Transfer/sec:      5.54MB
```
As you can see, `pm2` includes overhead that can be measured. You can't and don't want to be held accountable for knowing all the best practices or tricks for each language & framework.
As for `actix`, the implementation was done by @fafhrd91; I assume it is correct from a `rust` developer's point of view.
`pm2` is used to spread across all cores; I assume it is considered a best practice in the `node` community.
Neither I nor @tbrand (nor @OvermindDL1) COULD reasonably maintain ALL implementations; each developer has their own specialization(s).
Our goal is not to teach the world anything :stuck_out_tongue_closed_eyes: just to learn and to gather communities ;-)
For example, @thiamsantos made several implementations; the pending PR (`wai` / `haskell`) is not mine.
I understand your concern about how hard it is to maintain all this, but just because it is hard does not mean it is not doable :tada:
If you have any tips or advice, we are open.
The same retort COULD be made about https://www.techempower.com/benchmarks and any other benchmarking project.
Hi,
As a standard tool, we decided to go with `wrk`. This benchmarking tool gives us a lot of useful information.
ALL of that information SHOULD be displayed, but not all of it has to be used to determine ranks (I mean displaying it is OK, but to keep it simple for now we SHOULD only use 1 metric to rank).
@OvermindDL1 What do you think about taking only the number of `requests-per-second` to `rank`?
Regards,