the-benchmarker / web-frameworks

Which is the fastest web framework?
MIT License

How to rank frameworks #238

Closed waghanza closed 6 years ago

waghanza commented 6 years ago

Hi,

As a standard tool, we decided to go with wrk.

This benchmarking tool gives us a lot of useful information.

ALL information SHOULD be displayed, but only some of it COULD be used to determine ranks (I mean displaying everything is OK, but to keep it simple, for now we SHOULD only use 1 metric to rank).

@OvermindDL1 What do you think about taking only the number of requests per second to rank?

Regards,

OvermindDL1 commented 6 years ago

All of these values are very useful in different situations, so optimally a section for each sort perhaps with a header jumping to each?

waghanza commented 6 years ago

Sure. I think that information is useful. However, it SHOULD be displayed somewhere, not necessarily on the main table ...

In order to keep things simple, I suggest choosing 3 or 4 metrics to display on the main table:

OvermindDL1 commented 6 years ago

Hmm, for active display, perhaps:

Honestly, errors probably don't need to be shown; if the count is ever anything but 0, then the test is broken and it shouldn't be posted anyway.

waghanza commented 6 years ago

@tbrand To simplify reading, I think we should display only ONE table of results on README.md (sorted by rank)

@tbrand @OvermindDL1 I have pushed new results => https://github.com/waghanza/which_is_the_fastest/blob/wrk/README.md#result, computed using our 3 rules.

aichholzer commented 6 years ago

It's interesting to see how these results can change. The last time this file was updated, rayo was doing ~87K reqs/s and polka around ~86K reqs/s. Now they have both dropped by 10K+ reqs/s each and the order has shifted.

What kind of consistency are you providing/considering for these tests? Consistently, in all my tests, rayo is still faster.

waghanza commented 6 years ago

Hi @aichholzer, we are adjusting test parameters (duration, request count ...). I hope to release stable results soon.

OvermindDL1 commented 6 years ago

@waghanza I'm not sure that adjusting such things should cause such shifting results, that implies to me the server it is running on is either running other code, the network is chatty, it's a cloud machine or VM, etc... etc... My results were fairly stable, less than 1% variance between runs.

OvermindDL1 commented 6 years ago

On the last results it shows this:

OS: Linux (version: 4.16.11-100.fc26.x86_64, arch: x86_64)
CPU Cores: 8
threads: 9, requests: 100000.0

Is that the wrk server, or is it the server that the frameworks are running on? Are they running on the same machine (in which case that thread count is super WRONG)? Why is the request count so very low (many of the servers will not even last a second with such a small amount), and why is it request-based instead of time-based (time-based runs ensure that the servers warm up properly, which is especially important for JIT'd languages, and let the siege spool up fully, which with only 100000 requests is too short for many)? Also, why is it reporting 100000.0 requests when you can only send an integral number of requests? Is the CPU Cores count the number of physical cores or the number of hyperthreads?
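
For comparison, a duration-based wrk invocation looks something like this (just a sketch; the thread/connection counts and URL are placeholders, not the project's actual settings):

```sh
# Run for a fixed 30 seconds instead of a fixed request count,
# and print the latency distribution (percentiles) at the end.
wrk -t8 -c256 -d30s --latency http://127.0.0.1:3000/
```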

waghanza commented 6 years ago

@OvermindDL1 sure, I know it ...

However, I think we SHOULD NOT stop running tests ... I know that your 16-core server is waiting ... I'm trying to find a solution to automate all these things (my idea is to compute results and display them in a multi-hardware context, like tfb does, physical and cloud, but in a way that COULD be reproduced by others).

btegs commented 6 years ago

What's the easiest way to get wrk installed on Ubuntu 18.04 with dependencies and such?

aichholzer commented 6 years ago

@btegs

```sh
sudo apt-get install build-essential libssl-dev git -y
git clone https://github.com/wg/wrk.git wrk
cd wrk
sudo make
# move the executable to somewhere in your PATH, ex:
sudo cp wrk /usr/local/bin
```
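
Once built, a quick sanity check might look like this (the URL and port are just placeholders for whatever local server you want to hit):

```sh
wrk --version
wrk -t2 -c64 -d10s http://127.0.0.1:8080/
```
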
waghanza commented 6 years ago

@tbrand Thinking about it, I am not sure that ranking frameworks has to be done.

Actually, we display ranking information only from the req/s perspective.

However, each framework has (at least) 3 kinds of information:

@OvermindDL1 what do you think ?

OvermindDL1 commented 6 years ago

req/s and latency are definitely the big ones to show; throughput in relation to what goes over the wire I think is less important. Rather, the size of each response (explicitly noting that it is with the framework's default header set) could be useful to list if you want to go that way (easily tested by making a request against a server and just looking at the total size returned). For throughput to be a useful piece of information overall, each server would need to return the exact same data, which is doable though.
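
For example (a quick sketch; the URL and port are placeholders), the default response size of a running server can be checked with curl:

```sh
# Print body and header sizes (in bytes) of a single request, discarding the body itself
curl -s -o /dev/null -w 'body: %{size_download} bytes, headers: %{size_header} bytes\n' http://127.0.0.1:3000/
```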

waghanza commented 6 years ago

Sure, we could optimize all these frameworks (for throughput), but is that really what developers do?

As a developer, I mostly prefer to use things out of the box (no customizing ...), and would it be fair to show results after tweaking, rather than default ones?

OvermindDL1 commented 6 years ago

Heh, exactly my point, thus throughput is probably not useful to rank on then as they are not testing the same thing.

waghanza commented 6 years ago

@OvermindDL1 That's why I propose removing the ranking from the README, or having a stronger ranking (e.g. using at least req/s and latency). For example, akkahttp is 16/40 but has a huge latency, much higher than perfect (which is 17/40):

| Language (Runtime) | Framework (Middleware) | Requests / s | Latency | 99 percentile | Throughput |
|---|---|---:|---:|---:|---:|
| scala | akkahttp | 59853.33 | 210753.00 | 4777700.33 | 47.02 MB |
| swift | perfect | 56951.00 | 17329.67 | 24624.33 | 16.87 MB |

OvermindDL1 commented 6 years ago

The actor model tends to have high latency in exchange for throughput; the BEAM languages (erlang/elixir) tend to do the same. But yeah, a strong ranking like that is good, graphs especially to give more visual detail.

I was surprised that akka was that high when I saw it though; I wondered why but haven't hooked up a profiler as of yet... ^.^;

waghanza commented 6 years ago

@OvermindDL1 I think we can remove this ranking (from the current README), in order to open a clean PR for this.

waghanza commented 6 years ago

@tbrand what do you think ?

tbrand commented 6 years ago

@waghanza Ranking is the most valuable metric for this repository. I guess that people who give us stars want to know the simple results, not complex profilings.

Of course we can show the detailed results (expandable markdown, ideally), but I want to show a simple ranking at first view, as I said somewhere.

waghanza commented 6 years ago

@tbrand Sure, I'm totally up for this idea. However, the current ranking is not accurate (based only on req/s).

For example, with the battle on frameworks in :

actual ranking shows :

  1. node
  2. php
  3. python
  4. ruby

but, when taking req/s + latency, we have :

  1. node
  2. ruby
  3. python
  4. php

My point is NOT to definitively remove the ranking, but to prevent showing misleading results, either by removing it (temporarily) from the README or by using latency.

Personally, I do not have enough time to deal with this.

I think we SHOULD remove the ranking + have a banner on the `README` that explains why there is no ranking info?

What do you think ?

tbrand commented 6 years ago

What does req/s + latency mean? Just sum them up? For users, req/sec is the most important metric, not latency and throughput.

waghanza commented 6 years ago

@tbrand Yes, that example was just a crude / basic operation:

rank => (+) req/s (-) latency

Latency is important since it represents the time between request and response.

tbrand commented 6 years ago

But it's duplicated, since req/sec already reflects latency. As you say, it's too crude since their units are different (req/sec is [scalar/sec], latency is [sec]).

waghanza commented 6 years ago

@tbrand req/s is only https://github.com/wg/wrk/blob/master/SCRIPTING#L108

That's what I have understood from the last weeks of working on this project.

@OvermindDL1 Am I right ?

PS: For me, the ranking SHOULD reflect what a consumer feels when using an app.

tbrand commented 6 years ago

What? So the unit of req/sec is a scalar? The name of req/sec is completely wrong!

OvermindDL1 commented 6 years ago

Incorrect:

- req/sec: the number of completed requests per second, as in open the connection, set up the TCP state, send headers, receive data back (traditionally closing the connection is not counted, and I don't think it is here either).
- latency: the amount of time it took to complete a request on average, from connecting to receiving the completed data (which is a single packet each way after setup in these simplified tests). This is why it has things like the 99th percentile and so forth: the 99th percentile shows what the slowest 1% of requests take, and the 99.99th percentile shows what the slowest 0.01% of requests take, which are highly useful metrics for showing worst-case timings (hence traditionally you have 50, 75, 90, 99, 99.99, 99.9999 or so for the percentiles).

Subtracting latency from req/sec is not a useful metric as they have different magnitudes.

OvermindDL1 commented 6 years ago

Thus a library that can do 500k req/sec at an average latency of 300ms and a 99th percentile latency of 1200ms, and another library that can do 200k req/sec at an average latency of 20ms and a 99th percentile latency of 80ms have super different results depending on what you are optimizing for. In general the one with higher latency but higher req/sec is better when you want to optimize for massive connection counts on a single server, and the one with lower req/sec but super low latency is fantastic for low-rate servers that need to return data super-fast (think of an advert server). (And of course some languages/frameworks are great at both.)

waghanza commented 6 years ago

@OvermindDL1 is right, sorry for the misleading words.

waghanza commented 6 years ago

@OvermindDL1 @tbrand So the question is: which metric(s) SHOULD be used to represent the rank (and in which context)?

For me, perhaps because of my background, I think we want to rank in a web context (obviously back-end). So I think taking only req/s is NOT ideal / accurate.
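
To illustrate one possible option (only a sketch, not a decided method): combine the two metrics as ordinal ranks rather than raw values, which sidesteps the unit mismatch @tbrand pointed out. Assuming a hypothetical whitespace-separated results.tsv with columns framework, req/s and average latency:

```sh
# Position by req/s (higher is better) and by latency (lower is better),
# then sum the two positions; the smallest combined score ranks first.
sort -k2,2nr results.tsv | awk '{print $1, NR}' | sort -k1,1 > rank_rps.tsv
sort -k3,3n  results.tsv | awk '{print $1, NR}' | sort -k1,1 > rank_lat.tsv
join rank_rps.tsv rank_lat.tsv | awk '{print $1, $2 + $3}' | sort -k2,2n
```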

waghanza commented 6 years ago

@tbrand If we refer to the initial goal stated in the README:

Measuring response times

I think ranking by latency is more relevant than ranking by req/s.

OvermindDL1 commented 6 years ago

Ah, I missed that. If Measuring response times is paramount, then latency is what is wanted (at the very least average latency and, say, the 99th or 99.99th percentile, or just average and deviation).

waghanza commented 6 years ago

@tbrand We can also display two tables on the README:

The fact is we are comparing web tools. For me, response time (latency) is the important aspect from a so-called back-end perspective (API), and request rate (req/s) is the important aspect from a so-called front-end perspective.

@OvermindDL1 That's why, I think, displaying two tables (+ an explanation) would be accurate.

waghanza commented 6 years ago

@OvermindDL1 just for information

With req/s, I found :

With latency, I found :

OvermindDL1 commented 6 years ago

Makes sense, the C++ library I was using uses a deferred networking system behind it (libevent), so it tries to maximize throughput to handle the maximum number of connections possible via batching calls. I'm guessing the average Rust library does the opposite, but it would be interesting to see individual projects. Tradeoffs everywhere and it's the kernel calls and such that determines how it will be tuned (with language overhead on top of course). :-)

waghanza commented 6 years ago

Actix (Rust) uses the same system as evhtp, but the Rust framework with the best latency rank is nickel ;-)

Still, it is astonishing that Ruby (roda) is better than C# or Java.

OvermindDL1 commented 6 years ago

Ruby is very much built for low latency because of how long its code can take to run, but that means it can't overlap calls easily, hence low latency but horrible req/sec.

waghanza commented 6 years ago

Sure, but it makes me ask myself whether low latency is the only metric we should use for back-end stuff.

Some back-end stuff, like APIs in python, is faster than the ruby equivalents; however, in the results japronto has a latency of 21209 while roda has a latency of 3166.

waghanza commented 6 years ago

@OvermindDL1 @tbrand I was thinking of a display like this:

Ranked by latency

| Language (Runtime) | Framework (Middleware) | Requests / s | Latency | 99 percentile | Throughput |
|---------------------------|---------------------------|----------------:|----------------:|----------------:|-----------:|
| ruby | roda | 38079.00 | 1735.67 | 27182.00 | 12.67 MB |
| ruby | rack-routing | 31005.33 | 2121.33 | 29419.33 | 5.65 MB |
| ruby | flame | 19051.33 | 3362.00 | 36238.00 | 3.50 MB |
| ruby | sinatra | 15901.00 | 4026.67 | 45941.33 | 13.54 MB |
| python | japronto | 82996.67 | 11618.67 | 13816.00 | 33.43 MB |
| ruby | rails | 4154.67 | 15418.00 | 99327.00 | 3.77 MB |
| node | polka | 77130.67 | 19931.33 | 323806.67 | 36.62 MB |
| node | fastify | 66254.67 | 23568.67 | 375535.33 | 65.65 MB |
| node | rayo | 64014.67 | 23681.33 | 396522.67 | 31.35 MB |
| node | express | 47145.33 | 46062.67 | 913263.33 | 39.29 MB |
| python | flask | 18397.00 | 56160.00 | 155783.00 | 14.98 MB |
| python | sanic | 14197.67 | 67893.33 | 132685.00 | 8.62 MB |
| python | django | 10958.00 | 94362.33 | 420568.00 | 10.30 MB |
| php | symfony | 42371.00 | 169928.00 | 2860775.33 | 70.48 MB |
| php | laravel | 30946.33 | 309246.00 | 4397344.67 | 49.90 MB |
| python | tornado | 1878.33 | 528993.67 | 3523509.00 | 1.34 MB |

Ranked by requests per second

| Language (Runtime) | Framework (Middleware) | Requests / s | Latency | 99 percentile | Throughput |
|---------------------------|---------------------------|----------------:|----------------:|----------------:|-----------:|
| python | japronto | 82996.67 | 11618.67 | 13816.00 | 33.43 MB |
| node | polka | 77130.67 | 19931.33 | 323806.67 | 36.62 MB |
| node | fastify | 66254.67 | 23568.67 | 375535.33 | 65.65 MB |
| node | rayo | 64014.67 | 23681.33 | 396522.67 | 31.35 MB |
| node | express | 47145.33 | 46062.67 | 913263.33 | 39.29 MB |
| php | symfony | 42371.00 | 169928.00 | 2860775.33 | 70.48 MB |
| ruby | roda | 38079.00 | 1735.67 | 27182.00 | 12.67 MB |
| ruby | rack-routing | 31005.33 | 2121.33 | 29419.33 | 5.65 MB |
| php | laravel | 30946.33 | 309246.00 | 4397344.67 | 49.90 MB |
| ruby | flame | 19051.33 | 3362.00 | 36238.00 | 3.50 MB |
| python | flask | 18397.00 | 56160.00 | 155783.00 | 14.98 MB |
| ruby | sinatra | 15901.00 | 4026.67 | 45941.33 | 13.54 MB |
| python | sanic | 14197.67 | 67893.33 | 132685.00 | 8.62 MB |
| python | django | 10958.00 | 94362.33 | 420568.00 | 10.30 MB |
| ruby | rails | 4154.67 | 15418.00 | 99327.00 | 3.77 MB |
| python | tornado | 1878.33 | 528993.67 | 3523509.00 | 1.34 MB |

OvermindDL1 commented 6 years ago

Looks good, though I'd recommend stating what unit the latency values are in; I'm guessing it is microseconds (µs)?

waghanza commented 6 years ago

@OvermindDL1 Good catch, but I'd prefer to display it in ms; I find that more human-readable.

OvermindDL1 commented 6 years ago

@OvermindDL1 Good catch, but I'd prefer to display it in ms; I find that more human-readable.

Likewise, but the values that are there now did not seem to be in ms, thus a conversion is necessary. :-)

lukeed commented 6 years ago

TBH, I didn't follow the full thread, but this seems like the appropriate place to leave this concern:

The frameworks should all be deployed on a DigitalOcean droplet with 1 CPU only. A lot of these frameworks (or the languages themselves) naturally take advantage of all available cores, while most others require manual setup / a flag to enable such behavior.

For example, Crystal & Node.js require manual process forking. Rust frameworks are variable, but most will automatically spread across cores. And I believe go/fasthttp fans out across cores as well.

The point is that deploying to a single-core server normalizes all test subjects without any code changes. And when interpreting results, it's much easier to reason about, since everything is forced to be single-core. You get a nice unit of comparison.

The wrk runner can & should run on a separate machine.
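
One way to approximate this without separate hardware, assuming the frameworks already run in Docker (the image name and port below are placeholders, not the project's actual setup):

```sh
# Constrain the framework container to a single core; wrk then runs from another machine
docker run --cpus=1 --cpuset-cpus=0 -p 3000:3000 some-framework-image
```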

waghanza commented 6 years ago

Hi @lukeed,

You have 2 suggestions.

  1. Deploy to 1 CPU only droplets

Honestly, I don't get the point. It's a language advantage to use all cores / CPUs. I do not understand why we should not take this into account?

And, for example, pm2 is in charge of spreading the node processes across cores here :stuck_out_tongue:

  2. wrk

Sure, the metrics shown on README.md are computed from the docker host. The metrics will be computed in the cloud, of course with a remote sieger.

lukeed commented 6 years ago
  1. Yes, but if you are running a single-threaded language, you almost always change the deployment configuration to account for it. So that, instead of paying for an 8-core machine (and only using 1), you either (A) manually fork the process, if possible; (B) use a tool like pm2 to fork for you, if available; (C) or load balance a cluster of 8 x 1CPU machines

My point is that it's unfair to report requests/sec & latency for single-core vs multi-core languages without normalizing the behaviors. As mentioned, some frameworks within the same language have different approaches to multi-threaded processing. Running everything on a single core is a good unit of measure & is the easiest way to normalize across all behaviors.

At the end of the day, a multi-core server is a convenience. Any server built on any language/framework can utilize 8 cores one way or another. It's just a matter of whether it's built-in and done for you or whether it is not. That's why it's unfair to compare 8 against 1.

waghanza commented 6 years ago

@lukeed We use pm2

Here is the list of app servers in use:

All app servers spread across all cores; it's not a framework feature, but an environment behavior.

The point you raise is right, it is not fair to compare single-thread / single-core languages with multi-core ones ... However, it's not fair to downgrade the multi-core / multi-threaded ones either.

I will say that metrics are important, but they are only metrics. We must rank depending on use case; that's why I propose splitting ranks: latency for back-end stuff and req/s for front-end stuff.

I think it's not fair either to compare node and haskell; the only comparison is about performance, not language, and performance depends on the use case.

Even if some languages cannot utilize all cores, it's a language design choice (it COULD be either a pro or a con, depending on the use case).

lukeed commented 6 years ago

I fully understand. But as a quick retort: your shortlist is missing Crystal & Rust. And, as a final point, you also introduce overhead which (of course) detracts some % from the overall performance. For example, since long-running uptime doesn't matter for a 5 minute benchmark, you could use this instead of pm2, and the memory usage and latency would improve:

```js
const { cpus } = require('os');
const cluster = require('cluster');
const app = require('./app');

if (cluster.isMaster) {
  // Fork one worker per available CPU core.
  let i = 0, cores = cpus().length;
  for (; i < cores; i++) {
    cluster.fork();
  }
} else {
  // Each worker shares the same listening port via the cluster module.
  app.listen(3000);
}
```

The same thing (barring different syntax) can be done in Rust & Crystal.

This is the other approach to "downgrading" the multi-core languages. The trouble is it's more work & requires more knowledge to do it correctly per language (and per framework).

FWIW, I keep saying "per framework" because there are Node.js & Rust frameworks that do the cluster.forking snippet above.

Of course, you're more than welcome to disregard all of this & keep reporting how you'd like 😅

If that's the case, my final request would be that you report the "per-thread" result from wrk and not the totals/averages.

waghanza commented 6 years ago

Your shortlist is missing Crystal & Rust

Rust is actually first (actix), and crystal is further down the list (due to dockerization). If you want to add another language, feel free to open a PR.

And, as a final point, you also introduce overhead which (of course) detracts some % from the overall performance

Sure, I will take inspiration from tfb to get more stable results (have a warm-up phase before running several *phases*).

The same thing (barring different syntax) can be done in Rust & Crystal

You mean spreading across all cores?

This is the other approach to "downgrading" the multi-core languages

How ?

lukeed commented 6 years ago

Yes, actix is a perfect example. It has multi-threading built into the framework. But if you used tokio-minihttp, another popular Rust server, it is single-threaded only & you (the developer) are supposed to handle the multi-threading yourself.

Effectively, actix includes my snippet above & tokio-minihttp doesn't. These are just two names, but this happens all the time across Node & Rust -- and I'm starting to see it in Crystal too.

So now you, the benchmark maintainer, need to know of these differences and actively maintain them, or request that PRs that introduce them as a test subject need to do their homework before merging. That's a lot more work!

This is why I initially suggested running all languages / frameworks on a single-core machine. It means you don't have to worry about all the details of each framework. All single-core languages can be clustered to operate as a multi-cored cluster. Similarly, all multi-core languages can operate on a single core just fine. (Most multi-core-oriented languages deal with threads/coroutines anyway, and so are not limited to number of physical cores on the machine.)

As a final note, here are some Polka (Node.js) numbers on my shitty laptop:

As you can see, pm2 includes overhead that can be measured. You can't and don't want to be held accountable for knowing all the best practices or tricks for each language & framework.

waghanza commented 6 years ago

As for actix, the implementation was done by @fafhrd91; I assume it is correct, coming from a rust developer.

pm2 is used to spread across all cores; I assume it is considered a best practice in node communities.
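
For reference, a minimal sketch of how pm2's cluster mode is usually launched (app.js here is just a placeholder entry point):

```sh
# -i max forks one instance per available CPU core
pm2 start app.js -i max
```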

Neither I nor @tbrand (nor @OvermindDL1) COULD reasonably maintain ALL implementations; each developer has their specialization(s).

Our goal is not to teach the world anything :stuck_out_tongue_closed_eyes: just to learn and to gather communities ;-)

For example, @thiamsantos made several implementations; the waiting PR (wai / haskell) is not mine.

I understand your concern about how hard it is to maintain all this, but just because it is hard does not mean it is not doable :tada:

If you have any tips or advice, we are open.

The same retort COULD be said about https://www.techempower.com/benchmarks and any other benchmarking project.