Do you have a reproducible benchmark we can use?
Undici fetch isn't as fast as the ones in browsers like Firefox / Deno / Workers.
I took a look at the Hacker News comment, and it did not include benchmarks either.
I'm inclined to close this unless somebody would like to prepare a cross-runtime benchmark suite.
Fair. I'll take a stab at extracting the relevant parts from my server to demonstrate this. I have to note that the slowness I observed was in a production env (running in a VM in a container behind a network load balancer) that also sees a tonne of other IO.
I wrote up a tiny DoH load-tester with python and golang (don't ask) which I intend to publish as a github action by tomorrow. In the meanwhile, here's a run from my laptop:
deno/runs:3748
| percentile | p10 | p50 | p75 | p90 | p95 | p99 | p99.9 | p99.99 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| milliseconds | 280 | 296 | 309 | 336 | 411 | 694 | 889 | 898 | 929 |
undici/runs:3826
| percentile | p10 | p50 | p75 | p90 | p95 | p99 | p99.9 | p99.99 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| milliseconds | 1140 | 1666 | 2171 | 2811 | 3302 | 3941 | 4501 | 4540 | 4995 |
node:naive-h2/runs:3791
| percentile | p10 | p50 | p75 | p90 | p95 | p99 | p99.9 | p99.99 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| milliseconds | 1713 | 2067 | 2570 | 3276 | 3975 | 4269 | 4336 | 4530 | 4534 |
After fighting a bit with github-actions, a poor man's profiler is finally up.
From the current run, here's Node:

node/runs:1200
| percentile | p10 | p50 | p75 | p90 | p95 | p99 | p99.9 | p99.99 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| milliseconds | 532 | 1215 | 1665 | 3359 | 3469 | 20953 | 21094 | 21125 | 21182 |
And here's Deno:

deno/runs:6300
| percentile | p10 | p50 | p75 | p90 | p95 | p99 | p99.9 | p99.99 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| milliseconds | 87 | 97 | 104 | 111 | 116 | 127 | 165 | 184 | 189 |
These are measured with `Date.now` wrapped around the `fetch` call. Both local and remote caches are unlikely to interfere, because every DoH query is of the form `<random-str>.dnsleaktest.com`.
Anything in the `serverless-dns` code-base that might significantly interfere with timing the `fetch` calls is disabled.
The profiler forwards as many DoH requests as it can for 50s to 60s, after which it quits.
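For context, a minimal sketch of the measurement described above (illustrative names, not the actual serverless-dns profiler): wrap the `fetch` call with `Date.now` and read percentiles off the sorted samples.

```js
// Requires a runtime with a global fetch (Node 18+, Deno); helper names are hypothetical.
const samples = [];

async function timedFetch(url) {
  const start = Date.now();
  const res = await fetch(url);          // only the fetch call itself is timed
  samples.push(Date.now() - start);
  return res;
}

function percentile(p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// After a run: console.log(percentile(50), percentile(99), percentile(100));
```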
Deno's serveDoh doesn't slow down at all and blazes through 6300+ requests. Node's serverHTTPS (+ undici), on the other hand, manages 1200+ requests. I must note that the final batch of 100 queries, served after the ~1100th request, took 20s+ to complete.
Just a note: I haven't used Node's built-in perf module to profile, since that is un-importable on Deno (from what I can tell)... I hand-rolled a minimal and inefficient perf-counter of my own instead.
I might be a bit lost on your codebase, but I could not find undici being used anywhere.
I would note that there are plenty of things happening in that module. How have you identified the problem as being in undici's fetch implementation?
If by any chance you are testing the http/2 implementation, consider that https://github.com/serverless-dns/serverless-dns/blob/b36a9cdb5786b78a7dd8e22a2f0a767e9e340ab1/src/plugins/dns-operation/dnsResolver.js#L398 is highly inefficient, as you will incur significant overhead creating those http/2 connections. You have to use a connection pool.
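For illustration only (this is not the serverless-dns code, and the endpoint is just an example), a minimal sketch of reusing a single HTTP/2 session for many DoH requests instead of calling `http2.connect()` per request:

```js
const http2 = require('node:http2');

// One TLS + HTTP/2 handshake, shared by every query.
const session = http2.connect('https://cloudflare-dns.com');

function dohJsonQuery(name) {
  return new Promise((resolve, reject) => {
    const req = session.request({
      ':method': 'GET',
      ':path': `/dns-query?name=${name}&type=A`,
      accept: 'application/dns-json',
    });
    const chunks = [];
    req.on('data', (c) => chunks.push(c));
    req.on('end', () => resolve(Buffer.concat(chunks).toString()));
    req.on('error', reject);
    req.end();
  });
}

// Many queries multiplexed over the same connection:
Promise.all([dohJsonQuery('example.com'), dohJsonQuery('example.org')])
  .then(() => session.close());
```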
I might be a bit lost on your codebase, but I could not find undici being used anywhere.
Dang, we moved to `node-fetch` (to see if it was any better) and didn't revert back to `undici` (which we now have).
Here are the results from a Node + `undici` run:
node/undici/runs:1300
| percentile | p10 | p50 | p75 | p90 | p95 | p99 | p99.9 | p99.99 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| milliseconds | 200 | 602 | 965 | 1121 | 3083 | 12267 | 20400 | 20407 | 20467 |
How have you identified the problem as being in undici's fetch implementation?
What's measured is just this method, `dnsResolver.js:resolveDnsUpstream`, which doesn't do much other than construct a `Request` and call `cache-api` (does not exist on Node, and is disabled either way in these "profiler" runs), `fetch` (over `undici` on Node), or `doh2` (unused).
We then rolled our own DNS-over-UDP / DNS-over-TCP connection-pooled transport atop Node (`dgram` and `net`), and that showed considerable perf improvement, around 15x to 100x (a rough pooling sketch follows the table):
node/udp/runs:6600
| percentile | p10 | p50 | p75 | p90 | p95 | p99 | p99.9 | p99.99 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| milliseconds | 17 | 18 | 19 | 20 | 21 | 25 | 34 | 89 | 131 |
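A very rough sketch of the pooling idea (not the actual serverless-dns transport; `queryBuf` is assumed to be a pre-encoded DNS wire-format query): reuse a small pool of long-lived `dgram` sockets instead of opening one per query.

```js
const dgram = require('node:dgram');

const POOL_SIZE = 4;
const pool = Array.from({ length: POOL_SIZE }, () => dgram.createSocket('udp4'));
let next = 0;

// Naive: responses are matched per-socket, not by DNS transaction id.
function sendDnsQuery(queryBuf, onResponse) {
  const sock = pool[next++ % POOL_SIZE];   // round-robin over long-lived sockets
  sock.once('message', onResponse);
  sock.send(queryBuf, 53, '1.1.1.1');
}
```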
Of course, a standalone benchmark would be great to have, but I wanted to demonstrate the issue I see with the kind of workloads we specifically run (lots and lots of DoH requests), and that bypassing the `undici` / `node-fetch` polyfills in the code-path (or running on cloudflare-workers or deno, which bring their own `fetch` impl to the table) yields apparent improvements in IO.
I don't think anyone has optimized either fetch or Node's WHATWG Streams implementation, so this is a good point and there is likely a lot of headroom there.
Have you tried using `undici.request()` instead?
Are there some instructions on how to run the benchmarks? Being able to get some good diagnostics would definitely help.
Here are some numbers from our own benchmarks:
│ Tests │ Samples │ Result │ Tolerance │ Difference with slowest │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - fetch │ 20 │ 1028.31 req/sec │ ± 2.71 % │ - │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ http - no keepalive │ 10 │ 3891.51 req/sec │ ± 2.00 % │ + 278.44 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - pipeline │ 95 │ 6034.47 req/sec │ ± 2.95 % │ + 486.83 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ http - keepalive │ 50 │ 6382.57 req/sec │ ± 2.98 % │ + 520.68 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - request │ 15 │ 8528.35 req/sec │ ± 2.11 % │ + 729.35 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - stream │ 10 │ 10226.33 req/sec │ ± 2.66 % │ + 894.48 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - dispatch │ 50 │ 11399.31 req/sec │ ± 2.79 % │ + 1008.54 % │
There is likely a lot of room for improvement.
Have you tried using undici.request() instead?
I will do so over the coming days and report back.
Are there some instructions on how to run the benchmarks?
Apart from running the `profiler.yml` gh-action (on your own fork of `serverless-dns`), you could check out the codebase and then:
# clone serverless-dns
git clone https://github.com/serverless-dns/serverless-dns.git
# download deps
npm i
# install q, a golang doh client
# https://github.com/natesales/q
echo "deb [trusted=yes] https://repo.natesales.net/apt /" | sudo tee /etc/apt/sources.list.d/natesales.list > /dev/null
sudo apt-get update
sudo apt-get install q
# export QDOH, where the 'run' script goes looking for `q`
export QDOH=q
# install node v15+ / deno v1.17+, if missing
...
# doh query, <random-str>.dnsleaktest.com, is sent to cloudflare-dns.com
# run doh, on node + undici
timeout 60s ./run node p1
# doh query, <random-str>.dnsleaktest.com, is sent to cloudflare-dns.com
# run doh, on deno
timeout 60s ./run deno p1
# dns query, <random-str>.dnsleaktest.com, is sent to 1.1.1.1
# run udp/tcp dns, on node
timeout 60s ./run node p3
Here are some numbers from our own benchmarks:
Thanks. Sorry, if this is a noob question, but: how do I generate such logs myself for my tests?
Thanks. Sorry, if this is a noob question, but: how do I generate such logs myself for my tests?
Check out https://github.com/nodejs/undici/pull/1214 for the fetch benchmarks.
You can run our benchmarks by running:
PORT=3001 node benchmarks/server.js
PORT=3001 node benchmarks/benchmarks.js
Hello folks! Recently, I spent some time working on the `fetch` performance investigation and decided to create a report about my findings. I'll share an abstract here (the full report will probably become a blog post at some point).
After some analysis, I came up with a few PRs improving `fetch` performance directly:
However, the major bottleneck is in the `WebStreams` implementation in Node.js core; thus, I don't think there is a clear path for `undici.fetch` to improve its performance significantly without touching Node.js internals.
Firstly, the benchmark was unfair to `undici.fetch`, since it allocated the server response in a variable, causing event-loop blocking due to GC runs. I solved that in: https://github.com/nodejs/undici/pull/1283.
After several analyses, I have found that `WebStreams` is very slow (both in a microbenchmark and in the `fetch` benchmark); for reference, see https://github.com/RafaelGSS/nodejs-bench-operations/blob/main/RESULTS-v17.md#stream-readablejs. Thus, I did a comparison replacing `WebStreams` with native `Streams` and got a 60% ~ 90% improvement.
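As a rough illustration of that kind of comparison (not RafaelGSS's actual benchmark; numbers vary a lot by Node version, and the global `ReadableStream` needs Node 18+), one can push the same chunks through a WHATWG `ReadableStream` and a `node:stream` `Readable` and time both:

```js
const { Readable } = require('node:stream');

const CHUNKS = 10_000;
const chunk = Buffer.alloc(1024);

async function drainWebStream() {
  let i = 0;
  const rs = new ReadableStream({
    pull(controller) {
      if (i++ < CHUNKS) controller.enqueue(chunk);
      else controller.close();
    },
  });
  for await (const _ of rs) { /* drain */ }
}

async function drainNodeStream() {
  let i = 0;
  const rs = new Readable({
    read() { this.push(i++ < CHUNKS ? chunk : null); },
  });
  for await (const _ of rs) { /* drain */ }
}

(async () => {
  console.time('whatwg ReadableStream');
  await drainWebStream();
  console.timeEnd('whatwg ReadableStream');

  console.time('node Readable');
  await drainNodeStream();
  console.timeEnd('node Readable');
})();
```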
If you apply the above git diff and profile the application, it shows `createAbortError` as one of the functions consuming the most CPU time.
And these `AbortSignal`s are mostly created by `WebStreams` instances (even if the instance is not in use), e.g.:
So, avoiding the `AbortSignal` creation in `WebStreams` bumps the performance even more (150% ~ 200%, based on the initial benchmarks).
Then I tried avoiding `AbortController` (which creates the `AbortSignal` under the hood), and impressively, it bumped the performance by 20%!
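An unscientific micro-illustration of why that allocation matters (results will vary by Node version; the plain-object arm is just a baseline for allocation cost):

```js
const N = 100_000;

console.time(`new AbortController() x ${N}`);
for (let i = 0; i < N; i++) new AbortController();
console.timeEnd(`new AbortController() x ${N}`);

console.time(`plain object x ${N}`);
for (let i = 0; i < N; i++) ({ aborted: false, reason: undefined });
console.timeEnd(`plain object x ${N}`);
```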
In the first iteration, the `Clinic Doctor` report says something is wrong with `Active Handles`: they drop often, which is far from expected. For comparison, look at the `undici.request` graph:
There is probably a misconfiguration somewhere: I have found that `undici.fetch` drops many packets during the benchmark, causing TCP retransmissions.
Basically, when a packet is dropped on an active connection a retransmission is sent, and when that happens frequently, performance suffers.
TCP retransmissions using `undici.fetch`:
root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
LADDR:LPORT RADDR:RPORT RETRANSMITS
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52390 99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52400 99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52398 99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52402 99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52414 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52424 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52412 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52392 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52420 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52396 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52422 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52404 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52410 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52428 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52418 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52408 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52426 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52416 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52406 100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52394 101
For comparison, look at the retransmissions in the other runs:
root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
LADDR:LPORT RADDR:RPORT RETRANSMITS
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52526 1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52532 1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52506 1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52514 1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52524 1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52512 1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52518 1
And using `undici.request`:
root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
LADDR:LPORT RADDR:RPORT RETRANSMITS
EMPTY
Unfortunately, I haven't found out why it's happening. I tried, I swear :P.
@benjamingr why is creating abort signals so expensive?
I would guess it is because createAbortSignal + makeTransferable does a lot of prototype mutation stuff and a call into C++, which is basically the bane of JS engine performance.
I think we could move the logic of `createAbortSignal` into the `AbortSignal` constructor, sorta like this:
// kAborted, kReason and ERR_ILLEGAL_CONSTRUCTOR are Node's existing internal
// symbols/errors; kDoThing just guards the constructor against user code.
const kDoThing = Symbol('kDoThing');

function createAbortSignal(aborted, reason) {
  return new AbortSignal(aborted, reason, kDoThing);
}

class AbortSignal extends EventTarget {
  constructor(aborted, reason, thing) {
    if (thing !== kDoThing) throw new ERR_ILLEGAL_CONSTRUCTOR();
    super();
    this[kAborted] = aborted;
    this[kReason] = reason;
  }
}
The makeTransferable part is harder because of the C++, but I feel like we could at least make it more static than what `makeTransferable` is doing, similar to what I did above. It also seems weird to me that `JSTransferable` needs to exist at all; we already have the API with `kTransfer` and `kDeserialize` and whatnot, but I don't know all the specifics there.
@jasnell @addaleax
@devsnek I didn't mention the analysis I did on Node.js WebStreams (I'm planning to open an issue in Node.js soon), but indeed, `makeTransferable` has something to do with it for sure.
`makeTransferable` is an ugly hack, unfortunately. I'd love to find a way to handle it more elegantly, but we are fairly strictly limited by the way the structured clone algorithm works and the v8 implementation of it. I have some ideas on how to make improvements there.
It also seems weird to me that JSTransferable needs to exist at all; we already have the API with kTransfer and kDeserialize and whatnot, but I don't know all the specifics there.
The way it works is this: the structured clone algorithm only works by default with ordinary JavaScript objects with enumerable properties. Things like JavaScript classes are not supported. The only way to make objects like `WritableStream` and `ReadableStream` consistently transferable per the spec is to treat them as "host objects", and the `JSTransferable` class is exactly how we do that. (A host object is a JavaScript object that is backed by a C++ class.)
When a host object is passed into the structured clone algorithm implementation, v8 will call out to the serialization/deserialization delegate implementation that Node.js provides. Node.js' delegate implementation will check what kind of host object it is. If the host object is just a native C++ class that extends from `BaseObject`, then the C++ class will have functions on it for handling the serialization/deserialization at the native layer. If the host object is a `JSTransferable`, that means the serialization/deserialization is handled by JavaScript functions on the JavaScript wrapper object: specifically, the `[kTransfer]`, `[kClone]`, and `[kDeserialize]` functions. The delegate will ask the `JSTransferable` host object to serialize itself, and it will do so by calling out to those JavaScript functions. tl;dr: `JSTransferable` is how we allow a JavaScript class to act as a host object, and `kTransfer`/`kDeserialize` do the actual work of that.
But we have a problem here. Classes like `WritableStream` and `ReadableStream` are defined by the standards as not extending from any other class, and there are Web Platform Tests that verify that the implementation matches the Web IDL definition. Specifically, we can't have `class WritableStream extends JSTransferable`, because that would violate the standard definition of `WritableStream`.
The `makeTransferable()` utility is a hack to get around that. Instead of creating a `WritableStream` that extends from `JSTransferable`, `makeTransferable()` creates an instance of `JSTransferable` and uses the `WritableStream` instance as its prototype. It effectively inverts the inheritance (that is, it creates a `JSTransferable` that extends from `WritableStream`). This is a bit slow for several reasons that should be fairly obvious. Performance-wise, it's not great, but it allows us to implement the streams spec correctly.
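A conceptual sketch of that inversion (assumed names; this is not Node's actual `makeTransferable` source): allocate the host-object class with the stream's constructor as `new.target`, then graft the stream's own properties and prototype onto the result.

```js
// Stand-in for the C++-backed JSTransferable host object.
class HostTransferable {}

function makeTransferableLike(obj) {
  // Allocated as a HostTransferable, but with obj.constructor as new.target,
  // so the JS side still sees a WritableStream/ReadableStream-shaped object.
  const inst = Reflect.construct(HostTransferable, [], obj.constructor);
  Object.defineProperties(inst, Object.getOwnPropertyDescriptors(obj));
  Object.setPrototypeOf(inst, Object.getPrototypeOf(obj));
  return inst;
}
```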
Looking at this more, it seems like we just need to teach v8 how to recognize non-c++-backed "host" objects, and then we wouldn't need to add the JSTransferable superclass.
oop thank you for the writeup james
Yes, it would be ideal if the structured clone algorithm provided a way to allow arbitrary classes to define how they would be serialized / deserialized. I've raised this conversation with the WHATWG (https://github.com/whatwg/html/issues/7428) before and, unfortunately, it received a pretty chilly response.
An approach based on well-known symbols (e.g. `Symbol.transfer`, `Symbol.clone`, `Symbol.deserialize`) would be useful, where v8 would recognize that objects implementing these should be passed to the serializer/deserializer delegate. There's some complexity on the deserialization end around how to make the deserializer aware of what classes it needs to handle, but that's largely implementation detail.
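A purely hypothetical sketch of how that could look for a user-defined class; none of these symbols exist today, so local stand-ins are used:

```js
const kTransferSym = Symbol('Symbol.transfer');       // stand-in for a future Symbol.transfer
const kDeserializeSym = Symbol('Symbol.deserialize'); // stand-in for a future Symbol.deserialize

class Channel {
  constructor(id) { this.id = id; }

  // The engine would call this when the object is cloned/transferred...
  [kTransferSym]() {
    return { id: this.id };
  }

  // ...and this on the receiving side to rebuild an instance.
  static [kDeserializeSym](data) {
    return new Channel(data.id);
  }
}
```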
Unfortunately, without these changes at either the standards (TC39, WHATWG) level or the v8 level, `JSTransferable` will be a necessary evil.
I don't think we need to do anything at the standards level. If v8 just had an API to tag constructors with a private symbol or similar, it could forward them back to `WriteHostObject` when it sees them, and then we could do whatever we want with them.
Yeah, doing it entirely within v8 is possible. My concern there would be interop. I would much prefer the mechanism to be standardized so that there's a better hope that it would work consistently across runtimes.
Within v8, the key limitation is that, inside v8's implementation, only C++-backed objects can be identified as host objects, and only host objects are passed into the delegate to be handled; and since a host object can be nested arbitrarily deep in object graphs, we have to depend on v8 identifying them as it walks the graph being serialized. If we can start there, teaching v8 to recognize when a pure JavaScript object is to be treated as a host object, then we've gone a long way towards solving the problem.
I've made a small benchmark comparing `undici.fetch` with `node-fetch`. It fetches metadata for 1500 packages from npm, and I find undici to be consistently slower. Average run:
$ node bench.js
undici: 3522ms
node-fetch: 2492ms
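For reference, a hedged sketch of that kind of comparison (shortened package list; the real benchmark fetched metadata for 1500 packages, and absolute numbers depend heavily on the registry and network):

```js
const { fetch: undiciFetch } = require('undici');

const pkgs = ['react', 'lodash', 'express']; // ...1500 names in the real run

async function run(name, fetchImpl) {
  const start = Date.now();
  await Promise.all(pkgs.map(async (p) => {
    const res = await fetchImpl(`https://registry.npmjs.org/${p}`);
    await res.json(); // always consume the body
  }));
  console.log(`${name}: ${Date.now() - start}ms`);
}

(async () => {
  const { default: nodeFetch } = await import('node-fetch'); // node-fetch v3 is ESM-only
  await run('undici', undiciFetch);
  await run('node-fetch', nodeFetch);
})();
```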
Hey!
So, I did an investigation of `undici.fetch` after https://github.com/nodejs/node/pull/44048, and basically, we are still slow.
As a summary: all the points raised in the first analysis are still valid (WebStreams, AbortController, TCP retransmission). Apart from WebStreams, the next issue for fetch seems to be memory usage; garbage collection takes a considerable amount of time.
Further details in: https://nova-option-465.notion.site/Undici-Fetch-Performance-Analysis-v2-81113b868be44b7a8f4dad03372db1db
I wonder if the AbortController creation is really necessary when `signal` is not provided by the user. Seems wasteful to create it when it's never used.
Generally I thought `AbortController` is meant to always be created in user code, not internally inside a `fetch` implementation.
Generally I thought AbortController is meant to always be created in user code, not internally inside a fetch implementation.
An abort signal is created whether you want it or not; simple web test: `new Request('/').signal`
An abort signal is created whether you want it or not; simple web test: `new Request('/').signal`
It's possible to only create a signal when accessed, since signal is a getter. I'm unsure if this would improve performance though.
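A minimal sketch of that lazy-getter idea (a hypothetical Request-like class, not undici's actual implementation): the `AbortController` is only allocated on first access of `.signal`.

```js
class LazySignalRequest {
  #controller = null;

  get signal() {
    this.#controller ??= new AbortController();   // created lazily, only if someone asks
    return this.#controller.signal;
  }
}

const req = new LazySignalRequest();
// No AbortController exists until .signal is read:
console.log(req.signal instanceof AbortSignal); // true
```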
There isn't much we can do with it; it's public and used directly below where we create it. I assume this is also the case for most structures in the code.
There does seem to be a performance improvement from v18.10.0 -> v18.11.0, which is probably attributable to the AbortController changes.
v18.11.0: (benchmark graph)
v18.10.0: (benchmark graph)
I updated my unscientific fetch benchmarks for parallel fetch, and I think the situation has slightly improved for undici. I'm now also benchmarking Deno's `fetch`, and it does seem supremely fast for this specific use case (many parallel requests):
axios: 2564ms
node-fetch: 5142ms
undici: 3590ms
deno: 1761ms
You might want to add `undici.request()`, which should hopefully be in the same bracket as Deno.
Most of `undici.fetch`'s overhead comes from the Node.js implementation of WHATWG Streams.
You might want to add `undici.request()`, which should hopefully be in the same bracket as Deno. Most of `undici.fetch`'s overhead comes from the Node.js implementation of WHATWG Streams.
@mcollina I've tried this; it's still slower than deno and bun:
root@:~# node node.js
request-time: 299.406ms
Status code: 200
node version: v19.4.0
root@:~# bun bun.js
[229.99ms] request-time
Status code: 200
bun version: 0.5.0
root@:~# deno run --allow-net deno.js
request-time: 226ms
Status code: 200
deno version: 1.29.4
root@:~# ./go
Time taken: 242 ms
Status code: 200
@mcollina I've tried this; it's still slower than deno and bun:
How are you doing it with `undici.request`? Do you have the source for that?
@mcollina I've tried this; it's still slower than deno and bun:
How are you doing it with `undici.request`? Do you have the source for that?
const { request } = require('undici')

const url = 'url';
console.time('request-time');
request(url)
  .then(res => {
    console.timeEnd('request-time');
    console.log(`Status code: ${res.statusCode}`); // undici's request() exposes statusCode, not status
  });
You are not reading the body there. Also, it might be a good idea to enable pipelining (which is disabled by default).
You are not reading the body there. Also, it might be a good idea to enable pipelining (which is disabled by default).
I'm not reading the body in any of the tests above, just to keep the test focused on the network side. I will try to enable pipelining.
I'm not reading the body in any of the tests above, just to keep the test focused on the network side.
If you don't read the body then sockets won't be re-used. You must always read the body even if it's empty.
Yes, reading the body is important. Without it, I have observed odd behaviour where requests start to hang indefinitely (which might be a bug in itself). I suggest using `.text()` for minimal parser impact.
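A minimal sketch of the suggested pattern (placeholder URL; not anyone's exact test): use `undici.request`, read `statusCode`, and always drain the body so the socket can be reused.

```js
const { request } = require('undici');

async function timedGet(url) {
  console.time('request-time');
  const { statusCode, body } = await request(url);
  await body.text();                 // drain the body, even if you don't need it
  console.timeEnd('request-time');
  return statusCode;
}

timedGet('https://example.com/').then((code) => console.log(`Status code: ${code}`));
```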
axios: 3987ms
node-fetch: 3392ms
undici-fetch: 2178ms
deno: 1952ms
bun: 2363ms
@ronag feel free to PR `undici request`, but please keep `undici fetch` :)
With pipelining:
const { setGlobalDispatcher, Agent } = require('undici')

setGlobalDispatcher(new Agent({
  pipelining: 3
}))
axios: 3511ms
node-fetch: 3111ms
undici-fetch: 2411ms
deno: 3086ms
bun: 2373ms
I'm not sure using `registry.npmjs.org` as a benchmark endpoint is optimal for accurate benchmarking :)
Without it, I have observed odd behaviour where requests start to hang indefinitely (which might be a bug in itself)
It's not a bug. Consuming the body is a requirement.
I'm not sure using registry.npmjs.org as a benchmark endpoint is optimal for accurate benchmarking :)
Yeah it's definitely not ideal. I think they also rate-limit parallel connections. I have observed ECONNRESET when bumping the concurrency too high :)
@RafaelGSS TCP Retransmission might be related to tcpNoDelay. Have you tried disabling that?
Waiting for 19.5.0 to test the fetch results; just saw @RafaelGSS's PR 👍
@RafaelGSS TCP Retransmission might be related to tcpNoDelay. Have you tried disabling that?
That's a good point. I haven't.
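If someone wants to experiment with that, a small sketch (assuming a plain `node:http` benchmark server, which is an assumption on my part): toggle Nagle's algorithm on accepted sockets and see whether the retransmission picture changes.

```js
const http = require('node:http');

const server = http.createServer((req, res) => res.end('ok'));

// Depending on the Node version, TCP_NODELAY may already be enabled by default;
// setNoDelay(false) turns Nagle's algorithm back on for this experiment.
server.on('connection', (socket) => socket.setNoDelay(false));

server.listen(3001);
```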
Using the v20 nightly build, I can see no difference between Deno and Node fetch; in fact, it looks better than Deno... but bun is slightly faster. Good job, guys!
fetch seems to have gotten much slower in v19.7.0 compared to v19.6.1
v19.6.1:
[bench:run] │ Tests │ Samples │ Result │ Tolerance │ Difference with slowest │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - fetch │ 1 │ 766.77 req/sec │ ± 0.00 % │ - │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - no keepalive │ 1 │ 1345.28 req/sec │ ± 0.00 % │ + 75.45 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - keepalive │ 1 │ 2246.10 req/sec │ ± 0.00 % │ + 192.93 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - request │ 1 │ 2372.56 req/sec │ ± 0.00 % │ + 209.42 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - pipeline │ 1 │ 2522.22 req/sec │ ± 0.00 % │ + 228.94 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - stream │ 1 │ 3170.79 req/sec │ ± 0.00 % │ + 313.52 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - dispatch │ 1 │ 3273.41 req/sec │ ± 0.00 % │ + 326.91 % │
v19.7.0:
[bench:run] │ Tests │ Samples │ Result │ Tolerance │ Difference with slowest │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - fetch │ 1 │ 564.40 req/sec │ ± 0.00 % │ - │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - no keepalive │ 1 │ 1295.50 req/sec │ ± 0.00 % │ + 129.54 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - keepalive │ 1 │ 1959.11 req/sec │ ± 0.00 % │ + 247.12 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - request │ 1 │ 2297.02 req/sec │ ± 0.00 % │ + 306.99 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - pipeline │ 1 │ 2624.22 req/sec │ ± 0.00 % │ + 364.96 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - stream │ 1 │ 2977.49 req/sec │ ± 0.00 % │ + 427.55 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - dispatch │ 1 │ 3473.19 req/sec │ ± 0.00 % │ + 515.38 % │
cc @anonrig
I don't think this is related to URL, but related to stream performance.
The ReadableStream/WritableStream abort-signal change (https://github.com/nodejs/node/pull/46273) might have caused this regression. cc @debadree25 @ronag
This would solve...
Undici fetch isn't as fast as the ones in browsers like Firefox / Deno / Workers.
The implementation should look like...
Measure and improve (e2e latencies, op/sec, conns/sec, etc?) to match up with other impls.
I have also considered...
I hand-rolled an http2 nodejs stdlib client and it was an order of magnitude faster than undici fetch, even without connection pooling. I must note, though, that the conns were to Cloudflare and Google endpoints, which responded with 1KB or less per fetch (basically, `POST` and uncached `GET` DNS over HTTPS connections).

Additional context
This is a follow up from: https://news.ycombinator.com/item?id=30164925
Sorry that I didn't really benchmark the difference between Deno / Firefox / Undici / Cloudflare Workers, but I knew instantly that it was the cause of higher latencies when I deployed the same code in Node v16 LTS.
Possibly related: #22, #208, #952, #902
Other projects: szmarczak/http2-wrapper/issues/71