fetch() performance parity

ignoramous commented 2 years ago

This would solve...

Undici fetch isn't as fast as the ones in browsers like Firefox / Deno / Workers.

The implementation should look like...

Measure and improve (e2e latencies, op/sec, conns/sec, etc?) to match up with other impls.

I have also considered...

I hand-rolled a http2 nodejs stdlib client and it was a magnitude faster than undici fetch, even without connection pooling.

I must note though, the conns were to Cloudflare and Google endpoints, which responded with 1KB or less per fetch (basically, POST and uncached GET DNS over HTTPS connections).

Additional context

This is a follow up from: https://news.ycombinator.com/item?id=30164925

Sorry that I didn't really benchmark the difference between Deno / Firefox / Undici / Cloudflare Workers, but I knew instantly that it was the cause of higher latencies when I deployed the same code in Node v16 LTS.

Possibly related: #22, #208, #952, #902

Other projects: szmarczak/http2-wrapper/issues/71

ronag commented 2 years ago

Do you have a reproducible benchmark we can use?

mcollina commented 2 years ago

Undici fetch isn't as fast as the ones in browsers like Firefox / Deno / Workers.

I took a look at the comment in hackernews and the comment was not including benchmarks either.

I'm prone to close this unless somebody would like to prepare a cross-runtime benchmark suite.

ignoramous commented 2 years ago

Fair. I'll take a stab at extracting out the parts from my server to demonstrate this. I have to note that the slowness I observed was in a production env (running in a VM in a container behind a network load balancer) that also sees a tonne of other IO.

ignoramous commented 2 years ago

I wrote up a tiny DoH load-tester with python and golang (don't ask) which I intend to publish as a github action by tomorrow. In the meanwhile, here's a run from my laptop:

deno/runs:3748
percentile	p10	p50	p75	p90	p95	p99	p99.9	p99.99	p100
milliseconds	280	296	309	336	411	694	889	898	929

undici/runs:3826
percentile	p10	p50	p75	p90	p95	p99	p99.9	p99.99	p100
milliseconds	1140	1666	2171	2811	3302	3941	4501	4540	4995

node:naive-h2/runs:3791
percentile	p10	p50	p75	p90	p95	p99	p99.9	p99.99	p100
milliseconds	1713	2067	2570	3276	3975	4269	4336	4530	4534

ignoramous commented 2 years ago

After fighting a bit with github-actions, a poor man's profiler on is finally up:

From the current run, here's Node:	node/runs:1200
percentile	p10	p50	p75	p90	p95	p99	p99.9	p99.99	p100
milliseconds	532	1215	1665	3359	3469	20953	21094	21125	21182

here's Deno:	deno/runs:6300
percentile	p10	p50	p75	p90	p95	p99	p99.9	p99.99	p100
milliseconds	87	97	104	111	116	127	165	184	189

These are measured from Date.now wrapping around the fetch call. Both local and remote caches are unlikely to interfere because every DoH query is of form <random-str>.dnsleaktest.com.

Anything in the serverless-dns code-base that may significantly interfere with timing fetch calls is disabled.

The profiler forwards as many DoH requests it can for 50s to 60s, after which it quits.

Deno's serveDoh doesn't slow down at all, and blazes through 6300+ requests. Node's serverHTTPS (+ undici), otoh, manages 1200+ requests. I must note that the final batch of 100 queries served after the ~1100th request, took 20s+ to complete.

Just a note: I haven't used Node's built-in perf-module to profile since that is un-importable on Deno (from what I can tell)... hand-rolled a minimal and inefficient perf-counter on my own, instead.

mcollina commented 2 years ago

I might be a bit lost on your codebase, but I could not find undici being used anywhere.

I would note that there are plenty of things happening on that module. How have you identified the problem in undici fetch implementation?

mcollina commented 2 years ago

If by any chance you are testing the http/2 implementation, consider that https://github.com/serverless-dns/serverless-dns/blob/b36a9cdb5786b78a7dd8e22a2f0a767e9e340ab1/src/plugins/dns-operation/dnsResolver.js#L398 is highly inefficient as you will be incurring in significant overhead of creating those http/2 connections. You have to use a connection pool.

ignoramous commented 2 years ago

I might be a bit lost on your codebase, but I could not find undici being used anywhere.

Dang, we moved to node-fetch (to see if it was any better) and didn't revert back to undici (which we now have).

Here's results from Node + undici run:

node/undici/runs:1300
percentile	p10	p50	p75	p90	p95	p99	p99.9	p99.99	p100
milliseconds	200	602	965	1121	3083	12267	20400	20407	20467

How have you identified the problem in undici fetch implementation?

What's measured is just this method dnsResolver.js:resolveDnsUpstream, which doesn't do much other than construct a Request and call cache-api (does not exist on Node, disabled either way on these "profiler" runs), fetch (over undici on Node), or doh2 (unused).

We then rolled our own DNS over UDP / DNS over TCP connection-pooled transport atop Node (dgram and net), and that showed considerable perf improvement (around 15x to 100x):

node/udp/runs:6600
percentile	p10	p50	p75	p90	p95	p99	p99.9	p99.99	p100
milliseconds	17	18	19	20	21	25	34	89	131

Of course, a standalone benchmark would be great to have, but I wanted to demonstrate the issue I spot with the kind of workloads we specifically run (lots and lots of DoH requests), and that, by bypassing undici / node-fetch polyfills (or running on cloudflare-workers or deno which bring their own fetch impl to the table) from the code-path meant apparent improvements in IO.

mcollina commented 2 years ago

I don't think anyone had performed an optimization of either fetch or node WHATWG Streams implementation, so this is a good point and there is likely a lot of margin there.

Have you tried using undici.request() instead?

Are there some instructions on how to run the benchmarks? Being able to get some good diagnostics would definitely help.

mcollina commented 2 years ago

Here is a bit of tests from our own benchmarks:

│ Tests               │ Samples │           Result │ Tolerance │ Difference with slowest │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - fetch      │      20 │  1028.31 req/sec │  ± 2.71 % │                       - │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ http - no keepalive │      10 │  3891.51 req/sec │  ± 2.00 % │              + 278.44 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - pipeline   │      95 │  6034.47 req/sec │  ± 2.95 % │              + 486.83 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ http - keepalive    │      50 │  6382.57 req/sec │  ± 2.98 % │              + 520.68 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - request    │      15 │  8528.35 req/sec │  ± 2.11 % │              + 729.35 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - stream     │      10 │ 10226.33 req/sec │  ± 2.66 % │              + 894.48 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - dispatch   │      50 │ 11399.31 req/sec │  ± 2.79 % │             + 1008.54 % │

There is likely a lot of room for improvements.

ignoramous commented 2 years ago

Have you tried using undici.request() instead?

I will do so over the coming days and report back.

Are there some instructions on how to run the benchmarks?

Apart from running the profiler.yml gh-action (of your own fork of serverless-dns), you could checkout the codebase, and then

# clone serverless-dns
git clone https://github.com/serverless-dns/serverless-dns.git

# download deps
npm i

# install q, a golang doh client
# https://github.com/natesales/q
echo "deb [trusted=yes] https://repo.natesales.net/apt /" | sudo tee /etc/apt/sources.list.d/natesales.list > /dev/null
sudo apt-get update
sudo apt-get install q

# export QDOH, where the 'run' script goes looking for `q`
export QDOH=q

# install node v15+ / deno v1.17+, if missing
...

# doh query, <random-str>.dnsleaktest.com, is sent to cloudflare-dns.com
# run doh, on node + undici
timeout 60s ./run node p1

# doh query, <random-str>.dnsleaktest.com, is sent to cloudflare-dns.com
# run doh, on deno
timeout 60s ./run deno p1

# dns query, <random-str>.dnsleaktest.com, is sent to 1.1.1.1
# run udp/tcp dns, on node
timeout 60s ./run node p3

Here is a bit of tests from our own benchmarks:

Thanks. Sorry, if this is a noob question, but: how do I generate such logs myself for my tests?

mcollina commented 2 years ago

Thanks. Sorry, if this is a noob question, but: how do I generate such logs myself for my tests?

Check out https://github.com/nodejs/undici/pull/1214 for the fetch benchmarks.

You can run our benchmarks by running:

PORT=3001 node benchmarks/server.js

PORT=3001 node benchmarks/benchmarks.js

RafaelGSS commented 2 years ago

Hello folks! Recently, I spent some time working on the fetch performance investigation and I decided to create a report about my findings. I'll share an abstract (probably the full report will become a blog post at some moment)

TL;DR:

After some analysis, I came up with a few PRs improving fetch performance directly:

However, the major bottleneck is under WebStreams implementation in Node.js core, thus, I don't think there is a clear path for undici.fetch to improve its performance significantly without touching Node.js internals.

Firstly, the benchmark was unfair with undici.fetch since it was allocating the server response in a variable causing an event-loop block due to GC runs. Then, I solved it in: https://github.com/nodejs/undici/pull/1283.

After several analyses, I have found that WebStreams is very slow (through microbenchmark and by fetch benchmark) - For reference, see https://github.com/RafaelGSS/nodejs-bench-operations/blob/main/RESULTS-v17.md#stream-readablejs. Thus, I did a comparison replacing WebStreams with native Streams and I got 60% ~ 90% of improvement.

Avoiding WebStreams in response

```jsx diff --git a/lib/fetch/index.js b/lib/fetch/index.js index 9047976..778ad82 100644 --- a/lib/fetch/index.js +++ b/lib/fetch/index.js @@ -50,6 +50,7 @@ const { const { kHeadersList } = require('../core/symbols') const EE = require('events') const { Readable, pipeline } = require('stream') +const Readable2 = require('../api/readable') const { isErrored, isReadable } = require('../core/util') const { dataURLProcessor } = require('./dataURL') const { kIsMockActive } = require('../mock/mock-symbols') @@ -964,7 +965,7 @@ async function fetchFinale (fetchParams, response) { }) // 4. Set response’s body to the result of piping response’s body through transformStream. - response.body = { stream: response.body.stream.pipeThrough(transformStream) } + // response.body = { stream: response.body.stream.pipeThrough(transformStream) } } // 6. If fetchParams’s process response consume body is non-null, then: @@ -1731,9 +1732,8 @@ async function httpNetworkFetch ( })() } + const { body, status, statusText, headersList } = await dispatch({ body: requestBody }) try { - const { body, status, statusText, headersList } = await dispatch({ body: requestBody }) - const iterator = body[Symbol.asyncIterator]() fetchParams.controller.next = () => iterator.next() @@ -1797,7 +1797,7 @@ async function httpNetworkFetch ( // 17. Run these steps, but abort when the ongoing fetch is terminated: // 1. Set response’s body to a new body whose stream is stream. - response.body = { stream } + response.body = { stream: body } // 2. If response is not a network error and request’s cache mode is // not "no-store", then update response in httpCache for request. @@ -1957,7 +1957,7 @@ async function httpNetworkFetch ( headers.append(key, val) } - this.body = new Readable({ read: resume }) + this.body = new Readable2(resume, this.abort, headers.get('content-type')) const decoders = [] ```

If you apply the above git diff and profile the application, it will show createAbortError as one of the functions wasting more time on the CPU.

And, these AbortSignal are mostly created in WebStreams instance (even if the instance is not in use), e.g:

So, avoiding the AbortSignal creation through WebStreams bumps, even more, the performance (150% ~ 200% -- based on the initial benchmarks)

Avoiding AbortSignal

```jsx diff --git a/lib/fetch/index.js b/lib/fetch/index.js index 1fbf29b..322e5ae 100644 --- a/lib/fetch/index.js +++ b/lib/fetch/index.js @@ -50,6 +50,7 @@ const { const { kHeadersList } = require('../core/symbols') const EE = require('events') const { Readable, pipeline } = require('stream') +const Readable2 = require('../api/readable') const { isErrored, isReadable } = require('../core/util') const { dataURLProcessor } = require('./dataURL') const { kIsMockActive } = require('../mock/mock-symbols') @@ -957,14 +958,14 @@ async function fetchFinale (fetchParams, response) { // 3. Set up transformStream with transformAlgorithm set to identityTransformAlgorithm // and flushAlgorithm set to processResponseEndOfBody. - const transformStream = new TransformStream({ - start () {}, - transform: identityTransformAlgorithm, - flush: processResponseEndOfBody - }) + // const transformStream = new TransformStream({ + // start () {}, + // transform: identityTransformAlgorithm, + // flush: processResponseEndOfBody + // }) // 4. Set response’s body to the result of piping response’s body through transformStream. - response.body = { stream: response.body.stream.pipeThrough(transformStream) } + // response.body = { stream: response.body.stream.pipeThrough(transformStream) } } // 6. If fetchParams’s process response consume body is non-null, then: @@ -1731,9 +1732,8 @@ async function httpNetworkFetch ( })() } + const { body, status, statusText, headersList } = await dispatch({ body: requestBody }) try { - const { body, status, statusText, headersList } = await dispatch({ body: requestBody }) - const iterator = body[Symbol.asyncIterator]() fetchParams.controller.next = () => iterator.next() @@ -1775,29 +1775,29 @@ async function httpNetworkFetch ( // 16. Set up stream with pullAlgorithm set to pullAlgorithm, // cancelAlgorithm set to cancelAlgorithm, highWaterMark set to // highWaterMark, and sizeAlgorithm set to sizeAlgorithm. - if (!ReadableStream) { - ReadableStream = require('stream/web').ReadableStream - } - - const stream = new ReadableStream( - { - async start (controller) { - fetchParams.controller.controller = controller - }, - async pull (controller) { - await pullAlgorithm(controller) - }, - async cancel (reason) { - await cancelAlgorithm(reason) - } - }, - { highWaterMark: 0 } - ) + // if (!ReadableStream) { + // ReadableStream = require('stream/web').ReadableStream + // } + + // const stream = new ReadableStream( + // { + // async start (controller) { + // fetchParams.controller.controller = controller + // }, + // async pull (controller) { + // await pullAlgorithm(controller) + // }, + // async cancel (reason) { + // await cancelAlgorithm(reason) + // } + // }, + // { highWaterMark: 0 } + // ) // 17. Run these steps, but abort when the ongoing fetch is terminated: // 1. Set response’s body to a new body whose stream is stream. - response.body = { stream } + response.body = { stream: body } // 2. If response is not a network error and request’s cache mode is // not "no-store", then update response in httpCache for request. @@ -1870,10 +1870,10 @@ async function httpNetworkFetch ( fetchParams.controller.controller.enqueue(new Uint8Array(bytes)) // 8. If stream is errored, then terminate the ongoing fetch. - if (isErrored(stream)) { - fetchParams.controller.terminate() - return - } + // if (isErrored(stream)) { + // fetchParams.controller.terminate() + // return + // } // 9. If stream doesn’t need more data ask the user agent to suspend // the ongoing fetch. @@ -1891,16 +1891,16 @@ async function httpNetworkFetch ( response.aborted = true // 2. If stream is readable, error stream with an "AbortError" DOMException. - if (isReadable(stream)) { - fetchParams.controller.controller.error(new AbortError()) - } + // if (isReadable(stream)) { + // fetchParams.controller.controller.error(new AbortError()) + // } } else { // 3. Otherwise, if stream is readable, error stream with a TypeError. - if (isReadable(stream)) { - fetchParams.controller.controller.error(new TypeError('terminated', { - cause: reason instanceof Error ? reason : undefined - })) - } + // if (isReadable(stream)) { + // fetchParams.controller.controller.error(new TypeError('terminated', { + // cause: reason instanceof Error ? reason : undefined + // })) + // } } // 4. If connection uses HTTP/2, then transmit an RST_STREAM frame. @@ -1957,7 +1957,7 @@ async function httpNetworkFetch ( headers.append(key, val) } - this.body = new Readable({ read: resume }) + this.body = new Readable2(resume, this.abort, headers.get('content-type')) const decoders = [] ```

Then, I came up with avoiding AbortController (that creates the AbortSignal under the hood), and impressively, it bumped the performance by 20%!

Avoiding AbortController

```jsx diff --git a/lib/fetch/request.js b/lib/fetch/request.js index 0f10e67..ae9c37c 100644 --- a/lib/fetch/request.js +++ b/lib/fetch/request.js @@ -29,9 +29,9 @@ let TransformStream const kInit = Symbol('init') -const requestFinalizer = new FinalizationRegistry(({ signal, abort }) => { - signal.removeEventListener('abort', abort) -}) +// const requestFinalizer = new FinalizationRegistry(({ signal, abort }) => { +// signal.removeEventListener('abort', abort) +// }) // https://fetch.spec.whatwg.org/#request-class class Request { @@ -355,7 +355,7 @@ class Request { // 28. Set this’s signal to a new AbortSignal object with this’s relevant // Realm. - const ac = new AbortController() + const ac = { signal: { addEventListener: () => {} } } this[kSignal] = ac.signal this[kSignal][kRealm] = this[kRealm] @@ -376,7 +376,7 @@ class Request { } else { const abort = () => ac.abort() signal.addEventListener('abort', abort, { once: true }) - requestFinalizer.register(this, { signal, abort }) + // requestFinalizer.register(this, { signal, abort }) } } @@ -741,7 +741,8 @@ class Request { clonedRequestObject[kHeaders][kRealm] = this[kHeaders][kRealm] // 4. Make clonedRequestObject’s signal follow this’s signal. - const ac = new AbortController() + const ac = { signal: { addEventListener: () => {}, abort: () => {} } } + // const ac = new AbortController() if (this.signal.aborted) { ac.abort() } else { ```

Additional Notes

In the first iteration, Clinic Doctor report says something is wrong with Active Handles

The Active Handles are dropping often, which is far from expected. For comparison reasons, look at the undici.request graph:

There is probably a bad configuration out there and I have found that undici.fetch is dropping many packages during the benchmark causing TCP RETRANSMISSION.

Basically, when a package is dropped in an active connection a re-transmission protocol is sent and when it happens frequently, the performance is affected.

TCP Retransmission using undici.fetch

root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
LADDR:LPORT               RADDR:RPORT               RETRANSMITS
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52390         99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52400         99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52398         99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52402         99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52414        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52424        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52412        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52392        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52420        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52396        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52422        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52404        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52410        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52428        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52418        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52408        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52426        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52416        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52406        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52394        101

For comparison reasons, looks the retransmissions using

http.request (keep alive)

root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
LADDR:LPORT               RADDR:RPORT               RETRANSMITS
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52526          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52532          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52506          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52514          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52524          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52512          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52518          1

undici.request

root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
EMPTY
LADDR:LPORT               RADDR:RPORT               RETRANSMITS

Unfortunately, I haven't found out why it's happening. I tried, I swear :P.

ronag commented 2 years ago

@benjamingr why is creating abort signals so expensive?

devsnek commented 2 years ago

I would guess it is because createAbortSignal+makeTransferable does a lot of prototype mutation stuff and a call into c++ which is basically the bane of js engine performance.

i think we could move the logic of createAbortSignal into the AbortSignal constructor, sorta like this:

const kDoThing = Symbol('kDoThing');
function createAbortSignal(aborted, reason) {
  return new AbortSignal(aborted, reason, kDoThing);
}

class AbortSignal extends EventTarget {
  constructor(aborted, reason, thing) {
    if (thing !== kDoThing) throw new ERR_ILLEGAL_CONSTRUCTOR();
    super();
    this[kAborted] = aborted;
    this[kReason] = reason;
  }
}

the makeTransferable part is harder cuz c++ but i feel like we could at least make it more static than what makeTransferable is doing, similar to what i did above. also seems weird to me that JSTransferable needs to exist at all, like we already have the api with kTransfer and kDeserialize and whatnot but I don't know all the specifics there.

ronag commented 2 years ago

@jasnell @addaleax

RafaelGSS commented 2 years ago

@devsnek I didn't mention the analysis I did in Node.js Webstreams (planning to open an issue in Node.js any time soon), but indeed, makeTransferable has something on it for sure.

jasnell commented 2 years ago

makeTransferable is an ugly hack, unfortunately. I'd love to find a way to handle it more elegantly but we are fairly strictly limited by the way the structured clone algorithm works and the v8 implementation of it. I have some ideas on how to make improvements there.

also seems weird to me that JSTransferable needs to exist at all, like we already have the api with kTransfer and kDeserialize and whatnot but I don't know all the specifics there.

The way it works is this:

The structured clone algorithm only works by default with ordinary JavaScript objects with enumerable properties. Things like JavaScript classes are not supported. In order to make objects like WritableStream and ReadableStream consistently transferable per the spec is to treat them as "Host Objects". The JSTransferable class is exactly how we do that. (A Host object is a JavaScript object that is backed by a C++ class.).

When a Host object is passed into the structured clone algorithm implementation, v8 will call out to the serialization/deserialization delegate implementation that Node.js provides. Node.js' delegate implementation will check to see what kind of Host object it is. If the Host object is just a native c++ class that extends from BaseObject, then the c++ class will have functions on it for handling the serialization/deserialization at the native layer. If the Host object is a JSTransferable, that means that the serialization/deserialization is handled by JavaScript functions on the JavaScript wrapper object -- specifically, the [kTransfer], [kClone], [kDeserialize] functions. The delegate will as the JSTransferable host object to serialize itself, and it will do so by calling out to the JavaScript functions..... tl;dr: JSTransferable is how we allow a JavaScript class to act as a Host object, and the kTransfer/kDeserialize do the actual work of that.

But, we have a problem here. Classes like WritableStream and ReadableStream are defined by the standards as not extending from any other class and there are Web Platform Tests that verify that the implementation matches the Web IDL definition. Specifically, we can't have class WritableStream extends JSTransferable because that would violate the standard definition of WritableStream.

The makeTransferable() utility is a hack to get around that. Instead of creating a WritableStream that extends from JSTransferable, makeTransferable() creates an instance of JSTransferable and uses the WritableStream instance as its prototype. It effectively inverts the inheritance (that is, creates a JSTransferable that extends from WritableStream). This is a bit slow for several reasons that should be fairly obvious. Performance-wise, it's not great. But it allows us to implement the streams spec correctly.

devsnek commented 2 years ago

Looking at this more, it seems like we just need to teach v8 how to recognize non-c++-backed "host" objects, and then we wouldn't need to add the JSTransferable superclass.

oop thank you for the writeup james

jasnell commented 2 years ago

Yes, it would be ideal if the structured clone algorithm provided a way to allow arbitrary classes to define how they would be serialized / deserialized. I've raised this conversation with the WHATWG (https://github.com/whatwg/html/issues/7428) before and, unfortunately, it received a pretty chilly response.

An approach based on well-known Symbols (e.g. Symbol.transfer, Symbol.clone, Symbol.deserialize) would be useful, where v8 would recognize that objects that implement these should be passed to the serializer/deserializer delegate. There's some complexity on the deserialization end around how to make the deserializer aware of what classes it needs to handle but that's largely implementation detail.

Unfortunately, without these changes at either the standard (tc-39, whatwg) level or v8 level, JSTransferable will be a necessary evil.

devsnek commented 2 years ago

I don't think we need to do anything at the standards level, if v8 just had an api to tag constructors with a private symbol or similar, it could forward them back to WriteHostObject when it sees them and then we could do whatever we want with them.

jasnell commented 2 years ago

Yeah, doing it entirely within v8 is possible. My concern there would be interop. I would much prefer the mechanism to be standardized so that there's a better hope that it would work consistently across runtimes.

jasnell commented 2 years ago

Within v8, the key limitation is that, inside v8's implementation, only C++-backed objects can be identified as host objects, and only host objects are passed into the delegate to be handled, and since a host object can be nested arbitrarily deep in object graphs, we have to depend on v8 identifying them as it walks the graph being serialized. If we can start there -- teaching v8 to recognize when a pure JavaScript object is to be treated as a host object -- then we've gone a long way towards solving the problem.

silverwind commented 2 years ago

I've made a small benchmark comparing undici.fetch with node-fetch. It fetches fetches metadata for 1500 packages from npm and I do find undici being consistently slower. Average run:

$ node bench.js
undici: 3522ms
node-fetch: 2492ms

RafaelGSS commented 2 years ago

Hey!

So, I did an investigation on undici.fetch after the https://github.com/nodejs/node/pull/44048 and basically, we are still slow.

As an abstract, all the points raised in the first analysis are still valid (WebStreams, AbortController, TCP Retransmission). Apart from WebStreams, it seems the next issue of fetch is memory usage. Garbage Collection uses a considerable amount of time.

Furthermore in: https://nova-option-465.notion.site/Undici-Fetch-Performance-Analysis-v2-81113b868be44b7a8f4dad03372db1db

silverwind commented 2 years ago

I wonder if the AbortController creation is really necessary when signal is not provided by the user. Seems wasteful to create it when it's never used.

Generally I thought AbortController is meant to always be created in user code, not internally inside a fetch implementation.

jimmywarting commented 2 years ago

Generally I thought AbortController is meant to always be created in user code, not internally inside a fetch implementation.

An abort signal is created weather you want it or not, simple web test new Request('/').signal

KhafraDev commented 1 year ago

An abort signal is created weather you want it or not, simple web test new Request('/').signal

It's possible to only create a signal when accessed, since signal is a getter. I'm unsure if this would improve performance though.

There isn't much we can do with it; it's public and used directly below where we create it. I assume this is also the case for most structures in the code.

There does seem to be a performance improvement from v18.10.0 -> v18.11.0, which is probably attributed to the AbortController changes.

v18.11.0:

``` ╔══════════════╤═════════╤════════════════╤═══════════╗ ║ Slower tests │ Samples │ Result │ Tolerance ║ ╟──────────────┼─────────┼────────────────┼───────────╢ ║ global fetch │ 3500 │ 3116.63 op/sec │ ± 0.98 % ║ ╟──────────────┼─────────┼────────────────┼───────────╢ ║ Fastest test │ Samples │ Result │ Tolerance ║ ╟──────────────┼─────────┼────────────────┼───────────╢ ║ undici fetch │ 3500 │ 3149.77 op/sec │ ± 0.99 % ║ ╚══════════════╧═════════╧════════════════╧═══════════╝ ```

v18.10.0:

``` ╔══════════════╤═════════╤════════════════╤═══════════╗ ║ Slower tests │ Samples │ Result │ Tolerance ║ ╟──────────────┼─────────┼────────────────┼───────────╢ ║ undici fetch │ 6000 │ 2911.89 op/sec │ ± 0.97 % ║ ╟──────────────┼─────────┼────────────────┼───────────╢ ║ Fastest test │ Samples │ Result │ Tolerance ║ ╟──────────────┼─────────┼────────────────┼───────────╢ ║ global fetch │ 3500 │ 2998.01 op/sec │ ± 0.90 % ║ ╚══════════════╧═════════╧════════════════╧═══════════╝ ```

silverwind commented 1 year ago

I updated my unscientific fetch benchmarks for parallel fetch and I think the situation has slightly improved for undici. I'm now also benchmarking deno's fetch and it does seem supremely fast for this specific use case (many parallel requests):

axios: 2564ms
node-fetch: 5142ms
undici: 3590ms
deno: 1761ms

mcollina commented 1 year ago

You might want to add undici.request(), which should hopefully be in the same bracket of Deno.

Most of undici.fetch overhead come from Node.js implementation of WHATWG Streams.

BSN4 commented 1 year ago

You might want to add undici.request(), which should hopefully be in the same bracket of Deno.

Most of undici.fetch overhead come from Node.js implementation of WHATWG Streams.

@mcollina I've tried this still slower than deno and bun :

root@:~# node node.js
request-time: 299.406ms
Status code: 200
node version: v19.4.0

root@:~# bun bun.js
[229.99ms] request-time
Status code: 200
bun version: 0.5.0

root@:~# deno run --allow-net deno.js
request-time: 226ms
Status code: 200
deno version: 1.29.4

root@:~# ./go
Time taken: 242 ms
Status code: 200