uNetworking / uWebSockets.js

μWebSockets for Node.js back-ends :metal:

Faster binary data writing to socket #894

Closed · streamich closed this 1 year ago

streamich commented 1 year ago

uWebSockets.js accepts ArrayBuffer and related JavaScript binary formats when writing to the socket, which is great, but it is not enough.

All data encoders pre-allocate more memory than needed, but the current API provides no efficient way to pass only a slice of the final encoded data; one has to create a new buffer-like object instead.

A simplified example:

const uint8 = new Uint8Array(1024);
const size = encodeMessage(uint8, message);
const output = uint8.subarray(0, size); // <--------- This is a problem, .subarray() is very slow.

res.end(output);

Instead, responses could be sped up substantially (and consume less memory) with an API like the following:

const uint8 = new Uint8Array(1024);
const size = encodeMessage(uint8, message);

res.end(uint8.buffer, 0, size);

where 0 is the offset and size is the length of the binary chunk to copy. (No call to .subarray(), which is very slow.)

Constructing an ArrayBuffer, Uint8Array, or Buffer is very, very slow in Node.js, and it has been getting slower with every release, at least since Node v14.
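
For a rough sense of that cost, a minimal micro-benchmark sketch (absolute numbers vary by hardware and Node version):

const buf = new Uint8Array(1024);
let sink; // keep results alive so V8 does not optimize the loops away

console.time('subarray');
for (let i = 0; i < 1e7; i++) sink = buf.subarray(0, 512);
console.timeEnd('subarray');

console.time('plain slice object');
for (let i = 0; i < 1e7; i++) sink = { buf, offset: 0, length: 512 };
console.timeEnd('plain slice object');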

I would like to propose an optimized API where the binary chunk can be specified using offset and length params, instead of creating a temporary ArrayBuffer or Uint8Array.

Below are three options for this new API.

Option 1

This would be the fastest option:

interface HttpResponse {
  endFast(buf: ArrayBuffer, offset: number, length: number): void;
  writeFast(buf: ArrayBuffer, offset: number, length: number): void;
}

The .end() and .write() are the most performance-critical methods, so .writeHeader() and .writeStatus() could be ignored here.

Option 2

This option is slower, but would integrate nicely with the existing API:

// Extend the existing union with Slice:
type RecognizedString = /* ...existing RecognizedString members... */ | Slice;

interface Slice {
  buf: ArrayBuffer;
  offset: number;
  length: number;
}

// or

type Slice = [buf: ArrayBuffer, offset: number, length: number];

Constructing a new instance of Slice is about 30x faster than creating a "slice" using ArrayBuffer or Uint8Array.

In TypeScript:

class Slice {
  constructor (public readonly buf: ArrayBuffer, public readonly offset: number, public readonly length: number) {}
}
const slice = new Slice(uint8.buffer, 0, size);

// or

type Slice = [buf: ArrayBuffer, offset: number, length: number];
const slice = [uint8.buffer, 0, size];

This option would also integrate well with the .writeHeader and .writeStatus methods.

uWebSockets could provide the Slice class, which V8 would hopefully optimize with a single stable hidden class:

import {Slice} from 'uWebSockets.js';

const slice = new Slice(uint8.buffer, 0, size);
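
A Slice could then be passed anywhere a RecognizedString is accepted today (hypothetical usage, since Slice is only proposed here):

res.end(slice);       // HttpResponse body
ws.send(slice, true); // WebSocket binary message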

Option 3

This option is similar to Option 2, but instead of extending the RecognizedString type, various existing methods would receive optional offset and length parameters:

interface HttpResponse {
  write(data: RecognizedString): number;
  write(data: RecognizedString, offset: number, length: number): number;

  end(body?: RecognizedString, closeConnection?: boolean, offset?: number, length?: number): HttpResponse;
}
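
Hypothetical usage of Option 3 (the offset and length parameters are proposed here, they are not part of the current API; encodeMessage is a stand-in encoder):

const uint8 = new Uint8Array(1024);
const size = encodeMessage(uint8, message); // fills the buffer, returns bytes written

res.write(uint8.buffer, 0, size);      // write only the encoded slice
res.end(uint8.buffer, false, 0, size); // or end with it, without closing the connection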

The same applies to the WebSocket instances.

streamich commented 1 year ago

Here is a simple benchmark, which shows how slow creating a buffer slice using buffer-like objects is:

[benchmark screenshot]


e3dio commented 1 year ago

Please add to your benchmark the new Node v20 ArrayBuffer resize method https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer/resize

streamich commented 1 year ago

Thanks @e3dio, good to know about ArrayBuffer.prototype.resize(); unfortunately, it is the worst-performing one:

[benchmark screenshot]

And it does not solve the problem, for the reasons I outline in a follow-up comment below.

Also, I just realized that this benchmark is not a fair one for the .resize() method, as it resizes the buffer "up" as well :)

streamich commented 1 year ago

Created a better benchmark for .resize(). In this benchmark the buffer is only resized down:

[benchmark screenshot]

Still, .resize() is about as slow as creating a new Uint8Array, and about an order of magnitude slower than creating a Slice.


But more importantly, even if .resize() were fast, it fails on these points:

  • That method assumes that the offset is always 0, which is not the case.
  • That method assumes that you own the whole ArrayBuffer, which is also not the case: the message encoder will share one large ArrayBuffer, encode multiple messages into it, and hand out the slices.

streamich commented 1 year ago

res.endFast(buf, offset, length) would be even faster than the new Slice() approach. Essentially, if we added just these two methods, it would solve 90% of the problem:

interface HttpResponse {
  endFast(buf: ArrayBuffer, offset: number, length: number): void;
  // Always 3 arguments, does not close connection.
}

interface WebSocket {
  sendFast(buf: ArrayBuffer, offset: number, length: number): void;
  // Always 3 arguments, binary message, default compression.
}
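
Hypothetical usage (endFast and sendFast are only proposed names, and encodeMessage is a stand-in encoder writing into a shared Uint8Array):

const size = encodeMessage(shared, message); // encode into the shared buffer
res.endFast(shared.buffer, 0, size);         // HTTP response, no temporary view
ws.sendFast(shared.buffer, 0, size);         // WebSocket binary message, default compression
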
uasan commented 1 year ago

You want to use the buffer as a shared buffer, but how do you deal with race conditions: while uWS reads your buffer, might you already be overwriting that buffer slice in the Node process?

For example (see the sketch after this list):

  1. The cork handler is called at the moment when uWS is ready to send data. You do not control this moment in time, and before the cork call occurs, you may overwrite the buffer slice with new data for other responses.
  2. Backpressure: when a socket write is slow, does uWS make a copy of the passed buffer or not? If not, then this is also a race on a buffer that is used for multiple responses.
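
To make the hazard concrete, a sketch (assuming a hypothetical zero-copy endFast() that stores a reference instead of copying, and a hypothetical encodeMessage encoder):

const shared = new Uint8Array(64 * 1024);

app.get('/a', (res, req) => {
  const size = encodeMessage(shared, responseA);
  res.endFast(shared.buffer, 0, size); // suppose uWS keeps only a reference
  // If the socket is backpressured, the actual write happens later.
  // Meanwhile the next request encodes responseB into the same `shared`
  // buffer, and responseA's bytes are overwritten before they are sent.
});
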
e3dio commented 1 year ago

Yes, creating a resizable ArrayBuffer is slow, and resize is also slow; that is disappointing. This is the fastest way I see, and I don't see it in your benchmark:

const buf = Buffer.allocUnsafe(1000); // initial fast buffer
const buf2 = buf.subarray(0,500); // zero copy buffer view of data

Update: new Uint8Array() is slightly faster for Buffer; see https://github.com/uNetworking/uWebSockets.js/issues/894#issuecomment-1529103753

uasan commented 1 year ago

The subarray method returns a reference to a buffer slice; it is not a copy of the buffer. As I wrote above, without solving the race problems you will not be able to use one buffer for many responses. I advise you to study the performance of the new method that returns a copy of the buffer:

Buffer.copyBytesFrom

e3dio commented 1 year ago

uWS.js send/end/write takes a Buffer, and nothing is mentioned about a Buffer view not working. I think uWS copies the data on the method call, so buf.subarray(), which is very fast, should work.

streamich commented 1 year ago

@e3dio .subarray() is very slow. It is almost identical to new Uint8Array(), and you can see the performance of that in the screenshots above.

streamich commented 1 year ago

@uasan yes, my understanding is that uWS makes a copy of the passed-in data. .copyBytesFrom is not going to help here; the whole idea is to copy bytes and allocate new buffer-like objects as little as possible.
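
For context, Buffer.copyBytesFrom (a recent Node.js addition) does the opposite of what is needed here; a quick illustration:

// Allocates a brand-new Buffer and copies `size` bytes into it,
// i.e. exactly the per-message allocation this issue tries to avoid:
const copy = Buffer.copyBytesFrom(uint8, 0, size);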

e3dio commented 1 year ago

.subarray() is very slow. It is almost identical to new Uint8Array()

I don't see subarray in your benchmark

streamich commented 1 year ago

@e3dio I have included .subarray():

[benchmark screenshot]
streamich commented 1 year ago

Creating a temporary Uint8Array just to represent a "slice" is a lot of unnecessary memory usage; see this StackOverflow answer for how many fields each Uint8Array instance holds.

In summary, each string consumes 5*8 = 40 bytes, each typed array consumes 26*8 = 208 bytes.

Essentially, creating a temporary Uint8Array using .subarray() or new Uint8Array() means an extra 208 bytes that need to be allocated and then immediately garbage collected.

In something like

response.endFast(uint8, offset, length);
socket.sendFast(uint8, offset, length);

there is no need to allocate the extra 208 bytes. (A WebSocket message itself could be far smaller than this 208-byte overhead.)

For 100K messages per second, that is an extra 20.8 MB/s of work for the V8 allocator and garbage collector.

streamich commented 1 year ago

Just to summarize my case:

  1. In a non-trivial, performance-conscious application there will be an encoder, which encodes responses into a binary format.
  2. No encoder will return you the exact ArrayBuffer of the message.
    • Typically, the message will be just some slice of a larger shared Uint8Array.
    • Or, if the encoder returns a dedicated ArrayBuffer just for your message, it will be oversized, because encoders allocate more space up front, as they don't know the size of the encoded data ahead of time.
  3. So, currently, there is no efficient way to return the exact "slice".
    • Currently the best option is .subarray(), which is very slow and allocates 208 bytes for every message.

uws.get('/rpc', (res, req) => {
  const result = rpc.exec(/* ... */);
  const slice = encoder.encode(result);

  // very slow:
  res.end(slice.buf.subarray(slice.start, slice.end));

  // would be nice to have:
  res.endFast(slice.buf, slice.start, slice.end - slice.start);
})
streamich commented 1 year ago

@e3dio I'm not sure I understand what you mean. Here, Buffer.prototype.subarray() is even 2x slower than Uint8Array.prototype.subarray():

[benchmark screenshot]

Also, Buffer is legacy; everyone is moving away from it to use Uint8Array instead. There is a myth that Buffer.allocUnsafe() is fast because it has a shared pool of 4 KB-sized Buffers, but I haven't been able to see any performance benefit from tapping into that pool.

streamich commented 1 year ago

Added a res.end() case to show the performance impact of res.end(new Slice(...)) versus res.end(buf, offset, length):

[benchmark screenshot]
ronag commented 1 year ago

If you need a Node Buffer, there is a little trick you can use:

const FastBuffer = Buffer[Symbol.species]

const buf2 = new FastBuffer(buf1.buffer, buf1.byteOffset + offset, length) // same as buf1.subarray(offset, offset + length)

Or something along those lines.
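
Wrapped as a helper, the trick might look like this (a sketch; Buffer[Symbol.species] exposing Node's internal FastBuffer is an implementation detail, not a documented API):

const FastBuffer = Buffer[Symbol.species];

// Behaves like buf.subarray(offset, offset + length), but skips the
// argument validation and species machinery that make .subarray() slow.
const fastSubarray = (buf, offset, length) =>
  new FastBuffer(buf.buffer, buf.byteOffset + offset, length);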

ronag commented 1 year ago

as compared to what the Node.js core developers have done

To be fair... this is a problem with Uint8Array, which Node doesn't implement; it's V8.

uasan commented 1 year ago

Using the cork handler is important for performance, which means you need to take care of slice immutability before calling cork, i.e. you will have to create temporary state such as semaphores, which in the end will load the GC more than the 208 bytes do )

uNetworkingAB commented 1 year ago

151 ms to create 10 million slices does not sound like a problem. Say that you will be capped at 300k req/sec at best; that would put the total overhead of slicing at 0.4%, if I did my math correctly. This is not a bottleneck, especially not considering how sluggish JavaScript is in comparison.

streamich commented 1 year ago

The math is correct: assuming 300K req/sec, V8 will allocate and deallocate ~60 MB/sec unnecessarily, and spend 0.45% of its time, about 4.5 ms of each second, blocked on creating slices.

That is assuming the developer uses the most efficient .subarray() method. If they use something less efficient, say .subarray() on some other prototype, or .resize() or .slice() on ArrayBuffer, or something else, it could easily be 2x (or more) slower, which would be 0.9%, almost 1%.

streamich commented 1 year ago

Would you accept a PR if I implemented something like the following?

interface WebSocket {
  sendSpecialOrSomethingLikeThat(buffer: ArrayBuffer, offset: number, length: number): void;
}
e3dio commented 1 year ago

@streamich here is a good benchmark, I think:

const iterations = 1e8;
const ab = new ArrayBuffer(1024 * 4);
const arr = new Uint8Array(ab);
const buf = Buffer.from(ab);

const bench = (name, fn) => {
  console.time(name);
  for (let i = 0; i < iterations; i++) fn(i % 1024);
  console.timeEnd(name);
};

bench('Uint8Array-ab', i => { new Uint8Array(ab, i, 1); });
bench('Uint8Array-arr', i => { new Uint8Array(arr.buffer, i + arr.byteOffset, 1); });
bench('Uint8Array-buf', i => { new Uint8Array(buf.buffer, i + buf.byteOffset, 1); });
bench('subarray-arr', i => { arr.subarray(i, i + 1); });
bench('subarray-buf', i => { buf.subarray(i, i + 1); });

Results:

Uint8Array-ab: 8s (cheating: did not start from Buffer or Uint8Array)
Uint8Array-arr: 20s
Uint8Array-buf: 20s
subarray-arr: 15s
subarray-buf: 23s

So if you start from a Buffer, new Uint8Array() is 15% faster than subarray() at 20s vs 23s. If you start from a Uint8Array, subarray() is about 33% faster than new Uint8Array() at 15s vs 20s. My encoding library uses Buffer, so I would use new Uint8Array().

uNetworkingAB commented 1 year ago

If we're talking an overhead of 0.4% and ~200 bytes per slice, I don't really see how this is even remotely an issue. It just sounds like a natural property of a scripting language. Scripting languages aren't zero-overhead.

e3dio commented 1 year ago

The existing uWS.js API should be fine. Use subarray or new Uint8Array as described above to send uWS.js a zero-copy slice/view of the buffer.

uNetworkingAB commented 1 year ago

There is a distinction between what is in scope and what is not. We can't add bypasses for custom use cases such as slice-and-send, or extract-and-send, or receive-to-json, or anything like that; then we would just be adding a bunch of hacks to work around shortcomings of the language. Either accept the shortcomings of scripting, or swap to the C++ or C library if you need guaranteed lowest overhead.

0.4% is not even low-hanging fruit.

streamich commented 1 year ago

In my RPC, requests and responses have a schema, and the encoder is JIT-compiled according to that schema; it writes straight into a big shared Uint8Array, so there are almost no allocations. Overall, the encoding is almost 10x faster than Buffer.from(JSON.stringify(data)), which effectively makes this last 0.5% spent in .subarray() the slowest part of the whole encoding process, and that is what I wanted to eliminate.
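
For illustration, the pattern looks roughly like this (a sketch with hypothetical names, not the actual library code):

// One big shared buffer; the schema-compiled encoder writes into it
// and hands out only slice coordinates, so nothing is allocated per message.
const shared = new Uint8Array(1 << 20);

// Imagine this body was generated ("JIT compiled") from the message schema:
const encodeUser = (user, buf, offset) => {
  buf[offset++] = user.id & 0xff;
  // ... write the remaining schema fields ...
  return offset; // end position in the shared buffer
};

const encode = (user) => {
  const start = 0; // a real encoder would advance this cursor per message
  const end = encodeUser(user, shared, start);
  return { buf: shared, start, end }; // slice coordinates, no new buffer
};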

e3dio commented 1 year ago

I think you have effectively optimized your process to the maximum, and you are safe to move on to something else ;) Also, shameless plug: check out my encoding library for the smallest possible size, speed, and ease of use: https://github.com/e3dio/packBytes. Although I need to update the Buffer sizing mechanism; currently it calculates the exact buffer size needed, instead of estimating and sending a view as described in this issue. I need to update that to improve speed.

streamich commented 1 year ago

@e3dio thanks, yeah will take a look.

uNetworkingAB commented 1 year ago

In my RPC, requests and responses have a schema, and the encoder is JIT-compiled according to that schema; it writes straight into a big shared Uint8Array, so there are almost no allocations. Overall, the encoding is almost 10x faster than Buffer.from(JSON.stringify(data)), which effectively makes this last 0.5% spent in .subarray() the slowest part of the whole encoding process, and that is what I wanted to eliminate.

I totally salute your dedication, but JavaScript is not the right place to be if you care about 0.4% slicing overhead.

ronag commented 1 year ago

There is also the overhead of GC for all the temporaries, which might not be represented in the benchmarks...

uNetworkingAB commented 1 year ago

Typically, generational GCs do not promote young objects that die young (that is the GC's way of doing stack variables) to full-blown GC'd objects.