nodejs / performance

Node.js team focusing on performance
MIT License

Buffer/Uint8Array using lots of memory #173

Closed by SeanReece 1 week ago

SeanReece commented 3 weeks ago

Tested in Node.js versions: v22.3.0, v20.14.0, v18.20.3, v16.20.2

It appears as though Buffer/Uint8Array consumes much more memory than I would expect. This is particularly obvious with many small instances.

For example:

const data = new Uint8Array(12) // I would expect this to consume ~12 bytes

It appears to have a shallow size of 96 bytes and a retained size of 196 bytes.

I'm not sure if this is a V8 issue, but I see similar behaviour in Chrome 126, although it uses slightly less memory there.

Why this is an issue

I stumbled on this while trying to profile memory issues while pulling large amounts of MongoDB documents into memory, even projecting the documents to just return 2 ObjectIds each (we're building potentially large graphs in memory from the links).

A BSON ObjectId is 12 bytes, so we estimated ~24MB per million edges (maybe a bit more for object overhead etc.). In reality this uses almost 500MB.

At first I thought this was an issue with BSON's implementation but this can be recreated using Uint8Array directly.

Try it out

const arr = []

const heapBefore = process.memoryUsage().heapUsed
for (let i = 0; i < 2000000; i++) {
  arr.push(new Uint8Array(12)) // Same with Buffer.alloc(12)
}
const heapAfter = process.memoryUsage().heapUsed   // Not super accurate but illustrates the issue
const size = Math.round((heapAfter - heapBefore) / 1024 / 1024)
console.log(`Used ${size}MB to store ${arr.length} Uint8Array(12)`)
// Used 473MB to store 2000000 Uint8Array(12)

The memory used doesn't appear to increase much with the size of the Uint8Array. Doubling the size of each Uint8Array from 12 -> 24 only increases the memory usage to 485MB in the above test. This tells me the cost is probably overhead in the data structure itself rather than some data being duplicated.
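To make the per-instance cost explicit, the two figures quoted above (473MB and 485MB for 2,000,000 instances) can be turned into approximate bytes per instance. This is only arithmetic on the reported numbers, treating "MB" as MiB because the test divides by 1024 twice; the helper name `bytesPerInstance` is made up for illustration.

```javascript
// Convert the totals reported by the test into approximate bytes per instance.
// The test divides heapUsed by 1024 * 1024, so its "MB" are MiB.
function bytesPerInstance(totalMiB, count) {
  return (totalMiB * 1024 * 1024) / count
}

const count = 2_000_000
const at12 = bytesPerInstance(473, count) // run with Uint8Array(12)
const at24 = bytesPerInstance(485, count) // run with Uint8Array(24)

console.log(`~${Math.round(at12)} bytes per Uint8Array(12)`) // ~248
console.log(`~${Math.round(at24)} bytes per Uint8Array(24)`) // ~254
console.log(`~${Math.round(at12 - 12)} bytes per instance are not payload`) // ~236
```

The numbers include the slot each instance takes in `arr` and whatever else the heap was doing at the time, so they are a rough upper bound rather than an exact object size.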

Curiously, when I try the same thing with Buffer.from(new Uint8Array(12)) it only outputs ~240MB. I assume this is because buffer doesn't keep a reference to something(?) and GC happens sometime before capturing heapUsed.

With Buffer.from(new Uint8Array(12)), each instance retains 100 bytes less 🤔

Thanks

Big thanks to the Node.js Performance Team in advance. You're doing amazing work 👍 Please let me know if this is an issue with V8 directly or if this is completely expected behaviour. It really caught me off guard.

lemire commented 3 weeks ago

const data = new Uint8Array(12) // I would expect this to consume ~12 bytes

I would not make this assumption. I would expect at least, say, 48 bytes and up to 256 bytes even for an empty Uint8Array instance.

Have a look at these blog posts:

Merely storing a single integer in a set in C++ can take 32 bytes !!!

There is just no way that creating a whole new Uint8Array instance is nearly free even if it were empty.

Now, if you create a sizeable Uint8Array (e.g., 128 bytes), you should expect the array buffers to grow by roughly 128 bytes, but even that figure leaves out the per-instance overhead.

Can you run the following code and tell me what you get?

const arr = [];
const unit = 128;
let count = 0; // total payload bytes requested so far
for (let i = 0; i < 10000; i++) {
  arr.push(new Uint8Array(unit));
  count += unit;
  // arrayBuffers: bytes allocated for ArrayBuffers (including Buffers)
  const { arrayBuffers } = process.memoryUsage();
  console.log(`${count} ${arrayBuffers} ${arrayBuffers / count}`);
}

I stumbled on this while trying to profile memory issues while pulling large amounts of MongoDB documents into memory, even projecting the documents to just return 2 ObjectIds each (we're building potentially large graphs in memory from the links). A BSON ObjectId is 12 bytes, so we estimated ~24MB per million edges (maybe a bit more for object overhead etc.). In reality this uses almost 500MB.

I would allocate a buffer new Uint8Array(24000000) and then store my ObjectIds at index 0, 12, 24, ...
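A minimal sketch of that layout, assuming fixed 12-byte ids and a capacity known up front. The class name `PackedIdStore` and its methods are hypothetical, not part of Node.js or the bson package.

```javascript
const ID_BYTES = 12 // size of one BSON ObjectId

// Fixed-capacity store: id number i lives at bytes [i * 12, i * 12 + 12).
class PackedIdStore {
  constructor(capacity) {
    this.bytes = new Uint8Array(capacity * ID_BYTES)
    this.length = 0
  }

  // Copies the 12 bytes into the next free slot and returns the slot index.
  push(id) {
    if (id.length !== ID_BYTES) throw new RangeError('expected 12 bytes')
    if ((this.length + 1) * ID_BYTES > this.bytes.length) {
      throw new RangeError('store is full')
    }
    this.bytes.set(id, this.length * ID_BYTES)
    return this.length++
  }

  // Returns a view over the slot (no copy). Each view is itself a TypedArray
  // object, so create views on demand instead of keeping millions of them.
  get(index) {
    const start = index * ID_BYTES
    return this.bytes.subarray(start, start + ID_BYTES)
  }
}

const store = new PackedIdStore(2_000_000)
const slot = store.push(Uint8Array.from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]))
console.log(slot, store.get(slot), store.bytes.byteLength)
```

Edges can then refer to ids by slot index (a plain number) instead of holding one TypedArray per id, which is where the per-instance overhead goes away.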

H4ad commented 3 weeks ago

Buffer.from uses an internal pool to avoid allocating many small buffers; maybe this helps reduce the memory allocation a little.
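One way to check whether a given Buffer is backed by the shared pool is to look at the size of its underlying ArrayBuffer. The sketch below only uses documented Buffer properties; whether `Buffer.from(typedArray)` goes through the pool may depend on the Node.js version, so that case is printed rather than assumed.

```javascript
// Buffer.alloc() always gets its own ArrayBuffer of exactly the requested size.
const own = Buffer.alloc(12)
console.log(own.buffer.byteLength) // 12

// Pooled allocations are views into a larger shared ArrayBuffer,
// whose size is normally Buffer.poolSize.
const pooled = Buffer.allocUnsafe(12)
console.log(pooled.buffer.byteLength, Buffer.poolSize)

// Print the backing size and offset for the case discussed in this thread.
const copied = Buffer.from(new Uint8Array(12))
console.log(copied.buffer.byteLength, copied.byteOffset)
```

If the last line prints a backing size much larger than 12, the instance shares an ArrayBuffer with other small Buffers instead of owning one, which would fit the lower per-instance figure reported above.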

SeanReece commented 1 week ago

Thanks for the info @lemire. You're correct that there seems to be lots of overhead for each TypedArray created, and creating a single large typed array really does only consume the memory I was expecting.

I've been doing some digging and found this interesting explanation from a V8 developer:

https://stackoverflow.com/questions/45803829/memory-overhead-of-typed-arrays-vs-strings/45808835#45808835

I also tried the same with ArrayBuffers + DataView with very slightly better memory efficiency. But that is somewhat moot since ObjectIds can be represented as a 24 character hex string, which only consumes 40 bytes in V8, which is much better than Buffer consuming 96 bytes to represent the same raw 12 bytes.
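For reference, converting between the raw 12 bytes and the 24-character hex form only needs the built-in Buffer encodings. The byte values below are made up for illustration.

```javascript
// Hypothetical raw 12-byte id.
const raw = Buffer.from([0x66, 0x6b, 0x1c, 0x2a, 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77])

const hex = raw.toString('hex')      // 24 hex characters
const back = Buffer.from(hex, 'hex') // back to 12 raw bytes

console.log(hex, hex.length, back.equals(raw))
// 666b1c2a0011223344556677 24 true
```

Strings also compare by value, so the hex form can be used directly as a Map or Set key, which a Buffer cannot.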

I would allocate a buffer new Uint8Array(24000000) and then store my ObjectIds at index 0, 12, 24, ...

I don't really have much control over this in our implementation since bson is instantiating lots of Buffers under the hood.

Do you know of any good libraries for managing disparate data within a large arrayBuffer? There's some complexity around removing unused elements and redistributing the available space.

Thanks again for your insight here. I think we can close this since it does not seem to be a Node.js issue directly.

lemire commented 1 week ago

I don't really have much control over this in our implementation since bson is instantiating lots of Buffers under the hood.

You can grab the returned buffer and copy it to your own larger buffer.
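A rough sketch of that copy step, assuming every incoming Buffer is exactly 12 bytes. `IdPool` is a hypothetical helper, and the `Buffer.alloc(12, ...)` calls stand in for whatever small Buffers the driver returns.

```javascript
const ID_BYTES = 12

// Append-only pool that doubles its backing storage when it runs out of room.
class IdPool {
  constructor(initialCapacity = 1024) {
    this.bytes = Buffer.alloc(Math.max(1, initialCapacity) * ID_BYTES)
    this.count = 0
  }

  // Copies `source` (a 12-byte Buffer or Uint8Array) into the pool and returns
  // its index. The caller can then drop its reference to `source`.
  add(source) {
    const offset = this.count * ID_BYTES
    if (offset + ID_BYTES > this.bytes.length) {
      const grown = Buffer.alloc(this.bytes.length * 2)
      this.bytes.copy(grown)
      this.bytes = grown
    }
    this.bytes.set(source, offset)
    return this.count++
  }

  // Reads an id back as a hex string without creating an intermediate view.
  hexAt(index) {
    const offset = index * ID_BYTES
    return this.bytes.toString('hex', offset, offset + ID_BYTES)
  }
}

const pool = new IdPool(1)
// Stand-ins for the small Buffers a driver would hand back.
const a = pool.add(Buffer.alloc(12, 0xab))
const b = pool.add(Buffer.alloc(12, 0xcd))
console.log(a, b, pool.hexAt(b))
```

This does not handle removing ids or reusing freed slots, which is the harder part raised in the quoted question.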

There's some complexity around removing unused elements and redistributing the available space.

Your project does end up looking like you are trying to build your own custom database engine... which is unavoidably going to require some engineering effort.

joyeecheung commented 1 week ago

FWIW when I investigated https://github.com/nodejs/node/issues/53579 I noticed that even an empty array buffer in V8 takes 88 bytes, which is surprisingly big if you ask me. But that also has something to do with us not turning on pointer compression + V8 sandbox (otherwise it would've been ~44 bytes). Also not all the fields are strictly necessary for all array buffers but they are there in advance, or there should've been some clever ways to encode them to save space. But that could incur additional code complexity in V8 that makes it not worth it, and it's mostly a V8 issue.

lemire commented 1 week ago

@joyeecheung So an empty buffer is made of 11 pointers? That sounds like a lot.