Size difference between node and browser

Marius-Romanus commented 1 year ago

Hi, sizeof returns a different size for the same string in the browser and in Node.

Since this is a plain string, with no objects or anything unusual, I would expect the size to be the same, right?

Greetings!

console.log("node sizeof()", sizeof('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed ac vestibulum lacus, sit amet maximus libero. Aliquam erat volutpat. Quisque at orci tortor. Donec at mi nunc.'));
// node sizeof() 184

console.log("browser sizeof()", sizeof('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed ac vestibulum lacus, sit amet maximus libero. Aliquam erat volutpat. Quisque at orci tortor. Donec at mi nunc.'));
// browser sizeof() 342

miktam commented 1 year ago

OK, the difference is coming from here https://github.com/miktam/sizeof/blob/master/indexv2.js#L88

Node.js uses precise string calculation. Here is the PR https://github.com/miktam/sizeof/pull/80

The browser uses quite a simplistic approach, assuming that every string char is 2 bytes.
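As a rough illustration of that simplistic approach (a sketch only, not the library's actual code; `naiveStringSize` is a hypothetical name), counting 2 bytes per UTF-16 code unit looks like this:

```javascript
// Hypothetical helper for illustration; not taken from the sizeof source.
// Assumes every UTF-16 code unit of the string occupies 2 bytes.
function naiveStringSize(str) {
  return str.length * 2;
}

const shortSample = 'Lorem ipsum'; // 11 characters
console.log(naiveStringSize(shortSample)); // 22
```

Under this assumption an ASCII-only string always comes out at exactly twice its length, which would be consistent with the browser figure of 342 if the reported string is 171 characters long.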

To be precise in the browser environment, let me check if there is a difference between different VM implementations.

Marius-Romanus commented 1 year ago

Hello, I've been doing some research, and it seems that the best options are:

For Node: Buffer.byteLength(string);
For browser: (new TextEncoder().encode(string)).length;
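A minimal sketch of how the two options could be combined into one helper. The function name `utf8ByteLength` and the feature-detection strategy are assumptions for illustration, not what sizeof or string-byte-length actually do:

```javascript
// Hypothetical helper: returns the UTF-8 byte length of a string.
// Uses Buffer.byteLength when a Node-style Buffer global is present,
// otherwise falls back to TextEncoder (available in modern browsers).
function utf8ByteLength(str) {
  if (typeof Buffer !== 'undefined' && typeof Buffer.byteLength === 'function') {
    return Buffer.byteLength(str, 'utf8');
  }
  return new TextEncoder().encode(str).length;
}

console.log(utf8ByteLength('abc'));       // 3
console.log(utf8ByteLength('\u{1F600}')); // 4 (the simple emoji 😀)
```

Both branches measure the UTF-8 encoding, so the result should not depend on which environment runs the code.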

I think this library has a good approach: https://github.com/ehmicky/string-byte-length

What I don't know is which Node versions you need to support, since that library requires Node >=14.18.0.

Regarding TextEncoder it seems to have good compatibility: https://caniuse.com/?search=TextEncoder

For the example string I gave, both approaches return a size of 171, which matches neither of the current results (184 in Node, 342 in the browser).

A complex emoji such as 🏳️‍🌈 gives 14, and a simple emoji such as 😀 gives 4.

I also don't know whether the results differ for Cyrillic, Arabic, Chinese characters, etc.
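To get a feel for how other scripts behave, here is a small sketch that measures a few sample characters with TextEncoder. The sample characters and the `samples` and `sizes` names are chosen only for illustration:

```javascript
// Measure the UTF-8 size of one sample string per script.
const encoder = new TextEncoder();

const samples = {
  latin: 'a',
  cyrillic: '\u044F',      // я
  arabic: '\u0628',        // ب
  chinese: '\u4E2D',       // 中
  simpleEmoji: '\u{1F600}', // 😀
  // Rainbow flag: white flag + variation selector + zero-width joiner + rainbow
  flagEmoji: '\u{1F3F3}\uFE0F\u200D\u{1F308}',
};

const sizes = {};
for (const [name, text] of Object.entries(samples)) {
  sizes[name] = encoder.encode(text).length;
}

console.log(sizes);
// { latin: 1, cyrillic: 2, arabic: 2, chinese: 3, simpleEmoji: 4, flagEmoji: 14 }
```

The flag emoji is a sequence of four code points (4 + 3 + 3 + 4 bytes), which is why it comes out at 14 rather than 4.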

Greetings

miktam commented 1 year ago

@Marius-Romanus thank you for the investigation! The browser-based implementation seems useful; I added it here: https://github.com/miktam/sizeof/pull/83

Regarding the Node.js version, compatibility might be an issue, as you rightly noted. The current implementation gives similar results (184 in the current version vs 171).

Marius-Romanus commented 1 year ago

Hello, Buffer.byteLength has existed in Node since the earliest versions, but I think it has been modified many times, and I don't know the expected result in each version or what errors may occur: https://nodejs.org/docs/latest-v0.10.x/api/buffer.html#buffer_class_method_buffer_bytelength_string_encoding

You have probably already seen it, but here is the documentation (you can pass the encoding as a second argument): https://nodejs.org/dist/latest-v18.x/docs/api/buffer.html#static-method-bufferbytelengthstring-encoding

@ehmicky may have set that minimum version for some other reason, perhaps for ECMAScript module imports in Node. ;)
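As a short Node-only sketch of that encoding argument (the sample string and variable names are made up for illustration), the same string reports a different byte length per encoding:

```javascript
// Node.js only: Buffer is not available in browsers by default.
const word = 'h\u00E9llo'; // "héllo", 5 characters

const asUtf8 = Buffer.byteLength(word, 'utf8');     // é takes 2 bytes
const asUtf16 = Buffer.byteLength(word, 'utf16le'); // 2 bytes per code unit
const asLatin1 = Buffer.byteLength(word, 'latin1'); // 1 byte per character

console.log(asUtf8, asUtf16, asLatin1); // 6 10 5
```

When the second argument is omitted, Buffer.byteLength defaults to 'utf8'.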

Greetings

ehmicky commented 1 year ago

Hi everyone,

I am not completely sure I am answering your question correctly, but the reason this module does not support Node 12 is that Node 12 is no longer officially supported. Also, please note that official support for Node 14 will be dropped in 2 months.

The main advantage of using string-byte-length directly instead of inlining Buffer.byteLength(string) and (new TextEncoder().encode(string)).length is that this library switches between 3 different implementations depending on the platform and input size, in order to give the best performance (see benchmarks).

Also, I think you might want to distinguish UTF-8 and UTF-16 when discussing sizes. A string only has a specific byte size for a given encoding. As pointed out in your README, the JavaScript specification considers strings to be conceptually "somewhat" UTF-16, i.e. each character is 2 bytes. I say "somewhat" because surrogate characters (U+D800 to U+DFFF) and astral characters (U+10000 and above) are handled a little differently, and it depends on the JavaScript method being used.

However, in memory, over the network, or in a file, those strings are likely to be encoded in UTF-8, where each character can be 1, 2, 3 or 4 bytes long. string-byte-length gives the UTF-8 size, not the UTF-16 size, and so do Buffer.from() and new TextEncoder(). IMHO knowing the UTF-8 size is more useful than the UTF-16 size in most use cases.
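To make the 1-to-4-byte rule concrete, here is a hand-rolled sketch that walks a string code point by code point. Both function names are hypothetical, and this is not how string-byte-length is implemented; in practice the built-in APIs are the better choice:

```javascript
// Hypothetical helper: UTF-8 size computed from code point ranges.
function utf8LengthByCodePoint(str) {
  let bytes = 0;
  // for...of iterates by code point, so a surrogate pair counts once.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) bytes += 1;         // ASCII
    else if (cp < 0x800) bytes += 2;   // e.g. Cyrillic, Arabic
    else if (cp < 0x10000) bytes += 3; // rest of the BMP, e.g. Chinese
    else bytes += 4;                   // astral characters, e.g. emoji
  }
  return bytes;
}

// Hypothetical helper: UTF-16 size, 2 bytes per code unit.
function utf16Length(str) {
  return str.length * 2;
}

for (const s of ['a', '\u4E2D', '\u{1F600}']) {
  console.log(s, utf8LengthByCodePoint(s), utf16Length(s));
}
// a  1 2
// 中 3 2
// 😀 4 4
```

Neither encoding is always smaller: ASCII text is half the size in UTF-8, while a BMP character such as 中 takes 3 bytes in UTF-8 but only 2 in UTF-16.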

If you're interested in this topic, I wrote the following article which details the differences.

miktam commented 1 year ago

OK, the latest PR works in Node v12 but does not work in v10.

Let's see if this is the best we can have.

miktam / sizeof

Size difference between node and browser #82