
Optimizing hashing performance #136

Closed fabiospampinato closed 4 months ago

fabiospampinato commented 7 months ago

What is the problem this feature will solve?

Making the hash functions significantly faster.

What is the feature you are proposing to solve the problem?

I'm not sure what the best option for this is, but running the following file:

import crypto from 'node:crypto';

const value = '/Users/fabio/path/to/some/random/file.js';

console.time ( 'crypto' );
for ( let i = 0; i < 100_000; i++ ) {
  crypto.createHash("sha1").update(value).digest("hex");
}
console.timeEnd ( 'crypto' );

I see the following output:

~ ❯ node hashbench.js
crypto: 106.557ms
~ ❯ node hashbench.js
crypto: 108.052ms
~ ❯ node hashbench.js
crypto: 107.642ms
~ ❯ node hashbench.js
crypto: 107.136ms
~ ❯ node hashbench.js
crypto: 107.747ms
~ ❯ bun hashbench.js
[29.98ms] crypto
~ ❯ bun hashbench.js
[33.87ms] crypto
~ ❯ bun hashbench.js
[30.74ms] crypto
~ ❯ bun hashbench.js
[30.78ms] crypto
~ ❯ bun hashbench.js
[30.12ms] crypto

Basically Node's sha1 function seems at least 3x slower than Bun's.

Hashing is at the core of many important things, so I'd argue it's important to hash as fast as possible since that would speed up a cascade of use cases.

I'm not sure what the best solution is here, but if Bun is 3x faster than Node here presumably there's a lot of room for improvement.

What alternatives have you considered?

No response

H4ad commented 7 months ago

I spent a few minutes trying to investigate this and here are some numbers.

Bun

[39.36ms] crypto

Node.js 16.20.1

crypto: 85.105ms

Node.js 18.18.1

crypto: 105.906ms

Node.js 20.9.0

crypto: 91.476ms

Node.js 21.1.0

crypto: 90.148ms


In Bun, the update function is handled by CryptoDigest.cpp.

Apparently, Bun uses OpenSSL 1.x; Node.js 16.x also uses that same version, and after 16.x Node.js started using OpenSSL 3.x.

Instead of EVP_* functions (code), Bun uses the SHA1 functions (code).

I updated the benchmark to:

const crypto = require('node:crypto');

const value = '/Users/fabio/path/to/some/random/file.js';
const hash = crypto.createHash("sha1");

const start = performance.now();
hash.update(value)
console.log(`End: ${performance.now() - start}`);

Bun: 0.0250780000000006ms
Node.js 21.1.0: 0.03838299959897995ms
Node.js 21.1.0 (with %OptimizeFunctionOnNextCall(hash.update)): 0.03241100162267685ms

So, even calling only update is slower than Bun, so maybe using SHA1_Update instead of EVP_DigestUpdate can help.

Uzlopak commented 7 months ago

But is it even advisable to use non-EVP calls?

Uzlopak commented 7 months ago

Btw., it is crazy how slow it is to SHA-256-HMAC a string. It is something I can't optimize in Probot, and it limits a Probot server to about 6k req/s.

H4ad commented 7 months ago

But is it even advisable to use non-EVP calls?

I have no idea. On the documentation page they deprecated some overloads of the SHA1 functions, but they are still supported if you change the arguments to pass context, data, and length.

So, I think it is safe to use.

Btw., it is crazy how slow it is to SHA-256-HMAC a string. It is something I can't optimize in Probot, and it limits a Probot server to about 6k req/s.

Once I cached all the strings and then performed only one update and digest, performance improved significantly.
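
Something along these lines (a hedged sketch; the chunked-webhook setup and helper names are made up for illustration):

const crypto = require('node:crypto');

// Calling update() once per chunk crosses into C++ for every chunk...
function signChunked(secret, chunks) {
  const hmac = crypto.createHmac('sha256', secret);
  for (const chunk of chunks) hmac.update(chunk);
  return hmac.digest('hex');
}

// ...whereas caching/concatenating the chunks first means only a single
// update() + digest() pair is needed. Both produce the same signature.
function signBuffered(secret, chunks) {
  return crypto.createHmac('sha256', secret).update(chunks.join('')).digest('hex');
}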

tniessen commented 7 months ago

Apparently, bun uses openssl 1.x, Node.js 16.x also uses that same version, after 16.x Node.js started using openssl 3.x

There are various known performance issues related to OpenSSL 3. It's possible that they negatively affect performance of hashing through EVP_* as well, but someone would have to benchmark the C/C++ code itself to be sure.

Instead of EVP_* functions (code), bun uses SHA1 Functions (code).

The EVP_* functions are recommended over low-level functions etc. The latter are gradually being deprecated and using them might break compatibility with OpenSSL providers/engines.


FWIW, I proposed crypto.hash() at some point (see https://github.com/nodejs/node/pull/42233), which would be faster than crypto.Hash for one-shot hashing. It might still be slower than an equivalent implementation using OpenSSL 1.1.1 and low-level OpenSSL APIs.
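
For comparison, a sketch of how the two shapes would be used (the one-shot signature is assumed here and may differ from the linked PR):

const crypto = require('node:crypto');

const value = '/Users/fabio/path/to/some/random/file.js';

// Object-based path: allocates a Hash object and makes several calls into C++.
const viaHashObject = crypto.createHash('sha1').update(value).digest('hex');

// One-shot path as proposed (assumed signature, commented out because the
// helper is not part of the released API in the context of this thread):
// const viaOneShot = crypto.hash('sha1', value, 'hex');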

ronag commented 7 months ago

Are we using OpenSSL 3.0 or 3.1?

Uzlopak commented 7 months ago

@ronag A fork of 3.0 with QUIC patches, not the original OpenSSL 3.0.

We are basically cut off from direct OpenSSL updates.

ronag commented 7 months ago

So basically we are stuck with this until we update OpenSSL.

ronag commented 7 months ago

Anyone considered switching to BoringSSL?

Uzlopak commented 7 months ago

I guess a PR to switch to BoringSSL would result in a lot of negative feedback. Also, BoringSSL does not guarantee a stable ABI.

https://boringssl.googlesource.com/boringssl/+/HEAD/PORTING.md#porting-from-openssl-to-boringssl

Note: BoringSSL does not have a stable API or ABI. It must be updated with its consumers. It is not suitable for, say, a system library in a traditional Linux distribution. For instance, Chromium statically links the specific revision of BoringSSL it was built against. Likewise, Android's system-internal copy of BoringSSL is not exposed by the NDK and must not be used by third-party applications.

E.g. @mcollina mentioned that a stable ABI is important for him, so I assume it is also important for others.

ronag commented 7 months ago

E.g. @mcollina mentioned that a stable ABI is important for him, so I assume it is also important for others.

Is this a theoretical issue or a practical one? Do they actually break the ABI that often? I think QUIC support is better in BoringSSL, so at the end of the day it might be easier to use?

targos commented 7 months ago

https://boringssl.googlesource.com/boringssl/+/HEAD/README.md

Although BoringSSL is an open source project, it is not intended for general use, as OpenSSL is. We don't recommend that third parties depend upon it. Doing so is likely to be frustrating because there are no guarantees of API or ABI stability.

tniessen commented 7 months ago

BoringSSL also lacks various features that users of Node.js might rely on and that we'd have to polyfill somehow, e.g., certain algorithms.

GeoffreyBooth commented 7 months ago

@nodejs/performance @nodejs/crypto

lemire commented 7 months ago

At a glance, this does not look like a hashing performance issue.

lemire commented 7 months ago

Try the following... Start with this script... (call it hash.mjs)


import crypto from 'node:crypto';

const value = '/Users/fabio/path/to/some/random/file.js';

console.time ( 'crypto' );
for ( let i = 0; i < 1000_000; i++ ) {
  crypto.createHash("sha1").update(value).digest("hex");
}
console.timeEnd ( 'crypto' );

(I picked the script as-is, but extended the loop to 1,000,000 repetitions.)

The conjecture is that the hashing is slow, and by that, presumably, we mean the update call. Ok. Let us trim the code... create a new script that does not call update... (call it bhash.mjs)


import crypto from 'node:crypto';

console.time ( 'crypto' );
for ( let i = 0; i < 1000_000; i++ ) {
  crypto.createHash("sha1");
}
console.timeEnd ( 'crypto' );

If hashing done by update is the bottleneck, this second script should be very fast compared to the previous one. Let us see... I use Node.js v20.10.0.

$ node hash.mjs 
crypto: 1.442s
$ node bhash.mjs 
crypto: 1.039s

Yep. It is not hashing. Only 0.4 s are spent on hashing. The bulk of the time is spent on crypto.createHash("sha1").

Let us try with Bun now...

$ bun hash.mjs 
[750.63ms] crypto
$ bun bhash.mjs 
[105.19ms] crypto

So bun spends 0.64 s on hashing... roughly as much as Node.js. The big difference between Node and Bun is that Bun is 10 times faster on the crypto.createHash("sha1") part.

So let us try something else... let us call crypto.createHash once:

import crypto from 'node:crypto';

const value = '/Users/fabio/path/to/some/random/file.js';

console.time ( 'crypto' );
const hasher = crypto.createHash("sha1");
for ( let i = 0; i < 1000_000; i++ ) {
  hasher.update(value);
}
console.timeEnd ( 'crypto' );

I get...

$ node chash.mjs 
crypto: 172.675ms
$ bun chash.mjs 
[170.99ms] crypto

Yep. So we have narrowed it down to crypto.createHash.

So let us profile the bhash.mjs script with Node:

   7.69%  node     libc.so.6             [.] pthread_rwlock_unlock@@GLIBC_2.34
   7.35%  node     libc.so.6             [.] malloc
   5.51%  node     libc.so.6             [.] pthread_rwlock_rdlock@GLIBC_2.2.5
   4.77%  node     node                  [.] Builtins_GetProperty
   4.65%  node     libc.so.6             [.] _int_free
   4.10%  node     libc.so.6             [.] _int_malloc

So we are spending time allocating memory and doing some threading. These are expensive jobs...

And now Bun...

   5.08%  bun          bun                   [.] 0x0000000002e52279
   4.26%  bun          bun                   [.] 0x0000000002e52897
   3.56%  bun          bun                   [.] 0x00000000030aabdf
   3.55%  bun          libc.so.6             [.] __memset_evex_unaligned_erms
   2.20%  bun          libc.so.6             [.] __strncasecmp_l_evex

My conclusion is that although the problem might come from OpenSSL and hashing... it does not appear to be so. As you can see, just Builtins_GetProperty alone shows up as a bottleneck.

Maybe we should get the advice of someone who understands deeply the C++ architecture: @joyeecheung

Note: I have not disproved that Bun could have an advantage when hashing. It probably does. I am just saying that on this particular benchmark, the processor I used (Linux+x64), with a small string, hashing is not the bottleneck. Node.js has lost the race before it has even begun.

joyeecheung commented 7 months ago

I did some profiling to see where the bottleneck is - it seems https://github.com/nodejs/performance/issues/136#issuecomment-1836891421 was based on only self-time, but it's still important to look at the call graph.

const { createHash } = require('node:crypto');

const array = [];
for ( let i = 0; i < 1000_000; i++ ) {
  array.push(null);
}
const start_time = performance.now();
for ( let i = 0; i < 1000_000; i++ ) {
  array[i] = createHash("sha1");
}
console.log(performance.now() - start_time);
console.log(array.length);

On main, the profile of the snippet above looks like this (simplified to only the important bits; the numbers are total time in the profiled process, and the % outside of these is mostly setup/teardown/GC stuff, or just samples with a slightly different stack)

--48.93%--node::crypto::Hash::New
  --22.40%--node::BaseObject::BaseObject
  --16.25%--node::crypto::Hash::HashInit
    --15.03%--evp_md_init_internal
  --5.34%--EVP_get_digestbyname
  --3.25%--node::Utf8Value::Utf8Value
  --4.96%--v8::internal::ApiNatives::InstantiateObject
    --3.55%--v8::internal::Heap::CollectGarbage
--7.65%--Builtins_CEntry_Return1_ArgvOnStack_NoBuiltinExit
  --5.46%--v8::internal::Runtime_AllocateInYoungGeneration
--3.82%--Builtins_InstanceOf

With the Oilpan prototype in https://github.com/nodejs/node/pull/51017 I get a somewhat different profile that looks like this

--40.03%--node::crypto::Hash::New
  --21.29%--node::crypto::Hash::HashInit
    --18.93%--evp_md_init_internal
  --7.04%--EVP_get_digestbyname
  --3.29%--node::SetCppgcReference
  --3.25%--node::Utf8Value::Utf8Value
  --8.54%--v8::internal::ApiNatives::InstantiateObject
    --6.49%--v8::internal::Heap::CollectGarbage
--6.83%--Builtins_InstanceOf
--4.40%--Builtins_CEntry_Return1_ArgvOnStack_NoBuiltinExit
  --4.23%--v8::internal::Runtime_AllocateInYoungGeneration

With a different allocation scheme it sheds off a chunk of the overhead, though there is still a lot to be improved:

$./node_main run.js
1642.1952390000001
1000000

$ out/Release/node run.js
1276.719182
1000000

$ bun run.js
233.12367799999998
1000000

I think another thing that can be done is caching the EVP_MD of known algorithms (not 100% sure if that's safe to do; maybe people from @nodejs/crypto know for sure). And maybe some JS voodoo can be done to create a fast path that avoids the instanceof overhead in argument validation.

joyeecheung commented 7 months ago

Actually regarding the OpenSSL overhead the solution (at least a partial solution) seems surprisingly simple - we are just under-initializing OpenSSL: https://github.com/nodejs/node/pull/51019

joyeecheung commented 7 months ago

Actually https://github.com/nodejs/performance/issues/136#issuecomment-1837269580 was incorrect - I was just looking at the numbers from the Oilpan prototype; initializing with OPENSSL_INIT_ADD_ALL_DIGESTS won't make a difference because EVP_get_digestbyname would also do that.

On another note, it may be a good idea to just move away from EVP_get_digestbyname if we are on OpenSSL 3. Locally switching to EVP_MD_fetch (but without proper lifecycle management so it leaks) can speed up createHash() by another 5-7% - though handling the life cycle of that ref-counted EVP_MD in the existing implementation can be somewhat tricky.

                                                                                confidence improvement accuracy (*)   (**)  (***)
crypto/webcrypto-digest.js n=100000 method='SHA-1' data=10 sync='createHash'           ***      5.65 %       ±1.21% ±1.62% ±2.11%
crypto/webcrypto-digest.js n=100000 method='SHA-1' data=100 sync='createHash'          ***      6.37 %       ±0.84% ±1.12% ±1.45%
crypto/webcrypto-digest.js n=100000 method='SHA-1' data=20 sync='createHash'           ***      5.71 %       ±1.10% ±1.47% ±1.91%
crypto/webcrypto-digest.js n=100000 method='SHA-1' data=50 sync='createHash'           ***      7.57 %       ±1.50% ±2.00% ±2.62%
crypto/webcrypto-digest.js n=100000 method='SHA-256' data=10 sync='createHash'         ***      6.56 %       ±1.15% ±1.52% ±1.98%
crypto/webcrypto-digest.js n=100000 method='SHA-256' data=100 sync='createHash'        ***      5.65 %       ±0.80% ±1.06% ±1.38%
crypto/webcrypto-digest.js n=100000 method='SHA-256' data=20 sync='createHash'         ***      6.73 %       ±0.98% ±1.31% ±1.70%
crypto/webcrypto-digest.js n=100000 method='SHA-256' data=50 sync='createHash'         ***      6.05 %       ±0.89% ±1.18% ±1.54%
crypto/webcrypto-digest.js n=100000 method='SHA-384' data=10 sync='createHash'         ***      6.86 %       ±1.15% ±1.53% ±2.00%
crypto/webcrypto-digest.js n=100000 method='SHA-384' data=100 sync='createHash'        ***      7.02 %       ±0.98% ±1.31% ±1.71%
crypto/webcrypto-digest.js n=100000 method='SHA-384' data=20 sync='createHash'         ***      7.28 %       ±0.92% ±1.22% ±1.58%
crypto/webcrypto-digest.js n=100000 method='SHA-384' data=50 sync='createHash'         ***      6.52 %       ±1.12% ±1.50% ±1.95%
crypto/webcrypto-digest.js n=100000 method='SHA-512' data=10 sync='createHash'         ***      5.86 %       ±1.11% ±1.48% ±1.92%
crypto/webcrypto-digest.js n=100000 method='SHA-512' data=100 sync='createHash'        ***      5.40 %       ±1.16% ±1.54% ±2.01%
crypto/webcrypto-digest.js n=100000 method='SHA-512' data=20 sync='createHash'         ***      6.29 %       ±1.07% ±1.43% ±1.87%
crypto/webcrypto-digest.js n=100000 method='SHA-512' data=50 sync='createHash'         ***      5.59 %       ±1.17% ±1.56% ±2.03%
Uzlopak commented 7 months ago

Don't the OpenSSL docs say something about avoiding lookup functions like EVP_get_digestbyname? I think I read something about this, along the lines that OpenSSL 3 is totally object oriented and digests are registered in an object at runtime. So calling something like EVP_get_digestbyname basically means OpenSSL goes through each name and checks whether it finds the proper digest.

joyeecheung commented 7 months ago

Yes, though I think we'll need a bit of internal refactoring to move to explicit fetching, because so far most of the crypto classes don't handle freeing the EVP_MD (there is also a somewhat weird (undocumented?) signature that allows the algorithm to be another Hash object; I am not sure what should happen in terms of life-cycle management in that case).

joyeecheung commented 7 months ago

About the InstanceOf - it seems to come mostly from the this instanceof Hash check, which is the pre-ES6 way of detecting construct calls. The post-ES6 way is new.target, which can be much faster (in V8 it can be just checking a register) - with https://github.com/nodejs/node/pull/51026 Builtins_InstanceOf no longer shows up in the profile.
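
Roughly, the difference between the two patterns (a simplified illustration, not the actual internal Hash constructor):

// Pre-ES6 pattern: detecting a construct call requires an instanceof check
// on every invocation.
function HashOld(algorithm) {
  if (!(this instanceof HashOld)) return new HashOld(algorithm);
  this.algorithm = algorithm;
}

// Post-ES6 pattern: new.target is undefined for plain calls, so detecting
// `new Hash()` vs `Hash()` does not need an instanceof lookup.
function HashNew(algorithm) {
  if (new.target === undefined) return new HashNew(algorithm);
  this.algorithm = algorithm;
}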

On a side note, I see that there are still a lot of this instanceof checks scattered around the code base. They can probably be dusted away too.

joyeecheung commented 7 months ago

I had another "surprisingly simple" (hopefully not just naive) idea: migrating to EVP_MD_fetch and caching the EVP_MD. With https://github.com/nodejs/node/pull/51034, https://github.com/nodejs/node/pull/51026 and https://github.com/nodejs/node/pull/51017 I am getting this from the createHash() microbenchmark pasted above:

$ out/Release/node run.js
790.332561
1000000

$ ./node_main run.js
1634.7148000000002
1000000

At this point the most significant bottleneck is garbage collection (specifically scavenge) from createHash() and this has more to do with how the benchmark is written. If I update the snippet to generate a digest and put it into the array, the other bottleneck is the string decoding part that converts input from a string to a buffer.

With a microbenchmark like this:

'use strict';
const { createHash } = require('node:crypto');
const { readFileSync } = require('node:fs');

const kHashCount = parseInt(process.env.HASH_COUNT) || 10000;
const kInputLength = parseInt(process.env.INPUT_LENGTH) || 100;
const kUpdateCount = parseInt(process.env.UPDATE_COUNT) || 1;

const array = [];
for ( let i = 0; i < kHashCount; i++ ) {
  array.push(null);
}
const input = readFileSync('./test/fixtures/snapshot/typescript.js').slice(0, kInputLength);

const start_time = performance.now();
for ( let i = 0; i < kHashCount; i++ ) {
  const hash = createHash('sha1');
  for (let j = 0; j < kUpdateCount; j++) {
    hash.update(input);
  }
  array[i] = hash.digest('hex');
}
console.log(performance.now() - start_time);
console.log(`${kHashCount} hashes, ${kUpdateCount} updates, input length ${input.length}, ${array.length} results`);

Using a buffer as input, there are only performance differences between my local branch, the main branch, and Bun when we create a lot of hashes - the bottleneck is the performance of createHash() as well as GC. When hashing a big chunk of input in one go, or hashing fewer, bigger streamed inputs, they perform similarly.

$ HASH_COUNT=1000000 INPUT_LENGTH=100 UPDATE_COUNT=1 out/Release/node hash.js
1268.3055590000001
1000000 hashes, 1 updates, input length 100, 1000000 results
$ HASH_COUNT=1000000 INPUT_LENGTH=100 UPDATE_COUNT=1 ./node_main hash.js
2013.5803560000002
1000000 hashes, 1 updates, input length 100, 1000000 results
$ HASH_COUNT=1000000 INPUT_LENGTH=100 UPDATE_COUNT=1 bun hash.js
713.9530219999999
1000000 hashes, 1 updates, input length 100, 1000000 results

$ HASH_COUNT=1000 INPUT_LENGTH=10000 UPDATE_COUNT=100 out/Release/node hash.js
960.9329419999999
1000 hashes, 100 updates, input length 10000, 1000 results
$ HASH_COUNT=1000 INPUT_LENGTH=10000 UPDATE_COUNT=100 ./node_main hash.js
968.0744119999999
1000 hashes, 100 updates, input length 10000, 1000 results
$ HASH_COUNT=1000 INPUT_LENGTH=10000 UPDATE_COUNT=100 bun hash.js
970.135489
1000 hashes, 100 updates, input length 10000, 1000 results

$ HASH_COUNT=1 INPUT_LENGTH=10000000 UPDATE_COUNT=1 out/Release/node hash.js
9.995244000000003
1 hashes, 1 updates, input length 10000000, 1 results
$ HASH_COUNT=1 INPUT_LENGTH=10000000 UPDATE_COUNT=1 ./node_main hash.js
9.976660999999996
1 hashes, 1 updates, input length 10000000, 1 results
$ HASH_COUNT=1 INPUT_LENGTH=10000000 UPDATE_COUNT=1 bun hash.js
9.789406
1 hashes, 1 updates, input length 10000000, 1 results

And if we add a .toString() to the input before slicing it, the local branch performs 1.2-1.6x worse than Bun; here string decoding performance becomes another bottleneck.

$ HASH_COUNT=1000000 INPUT_LENGTH=100 UPDATE_COUNT=1 out/Release/node hash.js
1306.387378
1000000 hashes, 1 updates, input length 100, 1000000 results
$ HASH_COUNT=1000000 INPUT_LENGTH=100 UPDATE_COUNT=1 ./node_main hash.js
2217.123724
1000000 hashes, 1 updates, input length 100, 1000000 results
$ HASH_COUNT=1000000 INPUT_LENGTH=100 UPDATE_COUNT=1 bun hash.js
885.439071
1000000 hashes, 1 updates, input length 100, 1000000 results

$ HASH_COUNT=1000 INPUT_LENGTH=10000 UPDATE_COUNT=100 out/Release/node hash.js
1819.652489
1000 hashes, 100 updates, input length 10000, 1000 results
$ HASH_COUNT=1000 INPUT_LENGTH=10000 UPDATE_COUNT=100 ./node_main hash.js
1804.612928
1000 hashes, 100 updates, input length 10000, 1000 results
$ HASH_COUNT=1000 INPUT_LENGTH=10000 UPDATE_COUNT=100 bun hash.js
1183.056887
1000 hashes, 100 updates, input length 10000, 1000 results

$ HASH_COUNT=1 INPUT_LENGTH=10000000 UPDATE_COUNT=1 out/Release/node hash.js
24.548640999999996
1 hashes, 1 updates, input length 10000000, 1 results
$ HASH_COUNT=1 INPUT_LENGTH=10000000 UPDATE_COUNT=1 ./node_main hash.js
24.163148999999997
1 hashes, 1 updates, input length 10000000, 1 results
$ HASH_COUNT=1 INPUT_LENGTH=10000000 UPDATE_COUNT=1 bun hash.js
20.361431000000003
1 hashes, 1 updates, input length 10000000, 1 results

  1. One way to mitigate the GC impact for the first use case could be introducing a one-shot helper (probably with EVP_Q_digest) that generates a digest in one go without creating intermediate objects. Performance aside, that helper might be useful enough to be worth it on its own anyway.
  2. We could continue improving string decoding performance, for example using simdutf + fast API calls, though our hands are tied again for non-flat/two-byte strings. But that's a more generic issue than just crypto. Also, this doesn't matter when the user passes a buffer as input. This is still somewhat a V8 issue (i.e. string->WriteUtf8() and others aren't very performant).
  3. In terms of GC performance in the object-based implementation, I think beyond migrating to Oilpan there isn't much that can be done on Node.js's side; that's probably more of a V8 issue, or it may not be a valid issue in the first place, given that allocating tons of objects in a tight loop to really stress the GC out is not a common pattern in the real world.
fabiospampinato commented 7 months ago

the other bottleneck is the string decoding part that converts input from a string to a buffer.

As I understand it V8 keeps track of whether a string is encoded in latin1 or not, because if it's using latin1 it can just allocate 1 byte per character rather than 2. In that case, for latin1 strings, if the rope is flattened and whatever else may be needed, shouldn't decoding be essentially free because the slice of memory that the string points to can just be hashed directly? Maybe it's more complicated than that 🤔

joyeecheung commented 7 months ago

In that case, for latin1 strings, if the rope is flattened and whatever else may be needed, shouldn't decoding be essentially free because the slice of memory that the string points to can just be hashed directly?

V8 does not hand over the raw bytes to embedders, only a copy. Also it's only a pure copy for ASCII-only strings. For strings that are Latin1 but not ASCII there's still a bit of conversion to split each code point into two bytes.
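
For example (a small illustration; the digests are identical, only the input representation differs):

const crypto = require('node:crypto');

const asString = '/Users/fabio/path/to/some/random/file.js';
const asBuffer = Buffer.from(asString);

// The string input has to be converted/copied into bytes first...
crypto.createHash('sha1').update(asString).digest('hex');
// ...while the Buffer input already is raw bytes and needs no decoding step.
crypto.createHash('sha1').update(asBuffer).digest('hex');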

lemire commented 7 months ago

@joyeecheung

If you effectively made Node 2x faster on this benchmark… it feels like it is quite a success… it is not enough to beat Bun but it is enough so that it is no longer embarrassing for Node. Especially if you achieved that with just a few lines of code!

joyeecheung commented 7 months ago

I think "making Node 2x" faster isn't a very accurate description of the overall situation...or that Node.js being 3x slower than Bun already is a limited description of the situation - the 3x slowness is conditioned to:

  1. This is hashing a lot (100K) of short strings (in a tight loop where GC performance matters, but this isn't as common in the real world)
  2. The input is not streamed.

Once the input is not a string but a buffer, or the input is streamed, or there aren't as many as 100K inputs being hashed in the tight loop, or the input is longer, the slowdown decreases. For example, if the workload is just hashing 1000 buffers streamed in 100 chunks of 10000 bytes each, then Node.js and Bun already finish the workload in about the same time - with or without my changes. If it is 1000 strings in 100 chunks of 10000 characters, then Node.js is about 1.8x slower - with or without my changes.

And then, after addressing what can be done on the Node.js side, the primary bottlenecks we found are GC (when hashing a lot of small inputs in a tight loop) and string -> buffer conversion (when hashing strings streamed in many chunks). These are mostly V8 vs. JSC differences, not Node.js vs. Bun per se.

lemire commented 7 months ago

@joyeecheung Point taken. My comment was in the scope of the microbenchmark that motivates the issue. I did not mean to imply that you made Node 2x faster in general.

The string-to-buffer conversion issue is annoying.

joyeecheung commented 7 months ago

I tried out an implementation that computes a digest in one shot, which looks something like crypto.digest('sha1', input, 'hex'), at https://github.com/joyeecheung/node/tree/oneshot-digest to verify my theory about GC performance being the bottleneck - and I think it's correct. Without the intermediate objects and copies, the one-shot implementation is about 3x faster in the "hashing 100K short strings (100 bytes)" case than the object-based implementation.

'use strict';
const { digest, createHash } = require('node:crypto');
const { readFileSync } = require('node:fs');

const kHashCount = parseInt(process.env.HASH_COUNT) || 100000;
const kInputLength = parseInt(process.env.INPUT_LENGTH) || 1000;

const array = [];
for ( let i = 0; i < kHashCount; i++ ) {
  array.push(null);
}
const input = readFileSync('./test/fixtures/snapshot/typescript.js').toString().slice(0, kInputLength);

const sha1 = digest ?
    (input) => digest('sha1', input, 'hex') :
    (input) => createHash('sha1').update(input).digest('hex');

const start_time = performance.now();
for ( let i = 0; i < kHashCount; i++ ) {
  array[i] = sha1(input);
}
console.log(performance.now() - start_time);
console.log(`${kHashCount} hashes, input length ${input.length}, ${array.length} results`);

$ HASH_COUNT=1000000 INPUT_LENGTH=100 ./node_main hash2.js
2307.756804
1000000 hashes, input length 100, 1000000 results
$ HASH_COUNT=1000000 INPUT_LENGTH=100 out/Release/node hash2.js
716.874251
1000000 hashes, input length 100, 1000000 results

And in this profile, string encoding/decoding performance is the primary bottleneck:

  --21.28%--node::Utf8Value::Utf8Value
    --14.45%--v8::String::WriteUtf8
  --15.36%--EVP_Digest
  --10.80%--node::StringBytes::Encode
    --4.93%--v8::internal::Factory::AllocateRaw
      --4.32%--v8::internal::Heap::CollectGarbage
fabiospampinato commented 7 months ago

It looks like somebody had a similar idea at some point.

It'd be cool if kinda common APIs like that could be made 3x faster. Having both a hash and a createHash function seems a little weird though 🤔

joyeecheung commented 7 months ago

Oh yeah, I was thinking "there must be some feature request for this... or there must have been an open PR for this..." though I wasn't able to find https://github.com/nodejs/node/pull/42233 - that PR would still be quite a bit slower than the branch I have, though, because it's missing some pieces (e.g. caching the EVP_MD*, using explicit fetching, and overhead from the HashJob implementation, which is still object-based).

lemire commented 7 months ago

It'd be cool if kinda common APIs like that could be made 3x faster. Having both a hash and a createHash function seems a little weird though

I think you could maybe do it with a lazy/on-demand approach. So right now you compute... crypto.createHash("sha1").update(value).digest("hex") by eagerly doing all the work as soon as you can. This means that you need to have a non-trivial Hash object and do JavaScript-to-C++ calls repeatedly.

Instead, you could have a lightweight JavaScript object that simply collects the information without doing anything. Then, when the hash value is required, it computes it. In fact, if @joyeecheung implemented createHash, then an on-demand approach like what I describe could be just a few lines of code away, at least in prototypical form.
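
A toy sketch of the design space I am alluding to (purely illustrative; error handling and the streaming interface of the real Hash object are ignored):

const crypto = require('node:crypto');

// createLazyHash() only records the algorithm, update() only buffers its
// input in JS, and all of the C++ work is deferred to digest().
function createLazyHash(algorithm) {
  const chunks = [];
  return {
    update(data) { chunks.push(data); return this; },
    digest(encoding) {
      const hash = crypto.createHash(algorithm);
      for (const chunk of chunks) hash.update(chunk);
      return hash.digest(encoding);
    },
  };
}

// Usage mirrors the conventional API:
// createLazyHash('sha1').update(value).digest('hex');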

joyeecheung commented 7 months ago

Instead, you could have a lightweight JavaScript object that simply collects the information without doing anything.

I assume this is talking about buffering in JS land (possibly with a JS array) - the overhead of iterating over that array might already be another bottleneck until we get the new Array iteration API in https://github.com/nodejs/node/pull/50115. It would also increase memory overhead when there are more updates/larger inputs that wouldn't otherwise be kept alive. For a non-streaming use case, a one-shot digest implementation would still fare better, or users can do the buffering themselves and pass the concatenated buffers into a one-shot digest implementation if, for their use case, the memory overhead of buffering the original data matters less than hashing performance. (Although, as can be seen in https://github.com/nodejs/performance/issues/136#issuecomment-1837588252, when there are more updates in a streaming use case, the performance of Node.js and Bun is already closer.)

joyeecheung commented 7 months ago

Having both a hash and a createHash function seems a little weird though 🤔

The crypto interface largely just maps onto OpenSSL APIs, so I think a one-shot digest helper + a streaming API would still make sense, as there are similar interface parities in OpenSSL (EVP_Digest vs. EVP_DigestInit/EVP_DigestUpdate/EVP_DigestFinal_*). They serve different purposes - a streaming API fares better in an asynchronous application and has lower memory overhead, whereas the one-shot digest is faster if the use case is just computing a digest of a piece of immediately available data synchronously.

lemire commented 7 months ago

I assume this is talking about buffering in JS land (possibly with a JS array)

What I am saying is that crypto.createHash("sha1") only needs an object that contains "sha1". It does not need to do anything, at least nothing in C++ land right away. I am just alluding to the design space you have.

I do agree with you, btw, that it makes a lot of sense to have one-shot hashing... But what I am also saying is that crypto.createHash("sha1").update(buffer).digest(...) should not have wildly different performance characteristics.

joyeecheung commented 7 months ago

Yeah, there could be some micro-optimization that only starts the object-based implementation upon the second update / when the first update contains a big enough input (say >1MB), and otherwise falls back to a one-shot digest implementation. Though the prerequisite would still be having a one-shot digest implementation...
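
A rough sketch of that hybrid (the names and the 1 MB threshold are illustrative only, and the "one-shot" path is approximated with createHash() because a public one-shot helper does not exist here):

const crypto = require('node:crypto');

const BIG_INPUT = 1024 * 1024;

function createAdaptiveHash(algorithm) {
  let pending = null; // buffered first small chunk
  let real = null;    // object-based Hash, created lazily

  return {
    update(data) {
      if (real === null && pending === null && data.length < BIG_INPUT) {
        pending = data; // first small update: just remember it
        return this;
      }
      if (real === null) {
        real = crypto.createHash(algorithm); // second update or big input
        if (pending !== null) { real.update(pending); pending = null; }
      }
      real.update(data);
      return this;
    },
    digest(encoding) {
      if (real !== null) return real.digest(encoding);
      // Single small update (or no update at all): take the one-shot-style path.
      const hash = crypto.createHash(algorithm);
      if (pending !== null) hash.update(pending);
      return hash.digest(encoding);
    },
  };
}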

lemire commented 7 months ago

Yeah, there could be some micro-optimization that only starts the object-based implementation upon the second update / when the first update contains a big enough input (say >1MB), and otherwise falls back to a one-shot digest implementation. Though the prerequisite would still be having a one-shot digest implementation...

Precisely. :heart: :heart: The idea is that if you propose a new function, not everyone will rush to adopt it. But it might be possible to provide similar benefits to people who use the conventional API. :dash:

merceyz commented 6 months ago

@joyeecheung If you want a real world example where createHash is called a lot then Yarn might provide that.

Setup:

git clone https://github.com/yarnpkg/berry
cd berry
git checkout db6210f48355d2986e965f90009b22f18d3b6342
yarn build:cli --no-minify

Running the following command will call createHash 75068 times:

rm -f .yarn/install-state.gz
YARN_IGNORE_PATH=1 CI=true node ./packages/yarnpkg-cli/bundles/yarn.js

and the following will, at the time of writing, call it 90340 times:

rm -f yarn.lock .yarn/install-state.gz
YARN_IGNORE_PATH=1 node ./packages/yarnpkg-cli/bundles/yarn.js

about 1853 of those should be hashing streamed files and the rest strings.

joyeecheung commented 4 months ago

I looked into the "buffering input" idea for crypto.createHash and it turns out to be more complicated than I thought - most importantly, errors would be delayed until the second update() or digest() is called (e.g. if you pass an algorithm that the OpenSSL version loaded into the process doesn't support, you can't get an error immediately, because we don't eagerly cache all the supported algorithms, to avoid overhead on the first access, so you'd have to wait until the actual hashing call happens and OpenSSL tells us the algorithm is not supported). Add to that the fact that Hash objects are also streams, and it could be difficult to get the error handling correct, so I'll just go with https://github.com/nodejs/node/pull/51044 for now.

H4ad commented 4 months ago

Since https://github.com/nodejs/node/pull/51044 was merged, I think we can close this issue.

Any other discussion about OpenSSL and its performance should, I think, continue in https://github.com/nodejs/performance/issues/72.