paulmillr / noble-hashes

Audited & minimal JS implementation of hash functions, MACs and KDFs.
https://paulmillr.com/noble
MIT License

Add support for big endian platforms #81

Closed jonathan-albrecht-ibm closed 7 months ago

jonathan-albrecht-ibm commented 7 months ago

This PR adds support for big endian platforms. It adds byte swapping where needed so that all hash functions can run correctly on both big and little endian platforms. It tries to avoid any significant performance degradation on little endian platforms.

The hash families that are affected by this PR are:

All of the other hash families already worked correctly on big endian platforms.

I have included all of the changes in this PR to hopefully make it easy to give feedback on the approach. I'm happy to split it into smaller PRs if preferred.

Most of the byte swapping is done in place on Uint32Arrays, which gave the best performance of the approaches I tried.
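For readers unfamiliar with the technique, in-place swapping on a Uint32Array can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper names `byteSwap` and `byteSwap32` are assumptions chosen for this example.

```typescript
// Reverse the four bytes of a single 32-bit word.
// Bitwise operators work on signed 32-bit integers, so the result may be
// negative; storing it into a Uint32Array converts it back to unsigned.
function byteSwap(word: number): number {
  return (
    ((word << 24) & 0xff000000) |
    ((word << 8) & 0xff0000) |
    ((word >>> 8) & 0xff00) |
    ((word >>> 24) & 0xff)
  );
}

// Swap every word of the array in place, without allocating a new buffer.
function byteSwap32(arr: Uint32Array): void {
  for (let i = 0; i < arr.length; i++) {
    arr[i] = byteSwap(arr[i]);
  }
}
```

Applying `byteSwap32` twice restores the original contents, which is what makes a swap-then-restore pattern possible.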

In the blake base class (_blake.ts) update() function, there is one spot where the byte swapping is done on the data32 Uint32Array, which is backed by the input message. The byte swapping is reversed before update() returns, but while update() is running, the user's input message will have been mutated if they passed it in as a Uint32Array. I'm not sure if it's OK for the input to be temporarily mutated. If not, I have a different fix that avoids the mutation at the cost of slightly worse performance.

Except for that spot, all other byte swapping should be done only on internal or output buffers.
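To make the trade-off concrete, here is a hedged sketch of the two options described above: swapping the caller's buffer temporarily versus working on a copy. The function names, the `hostIsLE` parameter, and the `compress` callback are all hypothetical and exist only for this illustration. The copy-based variant is one guess at what a non-mutating fix could look like, not the alternative the author actually had in mind.

```typescript
// Reverse the byte order of every 32-bit word in place.
function swapWords(arr: Uint32Array): void {
  for (let i = 0; i < arr.length; i++) {
    const w = arr[i];
    arr[i] = ((w << 24) | ((w << 8) & 0xff0000) | ((w >>> 8) & 0xff00) | (w >>> 24)) >>> 0;
  }
}

type Compress = (words: Uint32Array) => void;

// Variant 1: swap the caller's words in place, run `compress`, then swap back.
// While `compress` runs, the caller's underlying buffer holds swapped bytes.
function processInPlace(data32: Uint32Array, hostIsLE: boolean, compress: Compress): void {
  if (!hostIsLE) swapWords(data32);
  try {
    compress(data32);
  } finally {
    if (!hostIsLE) swapWords(data32); // restore the caller's bytes
  }
}

// Variant 2 (assumed alternative): copy first, so the input is never touched.
// This costs an extra allocation and copy on big endian hosts.
function processWithCopy(data32: Uint32Array, hostIsLE: boolean, compress: Compress): void {
  if (hostIsLE) {
    compress(data32);
  } else {
    const scratch = Uint32Array.from(data32);
    swapWords(scratch);
    compress(scratch);
  }
}
```

Passing `hostIsLE = false` simulates a big endian host on any machine. With variant 1, a `Uint8Array` that shares the buffer with `data32` is observably changed inside `compress` and restored afterwards; with variant 2 it is never changed.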

Thanks in advance for looking at this. I'm happy to make any changes necessary.

paulmillr commented 7 months ago

Thanks for this. Could you show how the perf is degraded after this change?

jonathan-albrecht-ibm commented 7 months ago

Yes, I'll try. I've been using the benchmarks to watch performance. I've also been using the Node.js profiler to check whether any new functions have started showing up in the CPU profile. On little endian, my changes should not run any extra byte swapping or loops over the data, since that work is guarded by isLE checks.
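The guard described above could look something like the sketch below. The `isLE` name comes from the comment; the detection expression, the `swap32IfBE` helper, and its `littleEndian` parameter are assumptions made for illustration and may differ from the PR.

```typescript
// True when the host stores the least significant byte first.
// On a little endian host, the low byte 0x44 of 0x11223344 comes first in memory.
const isLE: boolean =
  new Uint8Array(new Uint32Array([0x11223344]).buffer)[0] === 0x44;

// Swap words only on big endian hosts. The `littleEndian` parameter defaults
// to the detected value and exists so both paths can be exercised in tests.
function swap32IfBE(arr: Uint32Array, littleEndian: boolean = isLE): Uint32Array {
  if (littleEndian) return arr; // fast path: no loop over the data
  for (let i = 0; i < arr.length; i++) {
    const w = arr[i];
    arr[i] = ((w << 24) | ((w << 8) & 0xff0000) | ((w >>> 8) & 0xff00) | (w >>> 24)) >>> 0;
  }
  return arr;
}
```

Because the check happens once per call rather than once per word, the little endian path adds only a single branch, which is consistent with the claim that no extra loops run on those platforms.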

Here is the output of npm run bench on an x86_64 Linux VM with Node.js v20.11.0 for the main branch (before) and the big-endian-port branch (after). I don't see any real difference, at least not beyond the noise of my VM:

main
-------
Benchmarking
SHA256 32B x 214,684 ops/sec @ 4μs/op ± 2.31% (min: 2μs, max: 19ms)
SHA384 32B x 120,525 ops/sec @ 8μs/op
SHA512 32B x 120,279 ops/sec @ 8μs/op ± 1.03% (min: 4μs, max: 5ms)
SHA3-256, keccak256, shake256 32B x 42,286 ops/sec @ 23μs/op
Kangaroo12 32B x 59,708 ops/sec @ 16μs/op
Marsupilami14 32B x 53,047 ops/sec @ 18μs/op
BLAKE2b 32B x 95,084 ops/sec @ 10μs/op
BLAKE2s 32B x 126,182 ops/sec @ 7μs/op ± 1.23% (min: 4μs, max: 5ms)
BLAKE3 32B x 120,860 ops/sec @ 8μs/op ± 1.34% (min: 3μs, max: 15ms)
RIPEMD160 32B x 176,211 ops/sec @ 5μs/op ± 2.54% (min: 3μs, max: 23ms)
HMAC-SHA256 32B x 60,734 ops/sec @ 16μs/op
RAM: rss=148.3mb heap=88.5mb used=62.4mb
-------
Benchmarking
HKDF-SHA256 32 x 24,953 ops/sec @ 40μs/op
HKDF-SHA256 64 x 22,651 ops/sec @ 44μs/op
HKDF-SHA256 256 x 13,537 ops/sec @ 73μs/op ± 1.44% (min: 44μs, max: 9ms)
PBKDF2-HMAC-SHA256 16384 x 13 ops/sec @ 73ms/op ± 7.83% (min: 65ms, max: 89ms)
PBKDF2-HMAC-SHA256 65536 x 3 ops/sec @ 293ms/op ± 4.09% (min: 269ms, max: 314ms)
PBKDF2-HMAC-SHA256 262144 x 0 ops/sec @ 1163ms/op ± 6.25% (min: 1113ms, max: 1285ms)
PBKDF2-HMAC-SHA512 16384 x 5 ops/sec @ 178ms/op ± 4.70% (min: 161ms, max: 201ms)
PBKDF2-HMAC-SHA512 65536 x 1 ops/sec @ 685ms/op ± 3.40% (min: 669ms, max: 721ms)
PBKDF2-HMAC-SHA512 262144 x 0 ops/sec @ 2571ms/op ± 3.21% (min: 2291ms, max: 2659ms)
Scrypt r: 8, p: 1, n: 16384 x 7 ops/sec @ 141ms/op ± 11.25% (min: 115ms, max: 205ms)
Scrypt r: 8, p: 1, n: 65536 x 1 ops/sec @ 509ms/op ± 1.50% (min: 490ms, max: 520ms)
Scrypt r: 8, p: 1, n: 262144 x 0 ops/sec @ 2356ms/op ± 15.57% (min: 2057ms, max: 2802ms)
Scrypt Async r: 8, p: 1, n: 16384 x 6 ops/sec @ 152ms/op ± 9.14% (min: 138ms, max: 210ms)
Scrypt Async r: 8, p: 1, n: 65536 x 1 ops/sec @ 668ms/op ± 1.58% (min: 646ms, max: 681ms)
Scrypt Async r: 8, p: 1, n: 262144 x 0 ops/sec @ 2583ms/op ± 5.06% (min: 2434ms, max: 2754ms)
RAM: rss=357.7mb heap=11.2mb used=6.8mb arr=268.5mb

big-endian-port
-------
Benchmarking
SHA256 32B x 232,234 ops/sec @ 4μs/op ± 2.34% (min: 2μs, max: 18ms)
SHA384 32B x 128,766 ops/sec @ 7μs/op
SHA512 32B x 130,701 ops/sec @ 7μs/op ± 1.01% (min: 4μs, max: 4ms)
SHA3-256, keccak256, shake256 32B x 42,758 ops/sec @ 23μs/op
Kangaroo12 32B x 58,719 ops/sec @ 17μs/op
Marsupilami14 32B x 57,710 ops/sec @ 17μs/op
BLAKE2b 32B x 99,265 ops/sec @ 10μs/op ± 1.59% (min: 6μs, max: 28ms)
BLAKE2s 32B x 115,326 ops/sec @ 8μs/op ± 1.54% (min: 4μs, max: 24ms)
BLAKE3 32B x 116,049 ops/sec @ 8μs/op ± 1.02% (min: 3μs, max: 7ms)
RIPEMD160 32B x 193,948 ops/sec @ 5μs/op ± 2.20% (min: 3μs, max: 19ms)
HMAC-SHA256 32B x 64,304 ops/sec @ 15μs/op
RAM: rss=149.7mb heap=89.3mb used=66.3mb
-------
Benchmarking
HKDF-SHA256 32 x 29,782 ops/sec @ 33μs/op
HKDF-SHA256 64 x 25,902 ops/sec @ 38μs/op
HKDF-SHA256 256 x 14,370 ops/sec @ 69μs/op
PBKDF2-HMAC-SHA256 16384 x 14 ops/sec @ 68ms/op ± 9.10% (min: 61ms, max: 87ms)
PBKDF2-HMAC-SHA256 65536 x 3 ops/sec @ 261ms/op ± 2.66% (min: 249ms, max: 271ms)
PBKDF2-HMAC-SHA256 262144 x 0 ops/sec @ 1060ms/op ± 2.12% (min: 1022ms, max: 1096ms)
PBKDF2-HMAC-SHA512 16384 x 6 ops/sec @ 144ms/op ± 7.41% (min: 130ms, max: 178ms)
PBKDF2-HMAC-SHA512 65536 x 1 ops/sec @ 555ms/op ± 3.22% (min: 530ms, max: 574ms)
PBKDF2-HMAC-SHA512 262144 x 0 ops/sec @ 2239ms/op ± 3.03% (min: 2158ms, max: 2313ms)
Scrypt r: 8, p: 1, n: 16384 x 8 ops/sec @ 121ms/op ± 10.17% (min: 103ms, max: 160ms)
Scrypt r: 8, p: 1, n: 65536 x 2 ops/sec @ 492ms/op ± 3.15% (min: 465ms, max: 518ms)
Scrypt r: 8, p: 1, n: 262144 x 0 ops/sec @ 1932ms/op ± 3.19% (min: 1861ms, max: 2007ms)
Scrypt Async r: 8, p: 1, n: 16384 x 7 ops/sec @ 141ms/op ± 5.43% (min: 120ms, max: 165ms)
Scrypt Async r: 8, p: 1, n: 65536 x 1 ops/sec @ 652ms/op ± 12.74% (min: 556ms, max: 767ms)
Scrypt Async r: 8, p: 1, n: 262144 x 0 ops/sec @ 2290ms/op ± 3.89% (min: 2191ms, max: 2428ms)
RAM: rss=343.4mb heap=11.5mb used=7.5mb arr=268.5mb

paulmillr commented 7 months ago

Good job Jonathan. Proper pull request!

jonathan-albrecht-ibm commented 7 months ago

Thanks for reviewing and merging @paulmillr!