With "lean" I meant: doesn't use much code and is easy to compile and install.
I built a wheel from Jack's experimental package and it comes out to just under 100kB on Linux x64, compared to around the 1.1MB the Rust wheel needs:
```
Archive:  blake3_experimental_c-0.0.1-cp310-cp310-linux_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
   348528  2022-03-23 18:38   blake3.cpython-310-x86_64-linux-gnu.so
     3183  2022-03-23 18:38   blake3_experimental_c-0.0.1.dist-info/METADATA
      105  2022-03-23 18:38   blake3_experimental_c-0.0.1.dist-info/WHEEL
        7  2022-03-23 18:38   blake3_experimental_c-0.0.1.dist-info/top_level.txt
      451  2022-03-23 18:38   blake3_experimental_c-0.0.1.dist-info/RECORD
---------                     -------
   352274                     5 files
```
```
Archive:  blake3-0.3.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
     3800  2022-01-13 01:26   blake3-0.3.1.dist-info/METADATA
      133  2022-01-13 01:26   blake3-0.3.1.dist-info/WHEEL
       48  2022-01-13 01:26   blake3/__init__.py
  4195392  2022-01-13 01:26   blake3/blake3.cpython-310-x86_64-linux-gnu.so
      382  2022-01-13 01:26   blake3-0.3.1.dist-info/RECORD
---------                     -------
  4199755                     5 files
```
I don't know why there is such a significant difference in size. Perhaps the Rust version includes multiple variants for different CPU optimizations?!
I can't answer why the Rust one is so much larger--that's a question for Jack. But the blake3-py you built might (should?) have support for SIMD extensions. See the setup.py for how that works; it appears to at least try to use the SIMD extensions on x86 POSIX (32- and 64-bit), x86_64 Windows, and 64-bit ARM POSIX.
If you were really curious, you could run some quick benchmarks, then hack your local setup.py so it doesn't attempt to add support for those (see "portable code only" in setup.py), rebuild, and run your benchmarks again. If BLAKE3 got a lot slower, yup, you (initially) built it with SIMD extension support.
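For a rough before/after comparison, something like this works; it assumes the experimental module exposes a hashlib-style `blake3()` constructor, so adjust the names to whatever your build actually provides:

```python
# Rough throughput check for a local blake3-py build; assumes the module
# exposes a hashlib-style blake3() constructor (adjust names to your build).
import time

import blake3

CHUNK = 64 * 1024      # 64 KiB per update
ITERATIONS = 16_000    # ~1 GiB total

data = b"\x00" * CHUNK
h = blake3.blake3()
start = time.perf_counter()
for _ in range(ITERATIONS):
    h.update(data)
h.digest()
elapsed = time.perf_counter() - start
print(f"{CHUNK * ITERATIONS / elapsed / 1e6:.0f} MB/s")
```

Run it once against the default build and once against a portable-only build; a large drop in MB/s means the SIMD paths were being used.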
To anyone else who comes along with motivation:
I'm fine with blake3 being in hashlib, but I don't want us to guarantee it by carrying the implementation of the algorithm in the CPython codebase itself unless it gains wide, industry-standard-like adoption.
We should feel free to link to both the Rust blake3 and C blake3-py packages from the hashlib docs regardless.
Performance-wise... The SHA series have hardware acceleration on modern CPUs and SoCs. External libraries such as OpenSSL are in a position to provide implementations that make use of that. Same with the Linux Kernel CryptoAPI (https://bugs.python.org/issue47102).
Hardware-accelerated SHAs are likely faster than blake3 single core. And certainly more efficient in terms of watt-secs/byte.
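Whether those accelerated code paths actually get used depends on the CPU and on the OpenSSL build CPython links against, but it's easy to see what a given interpreter has available; a minimal check:

```python
# Show which OpenSSL this interpreter links against and which extra digests
# it makes available through hashlib (availability only, not proof that
# hardware acceleration is used at runtime).
import hashlib
import ssl

print(ssl.OPENSSL_VERSION)
print(sorted(hashlib.algorithms_available - hashlib.algorithms_guaranteed))
```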
Here's a wheel which only includes the portable code (I disabled all the special cases as you suggested).
```
Archive:  dist/blake3_experimental_c-0.0.1-cp310-cp310-linux_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
   297680  2022-03-23 19:26   blake3.cpython-310-x86_64-linux-gnu.so
     3183  2022-03-23 19:26   blake3_experimental_c-0.0.1.dist-info/METADATA
      105  2022-03-23 19:26   blake3_experimental_c-0.0.1.dist-info/WHEEL
        7  2022-03-23 19:26   blake3_experimental_c-0.0.1.dist-info/top_level.txt
      451  2022-03-23 19:26   blake3_experimental_c-0.0.1.dist-info/RECORD
---------                     -------
   301426                     5 files
```
I didn't run any benchmarks, but it's clear that the SIMD code was used in my initial build and that it adds some 50 kB to the .so file. This is on an older Linux x64 box with an Intel i7-4770K CPU.
It could be that the Rust version includes several such SIMD variants and then branches based on the CPU running the code.
In any case, the C extension is indeed very easy to build and install with a standard compiler setup.
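For anyone who wants to reproduce the portable-only build, a setup.py switch along these lines is one way to do it; the file names and the `BLAKE3_PORTABLE_ONLY` environment variable here are a hypothetical sketch, not taken from the actual blake3-py sources:

```python
# Hypothetical sketch of a "portable code only" switch in a setup.py;
# file names and the environment variable are illustrative only.
import os
from setuptools import setup, Extension

sources = [
    "blake3module.c",
    "vendor/blake3.c",
    "vendor/blake3_dispatch.c",   # picks an implementation at runtime via CPUID
    "vendor/blake3_portable.c",   # always present: plain C fallback
]

if not os.environ.get("BLAKE3_PORTABLE_ONLY"):
    # Each SIMD variant is its own translation unit; a real build would
    # compile each file with only its own -m flag (e.g. -mavx2 for the AVX2 file).
    sources += [
        "vendor/blake3_sse2.c",
        "vendor/blake3_sse41.c",
        "vendor/blake3_avx2.c",
        "vendor/blake3_avx512.c",
    ]

setup(
    name="blake3_experimental_c",
    version="0.0.1",
    ext_modules=[Extension("blake3", sources=sources)],
)
```

With a switch like this, `BLAKE3_PORTABLE_ONLY=1 pip wheel .` would produce the smaller, dispatch-free wheel.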
Rust-based anything comes with a baseline level of Rust code overhead. https://stackoverflow.com/questions/29008127/why-are-rust-executables-so-huge
That seems expected.
> Performance-wise... The SHA series have hardware acceleration on modern CPUs and SoCs. External libraries such as OpenSSL are in a position to provide implementations that make use of that. Same with the Linux Kernel CryptoAPI (https://bugs.python.org/issue47102).
>
> Hardware-accelerated SHAs are likely faster than blake3 single core. And certainly more efficient in terms of watt-secs/byte.
I don't know if OpenSSL currently uses the Intel SHA1 extensions. A quick google suggests they added support in 2017. And:
> Hardware-accelerated SHAs are likely faster than blake3 single core.
Surprisingly, they're not. Here's a quick measurement on my recent ThinkPad laptop (64 KiB of input, single-threaded, TurboBoost left on), which supports both AVX-512 and the SHA extensions:
```
OpenSSL SHA-256:  1816 MB/s
OpenSSL SHA-1:    2103 MB/s
BLAKE3 SSE2:      2109 MB/s
BLAKE3 SSE4.1:    2474 MB/s
BLAKE3 AVX2:      4898 MB/s
BLAKE3 AVX-512:   8754 MB/s
```
The main reason SHA-1 and SHA-256 don't do better is that they're fundamentally serial algorithms. Hardware acceleration can speed up a single instance of their compression functions, but there's just no way for it to run more than one instance per message at a time. In contrast, AES-CTR can easily parallelize its blocks, and hardware accelerated AES does beat BLAKE3.
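The OpenSSL half of a comparison like this can be reproduced with nothing but the standard library (numbers will of course vary with the CPU and the OpenSSL build); a rough sketch:

```python
# Single-threaded throughput of the OpenSSL-backed stdlib hashes,
# hashing ~1 GiB in 64 KiB chunks; results vary by CPU and OpenSSL build.
import hashlib
import time

def throughput_mb_s(name, chunk=64 * 1024, total=1 << 30):
    h = hashlib.new(name)
    data = b"\x00" * chunk
    start = time.perf_counter()
    for _ in range(total // chunk):
        h.update(data)
    h.digest()
    return total / (time.perf_counter() - start) / 1e6

for name in ("sha1", "sha256", "sha512"):
    print(f"{name:8s} {throughput_mb_s(name):7.0f} MB/s")
```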
> And certainly more efficient in terms of watt-secs/byte.
I don't have any experience measuring power myself, so take this with a grain of salt: I think the difference in throughput shown above is large enough that, even accounting for the famously high power draw of AVX-512, BLAKE3 comes out ahead in terms of energy/byte. Probably not on ARM though.
sha1 should be considered broken anyway, and sha256 does not perform well on 64-bit systems. Truncated sha512 (sha512-256) typically performs 40% faster than sha256 on x86_64. It should get you close to the performance of BLAKE3 SSE4.1 on your system.
> Truncated sha512 (sha512-256) typically performs 40% faster than sha256 on x86_64.
Without hardware acceleration, yes. But because SHA-NI includes only SHA-1 and SHA-256, and not SHA-512, it's no longer a level playing field. OpenSSL's SHA-512 and SHA-512/256 both get about 797 MB/s on my machine.
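For completeness, hashlib already exposes truncated SHA-512 when the linked OpenSSL provides it ("sha512_256" is not in `hashlib.algorithms_guaranteed`); for example:

```python
# SHA-512/256 through hashlib, if the linked OpenSSL provides it.
import hashlib

if "sha512_256" in hashlib.algorithms_available:
    print(hashlib.new("sha512_256", b"hello world").hexdigest())
else:
    print("sha512_256 not available in this build")
```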
You missed the key "And certainly more efficient in terms of watt-secs/byte" part.
I did reply to that point above with some baseless speculation, but now I can back up my baseless speculation with unscientific data :)
https://gist.github.com/oconnor663/aed7016c9dbe5507510fc50faceaaa07
According to whatever `powerstat -R` measures on my laptop, running hardware-accelerated SHA-256 in a loop for a minute or so takes 26.86 Watts on average. Doing the same with AVX-512 BLAKE3 takes 29.53 Watts, 10% more. Factoring in the 4.69x difference in throughput reported by those loops, the overall energy/byte for BLAKE3 is 4.27x lower than for SHA-256. This is my first time running a power benchmark, so if this sounds implausible, hopefully someone can catch my mistakes.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['type-feature', 'library', '3.11']
title = 'add BLAKE3 to hashlib'
updated_at =
user = 'https://github.com/larryhastings'
```
bugs.python.org fields:
```python
activity =
actor = 'oconnor663'
assignee = 'none'
closed = True
closed_date =
closer = 'larry'
components = ['Library (Lib)']
creation =
creator = 'larry'
dependencies = []
files = []
hgrepos = []
issue_num = 39298
keywords = ['patch']
message_count = 62.0
messages = ['359777', '359794', '359796', '359936', '359941', '360152', '360215', '360535', '360838', '360840', '361918', '361925', '363397', '391355', '391356', '391360', '391418', '401070', '401093', '410277', '410293', '410363', '410364', '410386', '410451', '410452', '410453', '410454', '410455', '413413', '413414', '413490', '413563', '414532', '414533', '414543', '414544', '414565', '414569', '414571', '415761', '415806', '415823', '415842', '415846', '415847', '415863', '415885', '415888', '415889', '415894', '415895', '415897', '415898', '415900', '415902', '415920', '415948', '415951', '415952', '415960', '415973']
nosy_count = 11.0
nosy_names = ['lemburg', 'gregory.p.smith', 'larry', 'christian.heimes', 'mgorny', "Zooko.Wilcox-O'Hearn", 'jstasiak', 'oconnor663', 'corona10', 'xtreak', 'kmaork']
pr_nums = ['31686']
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue39298'
versions = ['Python 3.11']
```