msprotz commented 1 year ago

Feature or enhancement

We propose to replace the non-OpenSSL cryptographic primitives in hashlib with high-assurance, verified versions from the HACL* project.

Pitch

As evidenced by the recent SHA3 buffer overflow, cryptographic primitives are tricky to implement correctly. There might be issues with memory management, exceeding lengths, incorrect buffer management, or worse, incorrect implementations in corner cases.

The HACL* project https://github.com/hacl-star/hacl-star provides verified implementations of cryptographic primitives. These implementations are mathematically shown to be:

memory safe (no buffer overflows, no use-after-free)
functionally correct (they always compute the right result)
side-channel resistant (the most egregious variants of side-channels, such as memory and timing leaks, are ruled out by construction).

See https://hacl-star.github.io/Overview.html#what-is-verified-software for a longer description of how formal methods can help write high-assurance software and rule out entire classes of bugs.

The performance of HACL is competitive with, and sometimes exceeds, that of OpenSSL. HACL is distributed as pure C, and therefore is portable. Parts of HACL* have been adopted in Mozilla, Linux, the Tezos blockchain, and many more, thereby demonstrating that formally verified code is ready for production-time.

Previous discussion

Tagging @alex with whom I've informally discussed this.

PR: gh-99109

PR: gh-101707

PR: gh-101917

PR: gh-101924

PR: gh-102089

PR: gh-102183

PR: gh-102308

PR: gh-103597

PR: gh-104293

PR: gh-104401

PR: gh-104675

PR: gh-104776

PR: gh-104808

PR: gh-104893

PR: gh-105634

PR: gh-105635

PR: gh-117237

PR: gh-117243

PR: gh-117295

PR: gh-117302

PR: gh-119316

PR: gh-123027

PR: gh-123146

gpshead commented 1 year ago

(missing context: there was a private email thread cc'ing ~7 of us including myself, @tiran, @alex, @msprotz & others that led to this issue and PR) in response to https://github.com/python/cpython/issues/98517 CVE happening with our old XKCP sha3 vendored code.

One reason I might support this work across all of our required always supported hashes is that if these perform well, I'd be happy to see the _hashlib module wrapping OpenSSL deprecated entirely. (Others probably disagree with that, so we may never get there). One less thing depending on OpenSSL sounds nice - but the OpenSSL EVP interface and these algorithms are far less likely to be where OpenSSL CVE bugs lurk (famous last words). BUT... if the vendored third party code we use as a fallback for hash algorithms can be replaced with something nearly guaranteed not to be the next CVE as the hack-star project appears to aim to be (I cannot actually be the judge of that), and be high performance... Why should we bother involving OpenSSL for these at all?

If this sounds like the opposite of my statements over buried in the blake3 rejection thread - it's because I did not seek an alternative to the variety of vendored fallback hash algorithm implementations we carry there. So the goal in those statements was to lean towards simplicity instead of complexity. (which has already paid off via our XKCP removal in favor of tiny_sha3 by not having Python 3.11+ suffer from the sha3 XKCP CVE that the larger complex third party code in 3.6-3.10 did)

We provide and ship native implementations for the hash algorithms anyways as we support builds without OpenSSL present, and support building on systems with libssl versions or variants that do not provide everything.

Q: Would adopting this code increase our maintenance burden?

A: Probably, yes. We'd presumably want to update it with more recent versions from hacl-star from time to time. Though the theory is that there should never be a security need to do so? In the past we've been asked to update vendored hash algorithm code from time to time but have hesitated due to the complexity and perceived risk, and assumed lack of reward given people who care about performance would presumably have a modern OpenSSL to get that from. Though if we did get rid of the OpenSSL wrapper code that'd also be a reduced burden.

Q: What about performance?

A: Good question. That needs continual measuring on our Tiered supported platforms. Don't focus solely on thruput. OpenSSL's EVP APIs are known to have silly-slow setup time. So something hashing a pile of 234 byte objects may be slow no matter how good OpenSSL's bulk data thruput is. Benchmark small message thruput as well as bulk data thruput.

Parting thought: There is a broader desire to wean us off of our OpenSSL dependency. I realize that's a hard thing so long as the _ssl module exists, but it is a long term desire of many. This kind of change would be one necessary step, regardless of when it is taken.

msprotz commented 1 year ago

Thanks for your comment, Greg. Some related thoughts.

We have support for all of the algorithms in hashlib (md5, sha1, sha2, sha3, blake2). So there is a clear pathway to consolidating those implementations together into a single vendored library that is side-channel resistant, and that has proofs of correctness.
Getting rid of OpenSSL is feasible, but not today (as you accurately pointed out on #99109). For now, we are proposing pure C that is easy to compile, portable, fast, and secure. We have a technology for mixing ASM and C, or even using C compiler intrinsics -- that's what we used to beat OpenSSL in a 2020 paper, on one specific algorithm. But then there are numerous drawbacks:
- your build becomes more complicated: you need to probe the current toolchain and make sure it even knows about e.g. that fancy AVX512 intrinsic you're about to use; or that it can use inline ASM; etc.
- your runtime becomes more complicated: once the compiled binary is copied to another computer, you need to perform run-time CPU detection, so that in the event that you did manage to compile that AVX512 version, then you do use it if the processor you're running on happens to have those fancy instructions
- your maintenance burden becomes greater, and you need to decide how many ASM versions you want to support, for which architecture families, etc.

None of this is infeasible -- it's just a matter of deciding what is good for a Python-in-the-future where hashlib no longer depends on OpenSSL. For instance, you might say "let's maintain a portable C version, and one single ASM version that uses SHAEXT when available (fastest)", and this is something we could make happen within HACL*.

Just to give you a sense of what we have right now, allow me to go into greater detail:

MD5, SHA1: legacy broken algorithms, no particular performance optimizations have been performed
SHA2: on-par with OpenSSL noasm+nohw, option to use SHAEXT when available but currently not part of #99109 -- happy to do that in a followup PR
SHA3: a few low-hanging fruits on our side to make performance better, happy to work on that if this is what it takes for Python to integrate it
Blake2: regular portable C, AVX and AVX2 versions, just like the reference implementation; same as SHA3, probably needs a few performance tweaks before integrating.

Hope this helps. As you said, this kind of change is a first step in the right direction.

gpshead commented 1 year ago

Alright, the first PR is in. Checkbox time:

[x] 2a) The initial HACL* SHA2 sha256 and sha224 implementation PR has been merged.
[x] 2b) SHA384 & SHA512 support
[x] 3) SHA3 support
[x] SHA1 and MD5
[ ] HMAC for the above
[x] is Blake2b competitive? It'll need a performance comparison as we currently ship blake2 code that from upstream blake2 that includes at least amd64 sse4.1 asm optimizations.

msprotz commented 1 year ago

Great! sha384/sha512 is a no-brainer and I'll send a followup PR very soon. We have SHA3 as well, so I can send that in, too, with the caveat that we have a few performance optimizations for sha3 in the works, so we can either wait for those to land in HACL then submit the Python PR, or first land SHA3 in Python, then send a followup PR to refresh the HACL code once the performance tweaks have landed. Happy to hear your thoughts on this.

gpshead commented 1 year ago

Given that the SHA3 implementation we currently ship is a tiny poor performing one, no need to wait. I expect we'll pull in updates to the HACL* implementations as needed in the future when there is a motivating reason.

msprotz commented 1 year ago

@gpshead I'll let you check MD5/SHA1 in the checklist above... quick question: what is the timeline for landing sha3? it requires a little bit of work on my side and I'll be traveling all March, so ideally I would submit the SHA3 PR in April. But if it's a dealbreaker because of e.g. the Python release schedule, I can scramble to try to submit it before then. Let me know.

gpshead commented 1 year ago

So long as we can get it committed before 3.12beta1 at the beginning of May (see https://peps.python.org/pep-0693/) we're good. So April for that is fine by me. No need to scramble.

msprotz commented 1 year ago

With #103597, there is only one algorithm left in hashlib that is not verified: Blake2.

I took a quick glance at the code, and I have a few high-level points to discuss. Would love to have feedback on these, below. Thanks!

Performance

HACL_blake2s_32_streaming                  3758 ns         3755 ns       186711
HACL_blake2s_vec128_streaming              3265 ns         3251 ns       205682
OpenSSL_hash_streaming/blake2s256          4558 ns         4550 ns       136367

This is a benchmark of HACL's "streaming" API (the one that we use here, with internal buffering) vs. the same flavor of API from OpenSSL. Lower is better. I don't have yet a comparison against official libb2 / reference implementation but will update this once I have it. In any case, it looks like the performance is good enough to alleviate any concerns. Let me know if this isn't the case.

system libb2

It looks like there are three possible implementations for blake2. Vendored reference implementation, libb2 implementation if found on the system, and openssl if hashlib is configured to use openssl. Is that correct?

Why bother with a system libb2? Is there a strong reason? If the HACL implementation is competitive with OpenSSL (per benchmark above), then is it ok to remove support for libb2? (Looking to simplify things here!)

CPU feature detection

Blake2 has regular (portable) versions, and optimized versions, which require support for a vectorized instruction set (SSSE3, SSE4.1, AVX, or XOP). The test that determines whether to use these versions is:

#if defined(__SSSE3__) || defined(__SSE4_1__) || defined(__AVX__) || defined(__XOP__)

I can see a few potential issues, and I would like to know how much the Python team cares about those.

First, picking an optimized version is entirely dependent on what the compiler thinks is a baseline microarchitecture for the target (i.e. the default value of -march, determined at compiler configure-time). On my OSX machine, the test above is true with Apple clang, but false with gcc-12. (GCC was configured to assume x86_64 as its baseline target, which has no support for vector instructions. Clang probably was configured by Apple to assume the first two flags, since I presume no Apple device exists without SSSE3 nor SSE4.1.) Since (as far as I can tell) the Python build system doesn't use any particular -march flag, the choice of optimization level for Blake2 depends on the choice of compiler. (It shouldn't.)
Second, the feature test is done at compile-time instead of run-time, which poses portability issues. For instance, on my Linux box, both clang and gcc assume no particular support for any of these instruction sets at compile-time, meaning the optimized implementation is never selected. This is a correct, but suboptimal choice. (Note that using libb2 doesn't help either, since Debian assumes i686 but not above, in order to generate portable packages.) Conversely, there might be instances in which the user compiles with -march=native and enables the optimized version, but produces a binary that might not execute successfully on all systems.

Since this doesn't appear to have been a problem so far, my hunch is that the optimized version is almost never selected, because most compilers are conservative in their choice of minimum target architecture (e.g. i686 and x86_64 are the respective baselines, but nothing more than that). This means that unless you are compiling with Apple clang, or use custom CFLAGS, the version you're picking up is almost certainly the baseline version with no vectorized optimizations.

HACL has baseline and* vectorized implementations. But if we wanted to keep it simple, I think it would be ok to ship just the regular version as part of hashlib. Let me know what you think.

gpshead commented 1 year ago

blake2:

3.11 shipped with the ability to link against libb2 via https://github.com/python/cpython/issues/91251, some OS distro packagers presumably like that and libb2 should be considered better than our old vendored copy of that. I believe it might use proper runtime feature detection?

OpenSSL doesn't support enough blake2 to provide all of the APIs hashlib supports for blake2, so libb2 or our older vendored copy of that is always used for hashlib.blake2b & blake2s functions regardless. So OpenSSL isn't really in the picture here. So we're down to two implementations, not three.

I agree that the optimized version selection logic at compile time is the wrong approach. Useful on distros like Gentoo (which I believe is where the PR author was coming from) as people tend to build the entire system with recent-arch compiler flags there. But you are right that it probably doesn't do what was intended for a lot of users and just uses the C reference version. That was https://github.com/python/cpython/pull/4066. It'll be good to see that go away. :)

If we do a HACL implementation of blake2, it should replace our vendored blake2 (C and optimized version) code.

Performance comparison numbers of that vs linking against libb2 as shipped on a common debian or fedora-like distro can show if there is sufficient reason to keep the libb2 option at all. 100% agreed that getting rid of libb2 support in the presence of a HACL vendored version may be simpler code to maintain (desirable). We'd be down to a single version at that point!

msprotz commented 1 year ago

Thanks, that's useful context!

Good to know that openssl is not used for blake2. In that case, these are not used? https://github.com/python/cpython/blob/main/Modules/_hashopenssl.c#L133
You are correct that libb2 performs proper CPU detection using inline asm to run cpuid. (I just checked.) This is the right way to do it, and also how we do it in HACL for our APIs that perform automatic CPU detection (not part of the current batch of PRs).

My suggestion for Blake2 is to do it in two phases.

Phase 1: replaced vendored reference implementation (both regular C and AVX intrinsics versions) with plain, non-AVX HACL code. Rationale: almost no one was getting the AVX version anyhow (per discussion above), except perhaps Gentoo users who also happened to not install libb2. So, a good opportunity for simplification.
Phase 2: do a general overhaul of hashlib to leverage optimized code from HACL*. Right now, we have AVX (and NEON) blake2s, AVX2 blake2b, and SHA-EXT sha256/sha224. This can be done in a later phase, as it involves bringing in CPU autodetection facilities (to do proper CPUID like libb2 does), and requires a little bit of build work.

I'll come back here shortly with some benchmarks.

gpshead commented 1 year ago

Regarding https://github.com/python/cpython/blob/main/Modules/_hashopenssl.c#L133 the _hashlib blake2b512 and blakes2s256 OpenSSL functions exist and can be called via the hashlib.new() public API using those as string algorithm names, but they aren't documented and should be considered versions not guaranteed to be present in a build. I'd be surprised if anyone actually ever used them - but we wouldn't bother to remove them unless we were at the point of getting rid of the OpenSSL backed hashlib entirely.

msprotz commented 1 year ago

The benchmarks for HACL vs Blake2 reference implementation are still on a branch, but I managed to run them. Here are the relevant lines:

BLAKE2_blake2b_ref_streaming               2450 ns         2449 ns       276237
BLAKE2_blake2s_ref_streaming               2990 ns         2989 ns       226532
HACL_blake2b_32_streaming                  2746 ns         2744 ns       249755
HACL_blake2s_32_streaming                  3557 ns         3555 ns       191316

This is assuming that Blake2 reference implementations are not running optimized (which, per our discussion above, is likely the reality for most users).

The short version is that HACL is about 10% slower than the reference implementation. But if most users run with libb2 anyhow, and the goal is to streamline the crypto in hashlib, then I'm happy to submit a subsequent PR for Blake2. Let me know.

(In the meanwhile we'll try to take a look at the performance gap.)

gpshead commented 1 year ago

Sounds good, go ahead and work up a PR for blake2 HACL when you've got a chance.

Don't fret about the 3.12beta1 feature freeze cutoff for blake2 - while our release manager has technically delayed that to two weeks from today, it isn't a big deal if a blake2 built-in update misses 3.12.

msprotz commented 1 year ago

Quick update: I took a look at our blake2 code and it doesn't expose enough of the parameterized API which Python currently offers for tree hashing. It's not a big deal, but means there's enough work on our end that Blake2 won't make it into 3.12 but rather into the next release... I'll post updates here once we make enough progress upstream.

encukou commented 1 month ago

The PR broke several stable x86_64-unknown-linux-gnu RHEL7 buildbots with:

../Modules/_hacl/Lib_Memzero0.c:43:5: error: implicit declaration of function ‘explicit_bzero’ [-Werror=implicit-function-declaration]
     explicit_bzero(dst, len_);
     ^~~~~~~~~~~~~~

By one reading of PEP-11, this needs to be fixed or reverted in 24 hours. Is that possible?

Note that while RHEL7 itself is past its end-of-maintenance phase, and breaking it doesn't really matter, it serves as a “canary” for platforms that don't support the non-standard extension -- Android being most prominent in this case, but also smaller systems without the resources to maintain buildbots. IMO, CPython should stick to C & POSIX standards, with anything extra behind a feature check.

mhsmith commented 1 month ago

This is partly my fault for not setting up the Android buildbot soon enough, so I'll submit a fix today.

This probably won't affect any other platform, since it's already guarded by a __linux__ check, and explicit_bzero is apparently available in current versions of both glibc and musl. But Android has a Linux kernel with a BSD-based libc.

As for RHEL7, it looks like the logical thing to do is to disable its buildbot on the main branch, but keep it running on 3.13 and older.

gpshead commented 1 month ago

Thanks! Also, sorry, I forgot about the RHEL7 canary side of things when merging. (Too many things being chased at once). The fix should just be an autoconf dance.

On Wed, Aug 14, 2024, 5:55 AM Malcolm Smith @.***> wrote:

This is partly my fault for not setting up the Android buildbot soon enough, so I'll submit a fix today.

This probably won't affect any other platform, since it's already guarded by a linux check, and explicit_bzero is apparently available in current versions of both glibc and musl. But Android has a Linux kernel with a BSD-based libc.

As for RHEL7, it looks like the logical thing to do is to disable its buildbot on the main branch, but keep it running on 3.13 and older.

— Reply to this email directly, view it on GitHub https://github.com/python/cpython/issues/99108#issuecomment-2288666002, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAQXCZ3BFDIGDHUMVLW5CDZRNHVZAVCNFSM6AAAAAARXSQKPKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBYGY3DMMBQGI . You are receiving this because you were mentioned.Message ID: @.***>

msprotz commented 1 month ago

This is partly my fault for not setting up the Android buildbot soon enough, so I'll submit a fix today.

Do I understand correctly that you'll try to submit an autoconf fix? In that case, I would add an autoconf check here: https://github.com/python/cpython/blob/main/configure.ac#L7777-L7784 to determine whether explicit_bzero is unavailable; if it is unavailable, define LIBHACL_MEMZERO_FLAGS=-DLINUX_NO_EXPLICIT_BZERO, then AC_SUBST the flag, then add the flag here: https://github.com/python/cpython/blob/main/Makefile.pre.in#L1393

Hope this helps!

mhsmith commented 1 month ago

Thanks, yes, that's what I'm planning to do.

python / cpython

Replace built-in hashlib with verified implementations from HACL* #99108

Feature or enhancement

Pitch

Previous discussion

Performance

system libb2

CPU feature detection