jedisct1 / libsodium

A modern, portable, easy to use crypto library.
https://libsodium.org

Add Intel SHA extension for SHA-256 #668

Closed noloader closed 6 years ago

noloader commented 6 years ago

This patch adds SHA-256 support using the Intel SHA extensions. It is a hack because of my lack of familiarity with libsodium: I don't know how to cut in a new ISA or CPU feature, so I replaced the SHA C code with the SHA-extensions version for cut-in and testing. Someone more familiar with libsodium needs to take it further.

Credit should go to Sean Gulley of Intel. He wrote the article New Instructions Supporting the Secure Hash Algorithm on Intel® Architecture Processors. Later, I found his reference implementation at mitls | experimental | hash to fill in the pieces missing from the Intel blog. We deviated slightly by using unaligned loads and stores to avoid SIGBUS on unaligned buffers.

SHA-256 runs at about 3.8 cpb using the SHA extensions. The extensions are available on Goldmont and Goldmont+. The patch below was tested on a Celeron J3455, which is Goldmont; I purchased it specifically for testing the SHA instructions. You can also test on the GCC Compile Farm: GCC67, an AMD Ryzen 1700X, has SHA extensions.

libsodium was configured with the following for testing; I did not feel like messing with someone else's configure.ac. Thanks for respecting my CFLAGS and CXXFLAGS; it made this patch very easy to test. All 72 of libsodium's self-tests pass with the patch in effect.

CFLAGS="-g3 -O1 -msse4.2 -msha" CXXFLAGS="-g3 -O1 -msse4.2 -msha" ./configure

The diff is shown below; it is also attached as sha.diff.tar.gz.

$ cat sha256.diff
diff --git a/src/libsodium/crypto_hash/sha256/cp/hash_sha256_cp.c b/src/libsodium/crypto_hash/sha256/cp/hash_sha256_cp.c
index 264054f9..62c5b173 100644
--- a/src/libsodium/crypto_hash/sha256/cp/hash_sha256_cp.c
+++ b/src/libsodium/crypto_hash/sha256/cp/hash_sha256_cp.c
@@ -37,6 +37,10 @@
 #include "private/common.h"
 #include "utils.h"

+/* Intel SHA instructions */
+#include <nmmintrin.h>
+#include <immintrin.h>
+
 static void
 be32enc_vect(unsigned char *dst, const uint32_t *src, size_t len)
 {
@@ -94,6 +98,206 @@ static const uint32_t Krnd[64] = {
     W[i + ii + 16] =   \
         s1(W[i + ii + 14]) + W[i + ii + 9] + s0(W[i + ii + 1]) + W[i + ii]

+/* Intel SHA instructions. Requires -msse4.2 -msha */
+#if defined(__SHA__)
+static void
+SHA256_Transform(uint32_t state[8], const uint8_t data[64], uint32_t W[64],
+                 uint32_t S[8])
+{
+    __m128i STATE0, STATE1;
+    __m128i MSG, TMP, MASK;
+    __m128i MSG0, MSG1, MSG2, MSG3;
+    __m128i ABEF_SAVE, CDGH_SAVE;
+
+    /* Hack for single block */
+    unsigned int length = 64;
+
+    /* Load initial values */
+    TMP = _mm_loadu_si128((__m128i*) &state[0]);
+    STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
+    MASK = _mm_set_epi64x(0x0c0d0e0f08090a0bULL, 0x0405060700010203ULL);
+
+    TMP = _mm_shuffle_epi32(TMP, 0xB1);          /* CDAB */
+    STATE1 = _mm_shuffle_epi32(STATE1, 0x1B);    /* EFGH */
+    STATE0 = _mm_alignr_epi8(TMP, STATE1, 8);    /* ABEF */
+    STATE1 = _mm_blend_epi16(STATE1, TMP, 0xF0); /* CDGH */
+
+    while (length >= 64)
+    {
+        /* Save current state */
+        ABEF_SAVE = STATE0;
+        CDGH_SAVE = STATE1;
+
+        /* Rounds 0-3 */
+        MSG = _mm_loadu_si128((const __m128i*) (data+0));
+        MSG0 = _mm_shuffle_epi8(MSG, MASK);
+        MSG = _mm_add_epi32(MSG0, _mm_set_epi64x(0xE9B5DBA5B5C0FBCFULL, 0x71374491428A2F98ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+
+        /* Rounds 4-7 */
+        MSG1 = _mm_loadu_si128((const __m128i*) (data+16));
+        MSG1 = _mm_shuffle_epi8(MSG1, MASK);
+        MSG = _mm_add_epi32(MSG1, _mm_set_epi64x(0xAB1C5ED5923F82A4ULL, 0x59F111F13956C25BULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG0 = _mm_sha256msg1_epu32(MSG0, MSG1);
+
+        /* Rounds 8-11 */
+        MSG2 = _mm_loadu_si128((const __m128i*) (data+32));
+        MSG2 = _mm_shuffle_epi8(MSG2, MASK);
+        MSG = _mm_add_epi32(MSG2, _mm_set_epi64x(0x550C7DC3243185BEULL, 0x12835B01D807AA98ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG1 = _mm_sha256msg1_epu32(MSG1, MSG2);
+
+        /* Rounds 12-15 */
+        MSG3 = _mm_loadu_si128((const __m128i*) (data+48));
+        MSG3 = _mm_shuffle_epi8(MSG3, MASK);
+        MSG = _mm_add_epi32(MSG3, _mm_set_epi64x(0xC19BF1749BDC06A7ULL, 0x80DEB1FE72BE5D74ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG3, MSG2, 4);
+        MSG0 = _mm_add_epi32(MSG0, TMP);
+        MSG0 = _mm_sha256msg2_epu32(MSG0, MSG3);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG2 = _mm_sha256msg1_epu32(MSG2, MSG3);
+
+        /* Rounds 16-19 */
+        MSG = _mm_add_epi32(MSG0, _mm_set_epi64x(0x240CA1CC0FC19DC6ULL, 0xEFBE4786E49B69C1ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG0, MSG3, 4);
+        MSG1 = _mm_add_epi32(MSG1, TMP);
+        MSG1 = _mm_sha256msg2_epu32(MSG1, MSG0);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG3 = _mm_sha256msg1_epu32(MSG3, MSG0);
+
+        /* Rounds 20-23 */
+        MSG = _mm_add_epi32(MSG1, _mm_set_epi64x(0x76F988DA5CB0A9DCULL, 0x4A7484AA2DE92C6FULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG1, MSG0, 4);
+        MSG2 = _mm_add_epi32(MSG2, TMP);
+        MSG2 = _mm_sha256msg2_epu32(MSG2, MSG1);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG0 = _mm_sha256msg1_epu32(MSG0, MSG1);
+
+        /* Rounds 24-27 */
+        MSG = _mm_add_epi32(MSG2, _mm_set_epi64x(0xBF597FC7B00327C8ULL, 0xA831C66D983E5152ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG2, MSG1, 4);
+        MSG3 = _mm_add_epi32(MSG3, TMP);
+        MSG3 = _mm_sha256msg2_epu32(MSG3, MSG2);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG1 = _mm_sha256msg1_epu32(MSG1, MSG2);
+
+        /* Rounds 28-31 */
+        MSG = _mm_add_epi32(MSG3, _mm_set_epi64x(0x1429296706CA6351ULL,  0xD5A79147C6E00BF3ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG3, MSG2, 4);
+        MSG0 = _mm_add_epi32(MSG0, TMP);
+        MSG0 = _mm_sha256msg2_epu32(MSG0, MSG3);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG2 = _mm_sha256msg1_epu32(MSG2, MSG3);
+
+        /* Rounds 32-35 */
+        MSG = _mm_add_epi32(MSG0, _mm_set_epi64x(0x53380D134D2C6DFCULL, 0x2E1B213827B70A85ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG0, MSG3, 4);
+        MSG1 = _mm_add_epi32(MSG1, TMP);
+        MSG1 = _mm_sha256msg2_epu32(MSG1, MSG0);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG3 = _mm_sha256msg1_epu32(MSG3, MSG0);
+
+        /* Rounds 36-39 */
+        MSG = _mm_add_epi32(MSG1, _mm_set_epi64x(0x92722C8581C2C92EULL, 0x766A0ABB650A7354ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG1, MSG0, 4);
+        MSG2 = _mm_add_epi32(MSG2, TMP);
+        MSG2 = _mm_sha256msg2_epu32(MSG2, MSG1);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG0 = _mm_sha256msg1_epu32(MSG0, MSG1);
+
+        /* Rounds 40-43 */
+        MSG = _mm_add_epi32(MSG2, _mm_set_epi64x(0xC76C51A3C24B8B70ULL, 0xA81A664BA2BFE8A1ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG2, MSG1, 4);
+        MSG3 = _mm_add_epi32(MSG3, TMP);
+        MSG3 = _mm_sha256msg2_epu32(MSG3, MSG2);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG1 = _mm_sha256msg1_epu32(MSG1, MSG2);
+
+        /* Rounds 44-47 */
+        MSG = _mm_add_epi32(MSG3, _mm_set_epi64x(0x106AA070F40E3585ULL, 0xD6990624D192E819ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG3, MSG2, 4);
+        MSG0 = _mm_add_epi32(MSG0, TMP);
+        MSG0 = _mm_sha256msg2_epu32(MSG0, MSG3);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG2 = _mm_sha256msg1_epu32(MSG2, MSG3);
+
+        /* Rounds 48-51 */
+        MSG = _mm_add_epi32(MSG0, _mm_set_epi64x(0x34B0BCB52748774CULL, 0x1E376C0819A4C116ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG0, MSG3, 4);
+        MSG1 = _mm_add_epi32(MSG1, TMP);
+        MSG1 = _mm_sha256msg2_epu32(MSG1, MSG0);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+        MSG3 = _mm_sha256msg1_epu32(MSG3, MSG0);
+
+        /* Rounds 52-55 */
+        MSG = _mm_add_epi32(MSG1, _mm_set_epi64x(0x682E6FF35B9CCA4FULL, 0x4ED8AA4A391C0CB3ULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG1, MSG0, 4);
+        MSG2 = _mm_add_epi32(MSG2, TMP);
+        MSG2 = _mm_sha256msg2_epu32(MSG2, MSG1);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+
+        /* Rounds 56-59 */
+        MSG = _mm_add_epi32(MSG2, _mm_set_epi64x(0x8CC7020884C87814ULL, 0x78A5636F748F82EEULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        TMP = _mm_alignr_epi8(MSG2, MSG1, 4);
+        MSG3 = _mm_add_epi32(MSG3, TMP);
+        MSG3 = _mm_sha256msg2_epu32(MSG3, MSG2);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+
+        /* Rounds 60-63 */
+        MSG = _mm_add_epi32(MSG3, _mm_set_epi64x(0xC67178F2BEF9A3F7ULL, 0xA4506CEB90BEFFFAULL));
+        STATE1 = _mm_sha256rnds2_epu32(STATE1, STATE0, MSG);
+        MSG = _mm_shuffle_epi32(MSG, 0x0E);
+        STATE0 = _mm_sha256rnds2_epu32(STATE0, STATE1, MSG);
+
+        /* Combine state  */
+        STATE0 = _mm_add_epi32(STATE0, ABEF_SAVE);
+        STATE1 = _mm_add_epi32(STATE1, CDGH_SAVE);
+
+        data += 64;
+        length -= 64;
+    }
+
+    TMP = _mm_shuffle_epi32(STATE0, 0x1B);       /* FEBA */
+    STATE1 = _mm_shuffle_epi32(STATE1, 0xB1);    /* DCHG */
+    STATE0 = _mm_blend_epi16(TMP, STATE1, 0xF0); /* DCBA */
+    STATE1 = _mm_alignr_epi8(STATE1, TMP, 8);    /* ABEF */
+
+    /* Save state */
+    _mm_storeu_si128((__m128i*) &state[0], STATE0);
+    _mm_storeu_si128((__m128i*) &state[4], STATE1);
+}
+#else
 static void
 SHA256_Transform(uint32_t state[8], const uint8_t block[64], uint32_t W[64],
                  uint32_t S[8])
@@ -143,6 +347,7 @@ SHA256_Transform(uint32_t state[8], const uint8_t block[64], uint32_t W[64],
         state[i] += S[i];
     }
 }
+#endif

 static const uint8_t PAD[64] = { 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                  0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
jedisct1 commented 6 years ago

Hi Jeff,

SHA256 is not used by any high-level APIs; it isn't even compiled in minimal mode, which means that it will eventually be removed. So, I don't think we should add new implementations for it at this point.

SHA512 would be far more interesting. While there are no dedicated opcodes for it, Intel provides some guidance for AVX implementations that might be worth checking out.

noloader commented 6 years ago

SHA512 would be far more interesting.

Ack. Hardware SHA-512 would mean Power8 or ARMv8.4.

I have not done a Power8 SHA-512 implementation yet. IBM has some of the worst docs on the planet, and it takes me three times as long to cut something in. It is on my roadmap.

ARMv8.4 was announced recently; see Introducing 2017's extensions to the Arm Architecture. There is no hardware in the field that I am aware of. ARM's FVP emulators provide it, but I have not jumped in yet. It is on my roadmap.

jedisct1 commented 6 years ago

Saw that.

SHA512 will be great, but SHA3 support in ARMv8.4 is even more exciting.

Do you happen to know if all we will get is the hash function, or if they will actually expose the permutation? That would be a game changer.