randombit / botan

Cryptography Toolkit
https://botan.randombit.net
BSD 2-Clause "Simplified" License

Add Power8 AES Encryption #1206

noloader closed this issue 6 years ago

noloader commented 7 years ago

Attached and below is a patch for AES using Power8 built-ins. It's another partial patch, and it hijacks the C++ implementation. Others will have to complete it.

This patch only provides the forward transformation, or encryption. Botan and Crypto++ both use the "Equivalent Inverse Cipher" (FIPS-197, Section 5.3.5, p. 23), which is not compatible with the IBM hardware. Both libraries will need to rework their decryption key scheduling routines. (It could be as simple as using the encryption key table for decryption; I have not investigated it yet.)


The patch looks awful because there are two abstraction layers. The first deals with GCC, xlC and platform endianness. GCC and xlC have different data types and built-ins. GCC only has the 64-bit element types, while xlC has all the types, including the 8-bit ones. For GCC we have to do a fair amount of work on endian conversions.

xlC does not have the endian problems because we can load the buffer as an array of 8-bit elements using either vec_xl_be or vec_xl. When needed with xlC, we can perform one permute using vec_reve. GCC lacks that sort of ease of use.
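The permute the GCC path performs on little-endian targets is simply a 16-byte reversal. Here is a minimal scalar sketch of the same transform in plain C++ (no AltiVec; the function name is illustrative, not from the patch):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Scalar equivalent of the vec_perm/vec_reve byte reversal that the
// vector code performs after a load on little-endian systems.
void byte_reverse_16(uint8_t dest[16], const uint8_t src[16])
{
   uint8_t tmp[16];
   std::memcpy(tmp, src, 16);      // copy first so dest == src is safe
   std::reverse(tmp, tmp + 16);    // {0..15} becomes {15..0}
   std::memcpy(dest, tmp, 16);
}
```

The vector versions do the same thing in one instruction, which is why the GCC path carries the `{15,14,...,1,0}` permute mask around.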

The second layer abstracts the higher level operations, like VectorEncrypt, VectorDecrypt, VectorEncryptLast and VectorDecryptLast.

The abstraction layers were a lot like trying to put lipstick on a pig. There's really no way to make the code look elegant when xlC had one set of primitives and operations, and GCC had another which was merely a subset.


The VMX unit is very sensitive to buffer alignments. For Power7 and VMX, buffers must be aligned. Power8 and VSX are a little more tolerant, and you can load unaligned buffers using different instructions. The relevant built-ins are vec_xl_be and vec_xl for xlC, and vec_vsx_ld for GCC. GCC has to permute after the load on little-endian systems.

The code assumes an aligned key buffer because the library controls it. The code does not assume aligned buffers for in and out because the user controls them. The code asserts at runtime in debug builds:

BOTAN_ASSERT((size_t)EK.data() % 16 == 0, "Oops");
const uint8_t* ek = reinterpret_cast<const uint8_t*>(EK.data());
...
VectorType  k = VectorLoadAligned(ek);

If the key buffers are not aligned, then you can change to the following:

VectorType  k = VectorLoad(ek);

I believe AIX and xlC behave differently than Linux and GCC, so be sure to test both systems. I was seeing 2-byte and 4-byte alignments under AIX and xlC.

To reiterate, if you cannot guarantee aligned buffers, then use VectorLoad instead of VectorLoadAligned.


Here are the numbers from GCC112 (compile farm), which is a 3.4 GHz IBM POWER System S822. Botan was configured with ./configure.py --cc=gcc --cc-abi="-mcpu=power8 -maltivec".

If my calculations are correct, Botan is pushing data at about 1 cpb for AES-128 on the machine. Andy Polyakov has OpenSSL running at about 0.7 cpb, so I think it is a very respectable number.
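The cycles-per-byte figure falls out of the clock rate and the measured throughput. A quick sanity check of the arithmetic (plain C++; the function is just for illustration):

```cpp
// Cycles per byte = clock rate (cycles/sec) / throughput (bytes/sec).
double cycles_per_byte(double clock_ghz, double mib_per_sec)
{
   const double bytes_per_sec = mib_per_sec * 1024.0 * 1024.0;
   return (clock_ghz * 1.0e9) / bytes_per_sec;
}
```

Plugging in the 3.4 GHz clock and the 3262.216 MiB/s AES-128 encrypt figure below gives roughly 0.99 cpb, consistent with the ~1 cpb estimate.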

$ ./botan speed --msec=3000 AES-128 AES-192 AES-256
AES-128 [base] encrypt buffer size 4096 bytes: 3262.216 MiB/sec (9786.688 MiB in 3000.012 ms)
AES-128 [base] decrypt buffer size 4096 bytes: 148.706 MiB/sec (446.125 MiB in 3000.044 ms)
AES-192 [base] encrypt buffer size 4096 bytes: 2893.181 MiB/sec (8679.562 MiB in 3000.006 ms)
AES-192 [base] decrypt buffer size 4096 bytes: 127.099 MiB/sec (381.312 MiB in 3000.122 ms)
AES-256 [base] encrypt buffer size 4096 bytes: 2449.247 MiB/sec (7347.750 MiB in 3000.004 ms)
AES-256 [base] decrypt buffer size 4096 bytes: 111.480 MiB/sec (334.500 MiB in 3000.551 ms)

algo                          operation  4096 bytes
AES-128 [base]                decrypt    155929.68
AES-192 [base]                decrypt    133272.98
AES-256 [base]                decrypt    116894.76
AES-128 [base]                encrypt    3420681.92
AES-192 [base]                encrypt    3033720.64
AES-256 [base]                encrypt    2568221.13

On AIX you can call the following to get the L1 data cache line size. On GCC112 it returns 128!

#ifdef _AIX
#  include <sys/systemcfg.h>
#endif
...

#ifdef _AIX
  // /usr/include/sys/systemcfg.h
  cacheLineSize = getsystemcfg(SC_L1C_DLS);
#endif

Glibc prior to 2.24 does not offer AT_HWCAP or AT_HWCAP2 to signal Power8 in-core crypto, so it's fudged in the patch: HasPower8 unconditionally returns true. The define of interest on Linux is PPC_FEATURE2_VEC_CRYPTO. I also don't know how to query it on AIX.

$ git diff > aes-p8.diff
$ cat aes-p8.diff
diff --git a/src/lib/block/aes/aes.cpp b/src/lib/block/aes/aes.cpp
index 0878d84ae..69fbfd27c 100644
--- a/src/lib/block/aes/aes.cpp
+++ b/src/lib/block/aes/aes.cpp
@@ -12,6 +12,21 @@
 #include <botan/cpuid.h>
 #include <botan/internal/bit_ops.h>

+#include <altivec.h>
+#undef vector
+#undef pixel
+#undef bool
+
+#if defined(__linux__)
+# include <sys/auxv.h>
+#endif
+
+# define BOTAN_HAS_AES_POWER8 1
+
+// GCC and IBM XL C/C++ command lines
+//  ./configure.py --cc=gcc --cc-abi="-mcpu=power8 -maltivec"
+//  ./configure.py --cc=xlc --cc-abi="-qarch=pwr8 -qaltivec"
+
 /*
 * This implementation is based on table lookups which are known to be
 * vulnerable to timing and cache based side channel attacks. Some
@@ -45,6 +60,263 @@

 namespace Botan {

+//////////////////////////////////////////////////////////////////
+
+namespace {
+
+// http://elixir.free-electrons.com/linux/latest/source/arch/powerpc/include/uapi/asm/cputable.h
+#ifndef PPC_FEATURE2_VEC_CRYPTO
+# define PPC_FEATURE2_VEC_CRYPTO  0x02000000
+#endif
+
+inline bool HasPower8()
+{
+   // Power8 and ISA 2.07 provide in-core crypto. I believe
+   // Glibc 2.24 provides the HWCAP define. Also see
+   // https://stackoverflow.com/q/46144668/608639
+#if defined(__linux__)
+   if (getauxval(AT_HWCAP2) & PPC_FEATURE2_ARCH_2_07)
+      return true;
+   if (getauxval(AT_HWCAP2) & PPC_FEATURE2_VEC_CRYPTO)
+      return true;
+#endif
+   return true;
+}
+
+// IBM XL C/C++ compiler
+#if defined(__xlc__) || defined(__xlC__)
+# define BOTAN_XLC_VERSION 1
+#endif
+
+typedef __vector unsigned char      uint8x16_p8;
+typedef __vector unsigned long long uint64x2_p8;
+
+//////////////////////////////////////////////////////////////////
+
+/* Reverses a 16-byte array as needed */
+void ByteReverseArrayLE(uint8_t dest[16], const uint8_t src[16])
+{
+#if defined(BOTAN_XLC_VERSION) && defined(BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN)
+   vec_st(vec_reve(vec_ld(0, src)), 0, dest);
+#elif defined(BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN)
+   const uint8x16_p8 mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
+   const uint8x16_p8 zero = {0};
+   vec_vsx_st(vec_perm(vec_vsx_ld(0, src), zero, mask), 0, dest);
+#else
+   if (src != dest)
+      std::memcpy(dest, src, 16);
+#endif
+}
+
+void ByteReverseArrayLE(uint8_t src[16])
+{
+#if defined(BOTAN_XLC_VERSION) && defined(BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN)
+   vec_st(vec_reve(vec_ld(0, src)), 0, src);
+#elif defined(BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN)
+   const uint8x16_p8 mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
+   const uint8x16_p8 zero = {0};
+   vec_vsx_st(vec_perm(vec_vsx_ld(0, src), zero, mask), 0, src);
+#endif
+}
+
+uint8x16_p8 Load8x16(const uint8_t src[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   /* http://stackoverflow.com/q/46124383/608639 */
+   return vec_xl_be(0, (uint8_t*)src);
+#else
+   /* GCC, Clang, etc */
+   return (uint8x16_p8)vec_vsx_ld(0, src);
+#endif
+}
+
+uint8x16_p8 Load8x16(int off, const uint8_t src[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   /* http://stackoverflow.com/q/46124383/608639 */
+   return vec_xl_be(off, (uint8_t*)src);
+#else
+   /* GCC, Clang, etc */
+   return (uint8x16_p8)vec_vsx_ld(off, src);
+#endif
+}
+
+void Store8x16(const uint8x16_p8 src, uint8_t dest[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   vec_xst_be(src, 0, (uint8_t*)dest);
+#else
+   /* GCC, Clang, etc */
+   vec_vsx_st(src, 0, dest);
+#endif
+}
+
+uint64x2_p8 Load64x2(const uint8_t src[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   /* http://stackoverflow.com/q/46124383/608639 */
+   return (uint64x2_p8)vec_xl_be(0, (uint8_t*)src);
+#else
+   /* GCC, Clang, etc */
+# if defined(BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN)
+   const uint8x16_p8 mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
+   const uint8x16_p8 zero = {0};
+   return (uint64x2_p8)vec_perm(vec_vsx_ld(0, src), zero, mask);
+# else
+   return (uint64x2_p8)vec_vsx_ld(0, src);
+# endif
+#endif
+}
+
+uint64x2_p8 Load64x2(int off, const uint8_t src[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   /* http://stackoverflow.com/q/46124383/608639 */
+   return (uint64x2_p8)vec_xl_be(off, (uint8_t*)src);
+#else
+   /* GCC, Clang, etc */
+# if defined(BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN)
+   const uint8x16_p8 mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
+   const uint8x16_p8 zero = {0};
+   return (uint64x2_p8)vec_perm(vec_vsx_ld(off, src), zero, mask);
+# else
+   return (uint64x2_p8)vec_vsx_ld(off, src);
+# endif
+#endif
+}
+
+void Store64x2(const uint64x2_p8 src, uint8_t dest[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   vec_xst_be((uint8x16_p8)src, 0, (uint8_t*)dest);
+#else
+   /* GCC, Clang, etc */
+# if defined(BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN)
+   const uint8x16_p8 mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
+   const uint8x16_p8 zero = {0};
+   vec_vsx_st(vec_perm((uint8x16_p8)src, zero, mask), 0, dest);
+# else
+   vec_vsx_st((uint8x16_p8)src, 0, dest);
+# endif
+#endif
+}
+
+//////////////////////////////////////////////////////////////////
+
+#if defined(BOTAN_XLC_VERSION)
+ typedef uint8x16_p8 VectorType;
+#elif defined(BOTAN_GCC_VERSION)
+ typedef uint64x2_p8 VectorType;
+#endif
+
+// Loads a mis-aligned byte array, performs an endian conversion.
+inline VectorType VectorLoad(const uint8_t src[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   return Load8x16(src);
+#elif defined(BOTAN_GCC_VERSION)
+   return Load64x2(src);
+#endif
+}
+
+// Loads a mis-aligned byte array, performs an endian conversion.
+inline VectorType VectorLoad(int off, const uint8_t src[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   return Load8x16(off, src);
+#elif defined(BOTAN_GCC_VERSION)
+   return Load64x2(off, src);
+#endif
+}
+
+// Loads an aligned byte array, does not perform an endian conversion.
+//  This function presumes the subkey table is correct endianess.
+inline VectorType VectorLoadKey(const uint8_t vec[16])
+{
+   return (VectorType)vec_ld(0, vec);
+}
+
+// Loads an aligned byte array, does not perform an endian conversion.
+//  This function presumes the subkey table is correct endianess.
+inline VectorType VectorLoadKey(int off, const uint8_t vec[16])
+{
+   return (VectorType)vec_ld(off, vec);
+}
+
+// Stores to a mis-aligned byte array, performs an endian conversion.
+inline void VectorStore(const VectorType& src, uint8_t dest[16])
+{
+#if defined(BOTAN_XLC_VERSION)
+   return Store8x16(src, dest);
+#elif defined(BOTAN_GCC_VERSION)
+   return Store64x2(src, dest);
+#endif
+}
+
+template <class T1, class T2>
+inline T1 VectorXor(const T1& vec1, const T2& vec2)
+{
+   return (T1)vec_xor(vec1, (T1)vec2);
+}
+
+template <class T1, class T2>
+inline T1 VectorAdd(const T1& vec1, const T2& vec2)
+{
+   return (T1)vec_add(vec1, (T1)vec2);
+}
+
+template <class T1, class T2>
+inline T1 VectorEncrypt(const T1& state, const T2& key)
+{
+#if defined(BOTAN_XLC_VERSION)
+   return (T1)__vcipher(state, (T1)key);
+#elif defined(BOTAN_GCC_VERSION)
+   return (T1)__builtin_crypto_vcipher(state, (T1)key);
+#else
+   BOTAN_ASSERT(0);
+#endif
+}
+
+template <class T1, class T2>
+inline T1 VectorEncryptLast(const T1& state, const T2& key)
+{
+#if defined(BOTAN_XLC_VERSION)
+   return (T1)__vcipherlast(state, (T1)key);
+#elif defined(BOTAN_GCC_VERSION)
+   return (T1)__builtin_crypto_vcipherlast(state, (T1)key);
+#else
+   BOTAN_ASSERT(0);
+#endif
+}
+
+template <class T1, class T2>
+inline T1 VectorDecrypt(const T1& state, const T2& key)
+{
+#if defined(BOTAN_XLC_VERSION)
+   return (T1)__vncipher(state, (T1)key);
+#elif defined(BOTAN_GCC_VERSION)
+   return (T1)__builtin_crypto_vncipher(state, (T1)key);
+#else
+   BOTAN_ASSERT(0);
+#endif
+}
+
+template <class T1, class T2>
+inline T1 VectorDecryptLast(const T1& state, const T2& key)
+{
+#if defined(BOTAN_XLC_VERSION)
+   return (T1)__vncipherlast(state, (T1)key);
+#elif defined(BOTAN_GCC_VERSION)
+   return (T1)__builtin_crypto_vncipherlast(state, (T1)key);
+#else
+   BOTAN_ASSERT(0);
+#endif
+}
+
+}  // namespace
+
+//////////////////////////////////////////////////////////////////
+
 namespace {

 const uint8_t SE[256] = {
@@ -157,9 +429,11 @@ void aes_encrypt_n(const uint8_t in[], uint8_t out[],
    BOTAN_ASSERT(EK.size() && ME.size() == 16, "Key was set");

    const size_t cache_line_size = CPUID::cache_line_size();
-
    const std::vector<uint32_t>& TE = AES_TE();

+   const size_t rounds = EK.size() / 4;
+   BOTAN_ASSERT(rounds == 10 || rounds == 12 || rounds == 14, "Oops");
+
    // Hit every cache line of TE
    uint32_t Z = 0;
    for(size_t i = 0; i < TE.size(); i += cache_line_size / sizeof(uint32_t))
@@ -168,83 +442,68 @@ void aes_encrypt_n(const uint8_t in[], uint8_t out[],
       }
    Z &= TE[82]; // this is zero, which hopefully the compiler cannot deduce

-   for(size_t i = 0; i < blocks; ++i)
+   // Keys must be aligned on a 16-byte boundary. in and out can be unaligned.
+   BOTAN_ASSERT((size_t)ME.data() % 16 == 0, "Oops");
+   BOTAN_ASSERT((size_t)EK.data() % 16 == 0, "Oops");
+   const uint8_t* ek = reinterpret_cast<const uint8_t*>(EK.data());
+   const uint8_t* me = reinterpret_cast<const uint8_t*>(ME.data());
+
+   while(blocks >= 4)
       {
-      uint32_t T0, T1, T2, T3;
-      load_be(in + 16*i, T0, T1, T2, T3);
+      VectorType  k = VectorLoadKey(ek);
+      VectorType s1 = VectorLoad( 0, in);
+      VectorType s2 = VectorLoad(16, in);
+      VectorType s3 = VectorLoad(32, in);
+      VectorType s4 = VectorLoad(48, in);

-      T0 ^= EK[0];
-      T1 ^= EK[1];
-      T2 ^= EK[2];
-      T3 ^= EK[3];
+      s1 = VectorXor(s1, k);
+      s2 = VectorXor(s2, k);
+      s3 = VectorXor(s3, k);
+      s4 = VectorXor(s4, k);

-      T0 ^= Z;
+      for (size_t i=1; i<rounds; ++i)
+      {
+          k = VectorLoadKey(i*16, ek);
+         s1 = VectorEncrypt(s1, k);
+         s2 = VectorEncrypt(s2, k);
+         s3 = VectorEncrypt(s3, k);
+         s4 = VectorEncrypt(s4, k);
+      }

-      /* Use only the first 256 entries of the TE table and do the
-      * rotations directly in the code. This reduces the number of
-      * cache lines potentially used in the first round from 64 to 16
-      * (assuming a typical 64 byte cache line), which makes timing
-      * attacks a little harder; the first round is particularly
-      * vulnerable.
-      */
-
-      uint32_t B0 = TE[get_byte(0, T0)] ^
-                  rotate_right(TE[get_byte(1, T1)],  8) ^
-                  rotate_right(TE[get_byte(2, T2)], 16) ^
-                  rotate_right(TE[get_byte(3, T3)], 24) ^ EK[4];
-
-      uint32_t B1 = TE[get_byte(0, T1)] ^
-                  rotate_right(TE[get_byte(1, T2)],  8) ^
-                  rotate_right(TE[get_byte(2, T3)], 16) ^
-                  rotate_right(TE[get_byte(3, T0)], 24) ^ EK[5];
-
-      uint32_t B2 = TE[get_byte(0, T2)] ^
-                  rotate_right(TE[get_byte(1, T3)],  8) ^
-                  rotate_right(TE[get_byte(2, T0)], 16) ^
-                  rotate_right(TE[get_byte(3, T1)], 24) ^ EK[6];
-
-      uint32_t B3 = TE[get_byte(0, T3)] ^
-                  rotate_right(TE[get_byte(1, T0)],  8) ^
-                  rotate_right(TE[get_byte(2, T1)], 16) ^
-                  rotate_right(TE[get_byte(3, T2)], 24) ^ EK[7];
-
-      for(size_t r = 2*4; r < EK.size(); r += 2*4)
-         {
-         T0 = EK[r  ] ^ TE[get_byte(0, B0)      ] ^ TE[get_byte(1, B1) + 256] ^
-                        TE[get_byte(2, B2) + 512] ^ TE[get_byte(3, B3) + 768];
-         T1 = EK[r+1] ^ TE[get_byte(0, B1)      ] ^ TE[get_byte(1, B2) + 256] ^
-                        TE[get_byte(2, B3) + 512] ^ TE[get_byte(3, B0) + 768];
-         T2 = EK[r+2] ^ TE[get_byte(0, B2)      ] ^ TE[get_byte(1, B3) + 256] ^
-                        TE[get_byte(2, B0) + 512] ^ TE[get_byte(3, B1) + 768];
-         T3 = EK[r+3] ^ TE[get_byte(0, B3)      ] ^ TE[get_byte(1, B0) + 256] ^
-                        TE[get_byte(2, B1) + 512] ^ TE[get_byte(3, B2) + 768];
-
-         B0 = EK[r+4] ^ TE[get_byte(0, T0)      ] ^ TE[get_byte(1, T1) + 256] ^
-                        TE[get_byte(2, T2) + 512] ^ TE[get_byte(3, T3) + 768];
-         B1 = EK[r+5] ^ TE[get_byte(0, T1)      ] ^ TE[get_byte(1, T2) + 256] ^
-                        TE[get_byte(2, T3) + 512] ^ TE[get_byte(3, T0) + 768];
-         B2 = EK[r+6] ^ TE[get_byte(0, T2)      ] ^ TE[get_byte(1, T3) + 256] ^
-                        TE[get_byte(2, T0) + 512] ^ TE[get_byte(3, T1) + 768];
-         B3 = EK[r+7] ^ TE[get_byte(0, T3)      ] ^ TE[get_byte(1, T0) + 256] ^
-                        TE[get_byte(2, T1) + 512] ^ TE[get_byte(3, T2) + 768];
-         }
+       k = VectorLoadKey(0, me);
+      s1 = VectorEncryptLast(s1, k);
+      s2 = VectorEncryptLast(s2, k);
+      s3 = VectorEncryptLast(s3, k);
+      s4 = VectorEncryptLast(s4, k);
+
+      VectorStore(s1, out+0);
+      VectorStore(s2, out+16);
+      VectorStore(s3, out+32);
+      VectorStore(s4, out+48);
+
+      blocks -= 4;
+      in += 64; out += 64;
+      }

-      out[16*i+ 0] = SE[get_byte(0, B0)] ^ ME[0];
-      out[16*i+ 1] = SE[get_byte(1, B1)] ^ ME[1];
-      out[16*i+ 2] = SE[get_byte(2, B2)] ^ ME[2];
-      out[16*i+ 3] = SE[get_byte(3, B3)] ^ ME[3];
-      out[16*i+ 4] = SE[get_byte(0, B1)] ^ ME[4];
-      out[16*i+ 5] = SE[get_byte(1, B2)] ^ ME[5];
-      out[16*i+ 6] = SE[get_byte(2, B3)] ^ ME[6];
-      out[16*i+ 7] = SE[get_byte(3, B0)] ^ ME[7];
-      out[16*i+ 8] = SE[get_byte(0, B2)] ^ ME[8];
-      out[16*i+ 9] = SE[get_byte(1, B3)] ^ ME[9];
-      out[16*i+10] = SE[get_byte(2, B0)] ^ ME[10];
-      out[16*i+11] = SE[get_byte(3, B1)] ^ ME[11];
-      out[16*i+12] = SE[get_byte(0, B3)] ^ ME[12];
-      out[16*i+13] = SE[get_byte(1, B0)] ^ ME[13];
-      out[16*i+14] = SE[get_byte(2, B1)] ^ ME[14];
-      out[16*i+15] = SE[get_byte(3, B2)] ^ ME[15];
+   while(blocks--)
+      {
+      VectorType k = VectorLoadKey(ek);
+      VectorType s = VectorLoad(0, in);
+
+      s = VectorXor(s, k);
+
+      for (size_t i=1; i<rounds; ++i)
+      {
+         k = VectorLoadKey(i*16, ek);
+         s = VectorEncrypt(s, k);
+      }
+
+      k = VectorLoadKey(0, me);
+      s = VectorEncryptLast(s, k);
+
+      VectorStore(s, out+0);
+
+      in += 16; out += 16;
       }
    }

@@ -424,7 +683,30 @@ void aes_key_schedule(const uint8_t key[], size_t length,
          DK[i] = reverse_bytes(DK[i]);
       }
 #endif
-
+#if defined(BOTAN_HAS_AES_POWER8)
+   if(HasPower8())
+      {
+      // Power8 needs the subkeys to be byte reversed on 32-bit boundaries
+      for(size_t i = 0; i != EK.size(); ++i)
+         EK[i] = reverse_bytes(EK[i]);
+      //for(size_t i = 0; i != DK.size(); ++i)
+      //   DK[i] = reverse_bytes(DK[i]);
+
+#if defined(BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN)
+      // ... And Power8 needs the all round keys to be 128-bit byte
+      //   reversed for little-endian in addition to the 32-bit reverse
+      //   that occurred earlier. It ensures proper endianess after a
+      //   round key is loaded into a VSX register.
+      ByteReverseArrayLE(reinterpret_cast<uint8_t*>(ME.data()));
+      //ByteReverseArrayLE(reinterpret_cast<uint8_t*>(MD.data()));
+
+      for(size_t i = 0; i < rounds; ++i)
+         ByteReverseArrayLE(reinterpret_cast<byte*>(EK.data())+i*16);
+      //for(size_t i = 0; i < rounds; ++i)
+      //   ByteReverseArrayLE(reinterpret_cast<byte*>(DK.data())+i*16);
+#endif  // BOTAN_TARGET_CPU_IS_LITTLE_ENDIAN
+      }
+#endif  // BOTAN_HAS_AES_POWER8
    }

 size_t aes_parallelism()

Here's the ZIP of the diff above: aes-p8.diff.zip.

randombit commented 7 years ago

Thanks! The initial research you've done here really helps with supporting this. Hopefully I will be able to finish this off before the 2.3 release (#1156).

BTW it looks like in the last couple of months patches have been added to GCC for vec_xl_be and vec_reve, so this code can eventually become simpler. Though it will be a while yet before GCC 8 is released, much less before we can assume it. :/

noloader commented 7 years ago

@randombit,

I updated the patch and added some comments to VectorLoad and VectorLoadKey. VectorLoadKey is the former VectorLoadAligned. The comments and the rename are the update. Nothing else changed.

I also tested swapping VectorLoad and VectorLoadKey. They are not interchangeable, because VectorLoad performs an endian swap as necessary, while VectorLoadKey assumes the key is already in the correct endianness. Sorry about that.

noloader commented 7 years ago

@randombit,

... it looks like in the last couple of months patches have been added to GCC for vec_xl_be and vec_reve, so this code can eventually become simpler.

You were right about some of the built-in functions. I was able to refactor and clean them up a bit; see the aes-p8.c proof of concept. The demo works going back to GCC 4.8 on GCC112 from the compile farm, which is what I have been using to test Botan :)

randombit commented 6 years ago

Done! Thanks again for the help, jww