Closed star26bsd closed 6 months ago
Not optimized yet; the arm64 binary is significantly slower than the x86_64 code.
Probably have to write native NEON or do something like https://github.com/DLTcollab/sse2neon.
Would probably prefer native NEON in the long run, though I don't have an ARM box to verify with.
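For the sse2neon route, the usual pattern is a small include shim so the existing SSE intrinsics compile unchanged on arm64. A minimal sketch, assuming the SIMD code currently includes `<immintrin.h>` (the exact header the project uses is an assumption):

```c
/* Hypothetical portability shim: on AArch64, sse2neon translates the
 * _mm_* intrinsics to NEON equivalents; elsewhere use the real header. */
#if defined(__aarch64__)
  #include "sse2neon.h"   /* from https://github.com/DLTcollab/sse2neon */
#else
  #include <immintrin.h>
#endif
```

This gets a working arm64 build quickly, at the cost of whatever overhead the translated intrinsics carry compared to hand-written NEON.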
A native NEON approach seems reasonable, especially as some intrinsics are rather simple to port. Here's an example for InputReLU():
+#include <arm_neon.h>
+INLINE void InputReLU(int8_t* outputs, Accumulator* acc, const int stm) {
+ const size_t WIDTH = sizeof(int16x8_t) / sizeof(acc_t);
+ const size_t CHUNKS = N_HIDDEN / WIDTH;
+ const int views[2] = {stm, !stm};
+
+ for (int v = 0; v < 2; v++) {
+ const int16x8_t* in = (int16x8_t*) acc->values[views[v]];
+ int8x8_t* out = (int8x8_t*) &outputs[N_HIDDEN * v];
+
+ for (size_t i = 0; i < CHUNKS; i++) {
+ int16x8_t s = vshrq_n_s16(in[i], 5);
+ int8x8_t packed = vqmovn_s16(s);
+ int8x8_t maxed = vmax_s8(packed, vdup_n_s8(0));
+ out[i] = maxed;
+ }
+ }
+}
(whereas others like _mm512_maddubs_epi16 seem to require more work). I am completely out of my depth here, but I am happy to help with testing and compiling. I don't even know if a performant ARM version is a goal of yours :)
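To illustrate why the maddubs intrinsics are the harder case: per 16-bit lane they multiply an unsigned 8-bit value by a signed 8-bit value and add adjacent products with signed saturation, which has no single NEON counterpart. A scalar reference for that semantics (the helper name `maddubs_ref` is made up for illustration; this is one 128-bit lane, the 512-bit version repeats it four times):

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
static int16_t sat16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t) x;
}

/* Scalar reference for _mm_maddubs_epi16: for each output lane i,
 * out[i] = saturate(a[2i]*b[2i] + a[2i+1]*b[2i+1]),
 * with a treated as unsigned bytes and b as signed bytes. */
static void maddubs_ref(int16_t out[8], const uint8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 8; i++)
        out[i] = sat16((int32_t) a[2*i]     * b[2*i] +
                       (int32_t) a[2*i + 1] * b[2*i + 1]);
}
```

A NEON port has to widen both inputs to 16 bits (e.g. via `vmull`-style multiplies), pair up adjacent products, and reproduce the saturation, which is why it takes several instructions rather than one.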
Cheers Stephan