Closed star26bsd closed 6 months ago
Not optimized yet; the arm64 binary is significantly slower than the x86_64 code.
Probably have to write native NEON or do something like https://github.com/DLTcollab/sse2neon.
Would probably prefer native NEON in the long run, though I don't have an ARM box to verify with.
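For the sse2neon route, the usual pattern is a small include shim so the existing SSE intrinsics compile unchanged on arm64. A minimal sketch, assuming the SIMD code currently includes `<immintrin.h>` (the exact header the project uses is an assumption):

```c
/* Hypothetical portability shim: on AArch64, sse2neon translates the
 * _mm_* intrinsics to NEON equivalents; elsewhere use the real header. */
#if defined(__aarch64__)
  #include "sse2neon.h"   /* from https://github.com/DLTcollab/sse2neon */
#else
  #include <immintrin.h>
#endif
```

This gets a working arm64 build quickly, at the cost of whatever overhead the translated intrinsics carry compared to hand-written NEON.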
A native NEON approach seems reasonable, especially as some intrinsics are rather simple to port. Here's an example for InputReLU():
+#include <arm_neon.h>
+INLINE void InputReLU(int8_t* outputs, Accumulator* acc, const int stm) {
+ const size_t WIDTH = sizeof(int16x8_t) / sizeof(acc_t);
+ const size_t CHUNKS = N_HIDDEN / WIDTH;
+ const int views[2] = {stm, !stm};
+
+ for (int v = 0; v < 2; v++) {
+ const int16x8_t* in = (int16x8_t*) acc->values[views[v]];
+ int8x8_t* out = (int8x8_t*) &outputs[N_HIDDEN * v];
+
+ for (size_t i = 0; i < CHUNKS; i++) {
+ int16x8_t s = vshrq_n_s16(in[i], 5);
+ int8x8_t packed = vqmovn_s16(s);
+ int8x8_t maxed = vmax_s8(packed, vdup_n_s8(0));
+ out[i] = maxed;
+ }
+ }
+}
(whereas others like _mm512_maddubs_epi16 seem to require more work). I am completely out of my depth here, but I am happy to help with testing and compiling. I don't even know if a performant ARM version is a goal of yours :)
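To illustrate why the maddubs intrinsics are the harder case: per 16-bit lane they multiply an unsigned 8-bit value by a signed 8-bit value and add adjacent products with signed saturation, which has no single NEON counterpart. A scalar reference for that semantics (the helper name `maddubs_ref` is made up for illustration; this is one 128-bit lane, the 512-bit version repeats it four times):

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
static int16_t sat16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t) x;
}

/* Scalar reference for _mm_maddubs_epi16: for each output lane i,
 * out[i] = saturate(a[2i]*b[2i] + a[2i+1]*b[2i+1]),
 * with a treated as unsigned bytes and b as signed bytes. */
static void maddubs_ref(int16_t out[8], const uint8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 8; i++)
        out[i] = sat16((int32_t) a[2*i]     * b[2*i] +
                       (int32_t) a[2*i + 1] * b[2*i + 1]);
}
```

A NEON port has to widen both inputs to 16 bits (e.g. via `vmull`-style multiplies), pair up adjacent products, and reproduce the saturation, which is why it takes several instructions rather than one.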
Cheers Stephan