sarah-quinones / gemm

MIT License

M1 f16 intrinsics #13

Open Narsil opened 1 year ago

Narsil commented 1 year ago

Hey, opening an issue instead of a PR for this one because it's super dirty work atm:

Basically, on NEON aarch64 (M1 Mac) we can add pure f16 intrinsics and get a pretty sizeable speedup: something like ~2x on most matmuls.

https://github.com/LaurentMazare/gemm/pull/4

However, this requires hacking in new intrinsics using the asm! macro, which seems to confuse the compiler (most likely because I didn't write them properly).
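
For anyone unfamiliar with the approach, here is a minimal sketch of what wrapping a single half-precision NEON instruction with asm! can look like. It is not the code from the linked PR. The function names (`neg_f16x8`, `neg_f16x8_asm`, `neg_f16x8_portable`) are made up for illustration, the f16 lanes are passed as raw `u16` bit patterns on the assumption that no stable `f16` type is available, and FNEG is used only because its portable fallback is a one-line sign-bit flip. The FMUL/FMLA wrappers a matmul kernel needs would presumably look similar, with more operands.

```rust
/// Negate eight f16 lanes given as raw IEEE 754 binary16 bit patterns.
/// Uses the FNEG instruction when the CPU reports fp16 support,
/// otherwise falls back to flipping the sign bit in plain Rust.
pub fn neg_f16x8(x: [u16; 8]) -> [u16; 8] {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("fp16") {
            // SAFETY: fp16 support was checked at runtime just above.
            return unsafe { neg_f16x8_asm(x) };
        }
    }
    neg_f16x8_portable(x)
}

/// Portable fallback: negating a binary16 value only toggles bit 15.
pub fn neg_f16x8_portable(x: [u16; 8]) -> [u16; 8] {
    x.map(|h| h ^ 0x8000)
}

/// Hypothetical hand-written "intrinsic": one FNEG over eight f16 lanes.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon,fp16")]
unsafe fn neg_f16x8_asm(x: [u16; 8]) -> [u16; 8] {
    use core::arch::aarch64::{uint16x8_t, vld1q_u16, vst1q_u16};
    use core::arch::asm;

    unsafe {
        // Load the 8 lanes into a 128-bit vector register.
        let input: uint16x8_t = vld1q_u16(x.as_ptr());
        let output: uint16x8_t;
        // The `:v` modifier prints the register as `vN`, so the template
        // expands to something like `fneg v0.8h, v1.8h`.
        asm!(
            "fneg {dst:v}.8h, {src:v}.8h",
            dst = out(vreg) output,
            src = in(vreg) input,
            // The instruction has no side effects and touches no memory,
            // which lets the compiler reorder or eliminate it freely.
            options(pure, nomem, nostack),
        );
        let mut result = [0u16; 8];
        vst1q_u16(result.as_mut_ptr(), output);
        result
    }
}
```

Calling `neg_f16x8([0x3C00; 8])` (eight lanes of 1.0) should give eight lanes of 0xBC00 (-1.0) on either path. Declaring the operands and the `options(...)` precisely matters here: if they are missing or wrong, the compiler has to treat the block conservatively or may make invalid assumptions, which is one plausible source of the kind of confusion described above.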