Open Narsil opened 1 year ago
Hey, opening an issue instead of a PR for this one because it's super dirty work atm:
Basically, on NEON aarch64 (M1 Mac) we can add pure f16 intrinsics and get a pretty sizeable speedup: something like ~2x on most matmuls.
https://github.com/LaurentMazare/gemm/pull/4
However, this requires hacking in new intrinsics using the `arm!` macro, which seems to confuse the compiler (most likely because I didn't write them properly).