M1 f16 intrinsics

Hey, opening an issue instead of a PR for this one because the work is still very rough at the moment.

Basically, on NEON aarch64 (M1 Mac) we can add pure f16 intrinsics and get a pretty sizeable speedup: something like ~2x on most matmuls.

However, this requires hacking in new intrinsics using the asm! macro, which seems to confuse the compiler (most likely because I didn't write them properly).
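For illustration, here is a minimal sketch of what one such hand-rolled intrinsic could look like. It is not the code from my branch: the names `fmla_f16x8` and `fma_f16x8` are hypothetical, f16 lanes are passed as raw `u16` bit patterns, and the sketch assumes the CPU exposes the `fp16` extension (checked at runtime). On targets other than aarch64 the wrapper simply returns `None`.

```rust
// Hypothetical sketch: an f16 fused multiply-add over eight lanes, written
// with inline assembly instead of a compiler-provided intrinsic.
// Each u16 holds the raw bit pattern of one IEEE 754 half-precision value.

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon,fp16")]
unsafe fn fmla_f16x8(acc: [u16; 8], a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    use core::arch::aarch64::{vld1q_u16, vst1q_u16};
    use core::arch::asm;
    unsafe {
        // Load the three operands into 128-bit NEON registers.
        let mut c = vld1q_u16(acc.as_ptr());
        let va = vld1q_u16(a.as_ptr());
        let vb = vld1q_u16(b.as_ptr());
        // c += va * vb, lane-wise, directly in half precision.
        // `pure, nomem, nostack` tells the compiler the block has no side
        // effects, which gives it more freedom to schedule around it.
        asm!(
            "fmla {c:v}.8h, {a:v}.8h, {b:v}.8h",
            c = inout(vreg) c,
            a = in(vreg) va,
            b = in(vreg) vb,
            options(pure, nomem, nostack, preserves_flags),
        );
        let mut out = [0u16; 8];
        vst1q_u16(out.as_mut_ptr(), c);
        out
    }
}

/// Safe wrapper: returns `Some(acc + a * b)` when the fp16 path is available,
/// and `None` on other architectures or CPUs without fp16 support.
pub fn fma_f16x8(acc: [u16; 8], a: [u16; 8], b: [u16; 8]) -> Option<[u16; 8]> {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("fp16") {
            // Safety: the required CPU feature was detected just above.
            return Some(unsafe { fmla_f16x8(acc, a, b) });
        }
    }
    let _ = (acc, a, b);
    None
}
```

With `acc = 1.0` (`0x3C00`), `a = 2.0` (`0x4000`) and `b = 3.0` (`0x4200`) in every lane, the fp16 path should yield `7.0` (`0x4700`) per lane. Whether missing `options(...)` hints are what confuses the compiler in my version is only a guess.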

sarah-quinones / gemm