Closed rofirrim closed 2 months ago
@nick-knight can you take a closer look at the matrix multiplication? I started from the existing example, but it presumes a transposed matrix in B (I think), so I reimplemented it with a more naive approach that lets us show a strided access. I also understand that partial accumulations should use tail undisturbed (regardless of the fact that this is a vfmacc, which already has 3 input operands).
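To make the two points above concrete, here is a minimal scalar sketch in plain C. It is not the code from this PR and it does not use the RVV intrinsics; it only models the access pattern. The name `matmul_strided_model` and the lane count `VLMAX` are hypothetical, chosen for illustration. With B stored row-major and not transposed, walking down a column of B is a strided access, and the accumulator lanes beyond the current vector length must keep their previous partial sums, which is what a tail-undisturbed policy would provide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical number of lanes in one vector register (assumption). */
#define VLMAX 4

/*
 * Scalar model of a naive C (n x m) = A (n x p) * B (p x m).
 * All matrices are row-major and B is NOT transposed, so consecutive
 * elements of a column of B are m elements apart (a strided access).
 */
void matmul_strided_model(const double *a, const double *b, double *c,
                          size_t n, size_t m, size_t p)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            /* Models the vector accumulator holding partial sums. */
            double acc[VLMAX] = {0.0};

            for (size_t k = 0; k < p; k += VLMAX) {
                /* Models the vector length granted for this strip. */
                size_t vl = (p - k < VLMAX) ? (p - k) : VLMAX;

                for (size_t lane = 0; lane < vl; lane++) {
                    double a_elem = a[i * p + (k + lane)]; /* unit stride */
                    double b_elem = b[(k + lane) * m + j]; /* stride of m */
                    acc[lane] += a_elem * b_elem;          /* models vfmacc */
                }
                /*
                 * Lanes vl..VLMAX-1 are deliberately left untouched here.
                 * This models tail undisturbed: on a short final strip the
                 * partial sums from earlier strips must survive.
                 */
            }

            /* Final reduction over all lanes of the accumulator. */
            double sum = 0.0;
            for (size_t lane = 0; lane < VLMAX; lane++) {
                sum += acc[lane];
            }
            c[i * m + j] = sum;
        }
    }
}
```

If the tail lanes were overwritten on the last strip (as a tail-agnostic policy permits), the final reduction would read unspecified values whenever p is not a multiple of the lane count.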
@rofirrim I've always strongly disliked the matmul example in this repo: it's not how a sane person would implement it.
I agree with @nick-knight. I think we should stick with code that we can confidently say will portably perform at close to peak performance.
We are using vector instructions because there is an assumption that they can speed up our applications. However, it is risky to make claims about performance in the examples. Different implementations will expose different performance characteristics, and we neither want to nor can cater to each one.
I think the examples should be just that, examples, and not necessarily a reference or a library of efficient functions. The matrix multiply example is intentionally qualified as "naive" in the examples for this reason.
Hi @kito-cheng, thanks a lot for merging this for me.
I will update v1.0.x to the current main so as not to stall further work (such as bf16).
The examples are non-normative. I've taken a subset of the examples in the examples directory of the repository.
This fixes #319