It's possible to use a SIMD value directly, without needing to allocate a buffer and load/store through it. This gives me a 10-15% lift on Ubuntu, which takes me to ~300 tok/s.
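
To illustrate the idea, here is a rough sketch in Rust. It is not the code from this PR: the function names, the `LANES` width, and the dot-product workload are all assumptions made for the example, and a fixed-size array stands in for a real SIMD value. The first version accumulates through a heap-allocated scratch buffer; the second keeps the accumulator as a local value that the compiler is free (but not guaranteed) to hold in vector registers.

```rust
// Assumed lane count, for illustration only.
const LANES: usize = 8;

/// Baseline: partial sums live in a heap-allocated scratch buffer,
/// so every step reads from and writes back to memory.
fn dot_with_buffer(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let split = a.len() - a.len() % LANES;
    let (a_main, a_tail) = a.split_at(split);
    let (b_main, b_tail) = b.split_at(split);

    let mut scratch = vec![0.0f32; LANES]; // heap allocation
    for (ca, cb) in a_main.chunks_exact(LANES).zip(b_main.chunks_exact(LANES)) {
        for i in 0..LANES {
            scratch[i] += ca[i] * cb[i]; // load + store through the buffer
        }
    }
    let tail: f32 = a_tail.iter().zip(b_tail).map(|(x, y)| x * y).sum();
    scratch.iter().sum::<f32>() + tail
}

/// Variant: the accumulator is a plain local value with no allocation.
/// A fixed-size array is used here as a stand-in for a SIMD vector.
fn dot_in_register(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let split = a.len() - a.len() % LANES;
    let (a_main, a_tail) = a.split_at(split);
    let (b_main, b_tail) = b.split_at(split);

    let mut acc = [0.0f32; LANES]; // local value, no heap buffer
    for (ca, cb) in a_main.chunks_exact(LANES).zip(b_main.chunks_exact(LANES)) {
        for i in 0..LANES {
            acc[i] += ca[i] * cb[i];
        }
    }
    let tail: f32 = a_tail.iter().zip(b_tail).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}
```

Both functions compute the same result; whether the second is actually faster depends on the compiler, target, and workload, so any gain should be measured rather than assumed.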
Thanks for the PR!
It got me around a 10% speedup in the playground environment, but on another VM I didn't see a speedup.
Anyway, this change makes sense, and I'm happy to merge.