huoyushequ opened 10 months ago
Changing the input type from `[[Simd<f32, 8>; B]; IN]` to `[[Simd<f32, 8>; B]; IN/8]` can greatly reduce stack usage and slightly improve inference speed.
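A rough way to see the stack saving is to compare the byte sizes of the two layouts. This sketch is not from the repo. It uses `[f32; 8]` as a stable stand-in for the nightly-only `Simd<f32, 8>` (both are eight `f32` lanes, 32 bytes), and `B = 4`, `IN = 4096` are made-up sizes chosen only for illustration.

```rust
use std::mem::size_of;

// Stable stand-in for `std::simd::Simd<f32, 8>` (nightly-only); both are 8 f32 lanes = 32 bytes.
type Lane8 = [f32; 8];

// Illustrative sizes only, not taken from the model config.
const B: usize = 4;
const IN: usize = 4096;

// Old layout: IN vectors of 8 lanes per batch row, i.e. room for 8 * IN * B floats.
type OldLayout = [[Lane8; B]; IN];
// New layout: each 8-lane vector packs 8 consecutive inputs, i.e. room for IN * B floats.
type NewLayout = [[Lane8; B]; IN / 8];

// `size_of` is a const fn, so these are computed at compile time without building the arrays.
const OLD_BYTES: usize = size_of::<OldLayout>(); // 4096 * 4 * 32 = 524288 (512 KiB)
const NEW_BYTES: usize = size_of::<NewLayout>(); // 512 * 4 * 32 = 65536 (64 KiB)
```

Whatever `B` and `IN` are, the new layout is 8x smaller, which matters when these arrays live on the stack.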
This is a good change. Does it affect speed?
I was considering enabling a feature-gated Rust feature that allows arithmetic on generic constants, but this solution is probably better for now.
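For context on the generic constant arithmetic: on stable Rust, a type like `[[f32; 8]; IN / 8]` is rejected when `IN` is a generic const parameter, because arithmetic on generic consts needs the nightly-only (and still incomplete) `generic_const_exprs` feature. A common stable workaround, sketched here with a hypothetical `PackedInput` type that is not from the repo, is to take the derived length as its own const parameter and check the relationship at runtime. `[f32; 8]` again stands in for `Simd<f32, 8>`.

```rust
// Hypothetical type, not the repo's: holds a (B, IN) input as IN8 chunks of 8 lanes.
// `IN8` stands in for `IN / 8`, which stable Rust cannot write in a type when IN is generic.
struct PackedInput<const B: usize, const IN: usize, const IN8: usize> {
    data: [[[f32; 8]; B]; IN8],
}

impl<const B: usize, const IN: usize, const IN8: usize> PackedInput<B, IN, IN8> {
    fn zeros() -> Self {
        // Runtime check replacing the `IN8 == IN / 8` bound the type system can't express here.
        assert!(IN % 8 == 0 && IN8 == IN / 8, "IN must be a multiple of 8 and IN8 == IN / 8");
        PackedInput { data: [[[0.0; 8]; B]; IN8] }
    }
}
```

For example, `PackedInput::<2, 64, 8>::zeros()` succeeds, while `PackedInput::<2, 64, 4>::zeros()` panics. The trade-off is that a mismatched `IN8` is caught when the code runs rather than at compile time.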
I tried `#![feature(generic_const_exprs)]`, but it generates a bunch of warnings. On my desktop, the speed went from `achieved tok/s: 4.8163757` to `achieved tok/s: 5.0361977` (about 4.6% faster).
Sorry! I'll get this checked in. I just made a bunch of other changes that I need to merge first.
This project is a great place to learn Rust, Llama, and CUDA (Triton). Much appreciated, and I hope to contribute something helpful to the project.
I'd love any contributions; I'm also learning Rust and Triton on the fly.
What if we try this library? It seems pretty cool. https://docs.rs/typenum/latest/typenum/
Another idea would be to explore adding tests. I'm not sure how unit tests work in Rust, but it would be nice to have them for small sizes.
Importing typenum would unnecessarily complicate the repo and make the code unintuitive just to work around the `generic_const_exprs` limitation.
I'm happy to write some unit tests once I finish carefully reading the source code.
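On how unit tests work in Rust: they usually live in a `#[cfg(test)] mod tests` block in the same file and run with `cargo test`. Below is a minimal sketch for small sizes. `matvec_ref` is a hypothetical naive reference (weights as `(OUT, IN)`, input as `(B, IN)`), not the repo's `QLinear::matvec`.

```rust
// Hypothetical naive matvec: w is (OUT, IN), x is (B, IN), result is (B, OUT).
fn matvec_ref<const B: usize, const IN: usize, const OUT: usize>(
    w: &[[f32; IN]; OUT],
    x: &[[f32; IN]; B],
) -> [[f32; OUT]; B] {
    let mut out = [[0.0f32; OUT]; B];
    for b in 0..B {
        for o in 0..OUT {
            // Dot product of weight row `o` with input row `b`.
            out[b][o] = (0..IN).map(|i| w[o][i] * x[b][i]).sum();
        }
    }
    out
}

// Compiled only for `cargo test`; each #[test] function is run as a separate test case.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn identity_weights_return_input() {
        let w = [[1.0, 0.0], [0.0, 1.0]];
        let x = [[3.0, -2.0]];
        assert_eq!(matvec_ref(&w, &x), [[3.0, -2.0]]);
    }

    #[test]
    fn small_known_values() {
        let w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
        let x = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]];
        // Row b of the result is w applied to x[b]: [6, 15] and [8, 17].
        assert_eq!(matvec_ref(&w, &x), [[6.0, 15.0], [8.0, 17.0]]);
    }
}
```

A simple reference like this could also serve as ground truth for checking a SIMD path on small inputs (with a float tolerance instead of exact equality once summation order differs).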
`SIMD_8` is used in the `matvec` method of `QLinear`, so the input `x` with shape `(B, IN)` should be transformed into `[[Simd<f32, 8>; B]; IN/8]`.
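A minimal sketch of that transformation, assuming `x` is stored row-major as `[[f32; IN]; B]`. `pack_input` is a hypothetical helper, not the repo's code; it uses `[f32; 8]` in place of `Simd<f32, 8>` so it builds on stable, and takes `IN8` (meaning `IN / 8`) as a separate const parameter for the same `generic_const_exprs` reason.

```rust
// Hypothetical helper: pack x of shape (B, IN) so that out[c][b] holds x[b][8*c .. 8*c + 8],
// i.e. the `[[Simd<f32, 8>; B]; IN / 8]` layout with [f32; 8] standing in for Simd<f32, 8>.
fn pack_input<const B: usize, const IN: usize, const IN8: usize>(
    x: &[[f32; IN]; B],
) -> [[[f32; 8]; B]; IN8] {
    assert!(IN % 8 == 0 && IN8 == IN / 8, "IN must be a multiple of 8 and IN8 == IN / 8");
    let mut out = [[[0.0f32; 8]; B]; IN8];
    for (b, row) in x.iter().enumerate() {
        // chunks_exact(8) yields IN / 8 slices of 8 consecutive inputs from this batch row.
        for (c, chunk) in row.chunks_exact(8).enumerate() {
            out[c][b].copy_from_slice(chunk);
        }
    }
    out
}
```

On nightly with `portable_simd`, each `[f32; 8]` chunk can be turned into a vector with `Simd::from_array`.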