kstavro opened 1 year ago
@kstavro can you try without the array stuff? putting data on the heap should avoid the stack overflow
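For illustration, a minimal sketch of what moving the buffers to the heap could look like. The const names and sizes mirror the ones used later in this thread; the zero-fill and the pointer hand-off are placeholders, not the original example:

```rust
// Sketch: allocate the matmul buffers on the heap with Vec instead of
// fixed-size stack arrays.
use half::f16;

const DST_LEN: usize = 262_144; // m * n
const LHS_LEN: usize = 4096; // m * k
const RHS_LEN: usize = 64; // k * n

fn main() {
    // `let dst = [f16::from_f32(0.0); DST_LEN];` would place ~512 KiB on the
    // stack right here; `vec![...]` puts the same data on the heap instead.
    let lhs = vec![f16::from_f32(0.0); LHS_LEN];
    let rhs = vec![f16::from_f32(0.0); RHS_LEN];
    let mut dst = vec![f16::from_f32(0.0); DST_LEN];

    // The pointer-based gemm API can then be fed with
    // lhs.as_ptr(), rhs.as_ptr() and dst.as_mut_ptr().
    let _ = (lhs.as_ptr(), rhs.as_ptr(), dst.as_mut_ptr());
}
```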
> can you try without the array stuff?
@sarah-ek Could you elaborate a bit on what you mean? Do you mean getting rid of the conversions of the vecs to arrays? Without those, the `gemm` function complains, as it expects arrays as inputs.
By the way, once you explained that I am allocating too much on the stack with the arrays, I tried with smaller params and the overflow problem went away. Unfortunately, now I am getting `(exit code: 0xc000013a, STATUS_CONTROL_C_EXIT)`, e.g. with
```rust
const DST_LEN: usize = 65536;
const LHS_LEN: usize = 4096;
const RHS_LEN: usize = 64;
```
@sarah-ek After having to do a little bit of my own research to understand what everything in the gemm call should mean and to debug the above, I realized that:

- `&mut [T]` or `&[T]` is what is used inside `gemm` for its matmul (but why don't I get a stack overflow there, even if I copy the exact same parameters and generate matrices of the same dimensions?).
- `DST_LEN` (dst presumably standing for destination?) has to be equal to `m*n`. Setting it back to `262144 = 4096*64` made the loop work.

I can confirm a steady 9-10% CPU utilization, like over at candle. Not sure if this has to do with the block/cache optimization of the 5800X3D, which has quite a bit more cache than typical consumer CPUs.
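As a sanity check on that size relation, the three constants actually pin down the shapes: from `LHS_LEN = m*k = 4096`, `RHS_LEN = k*n = 64` and `DST_LEN = m*n = 262144` it follows that `k = 1`, `m = 4096`, `n = 64` (which matches the `k=1` observation below). A tiny sketch of the invariants, assuming plain dense buffers:

```rust
// Size invariants for a dense (m × k) · (k × n) = (m × n) matmul,
// using the constants from this thread.
fn main() {
    let (m, n, k) = (4096usize, 64usize, 1usize);

    assert_eq!(m * k, 4096); // LHS_LEN
    assert_eq!(k * n, 64); // RHS_LEN
    assert_eq!(m * n, 262_144); // DST_LEN must be m*n, not 65536
}
```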
It seems that the CPU utilization bottleneck in the above example is `k=1`. This makes it practically a vector-times-vector product, and so `gevv` is then called.

It seems that `gevv` doesn't implement any parallelism, just SIMD. Maybe it would help to introduce some parallelism there for large vectors, as in the example above? Once I increase `k` to at least 3, I directly go to >96% CPU utilization (`k=8` -> 99% and `k=16` -> 100%).
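For concreteness, what the `k=1` case computes is `dst[i*n + j] = lhs[i] * rhs[j]`, i.e. a rank-1, vector-times-vector product. A naive scalar reference of that computation (not the library's actual `gevv` kernel):

```rust
// Naive scalar reference of the k = 1 case: dst (m × n) = lhs (m × 1) · rhs (1 × n).
// Each output element is a single multiply, so there is almost no arithmetic
// per element of dst that has to be written back to memory.
fn gevv_reference(dst: &mut [f32], lhs: &[f32], rhs: &[f32]) {
    let (m, n) = (lhs.len(), rhs.len());
    assert_eq!(dst.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            dst[i * n + j] = lhs[i] * rhs[j]; // row-major dst
        }
    }
}

fn main() {
    let lhs = vec![1.5f32; 4096]; // m = 4096
    let rhs = vec![2.0f32; 64]; // n = 64
    let mut dst = vec![0.0f32; 4096 * 64];
    gevv_reference(&mut dst, &lhs, &rhs);
    assert!(dst.iter().all(|&x| x == 3.0));
}
```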
I think what happens with llama inference over at candle being stuck at 9% CPU utilization once inference of new tokens starts and the `kv_cache` kicks in, is that with a `kv_cache` most of the matmuls are actually vector-matrix matmuls. I assume `gemv` kicks in there? As far as I can see from the code, `gemv` also relies only on SIMD, which would explain the CPU utilization.
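Again for concreteness, this is the shape of matmul a `kv_cache` decode step tends to produce, i.e. `m = 1`. A naive scalar sketch (not the library's actual `gemv` kernel):

```rust
// Naive scalar sketch of a vector-matrix product (m = 1): dst (1 × n) =
// lhs (1 × k) · rhs (k × n). Each dst[j] reduces over k, but the whole
// k × n matrix is still streamed from memory exactly once.
fn gemv_reference(dst: &mut [f32], lhs: &[f32], rhs: &[f32], k: usize, n: usize) {
    assert_eq!((lhs.len(), rhs.len(), dst.len()), (k, k * n, n));
    for j in 0..n {
        dst[j] = (0..k).map(|p| lhs[p] * rhs[p * n + j]).sum(); // row-major rhs
    }
}

fn main() {
    let (k, n) = (4096, 64);
    let (lhs, rhs) = (vec![1.0f32; k], vec![0.5f32; k * n]);
    let mut dst = vec![0.0f32; n];
    gemv_reference(&mut dst, &lhs, &rhs, k, n);
    assert!(dst.iter().all(|&x| x == 2048.0));
}
```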
Passing along some `gemm` input from candle when inferencing new tokens, for reference (there are some big matmuls as well):
gevv doesn't parallelize because the computation is memory-bound, and doesn't benefit much from parallelism
Ok, I see. And what about `gemv`?
same thing.
Coming here after noticing that CPU inference in the llama example over at candle only utilizes 10% of my CPU (AMD Ryzen 5800X3D). As I mentioned over at the candle repo, this might be because the implementation of `gemm` only really needs to use a certain number of cores due to stack management/limitations? Could the 10% CPU utilization make sense? I have noticed there is another PR in this repo where the number of threads gets upper-bounded, for some reason that might have to do with the stack and that isn't obvious from just glancing at the code. So, not sure if this is related.

I have tried to put together a minimal `gemm` example that simulates the matmul from the llama example by copying all the parameters of the respective `gemm` call that takes place during inference, but I get a stack overflow, so I am already a bit out of my league here, since I have no idea why this happens.

For reference, here is the issue from candle: (huggingface/candle#1103)
And here is the example I tried to recreate, in case you can correct it on the spot, or in case it helps reproduce the low CPU utilization.
A llama gemm attempt (that sadly overflows the stack)
```rust
use gemm_common::Parallelism;
use gemm_f16::gemm::f16::fma::gemm_basic; // I made fma public so that I can import it
use half::f16;
use rand_distr::{Distribution, Normal};
use std::convert::TryInto;

fn convert_to_array