Ideally we shouldn't have to pay as much for prompt tokens.
This PR tries two approaches: expanding the quantized matrix and doing a matmul, or simply iterating over the prompt tokens. My guess is that there is a prompt-length cutoff where one becomes better than the other. Currently there seems to be a bug at some prompt lengths.
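For reference, here is a minimal pure-Python sketch of the two approaches. It is not the code in this PR: the quantization format (one float scale per row with integer values) and every function name below are assumptions made up for illustration.

```python
# Hypothetical quantization format, assumed for illustration only:
# each weight row is stored as (scale, list of integers in [-127, 127]).

def quantize_rows(weights):
    """Quantize each row of a float matrix with its own scale."""
    quantized = []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append((scale, [round(w / scale) for w in row]))
    return quantized


def matmul_expanded(quantized, prompt):
    """Approach 1: expand the whole matrix to floats once, then do a dense matmul."""
    dense = [[scale * q for q in qrow] for scale, qrow in quantized]
    return [
        [sum(w * x for w, x in zip(row, token)) for row in dense]
        for token in prompt
    ]


def matvec_quantized(quantized, token):
    """Quantized matrix-vector product for one token, with no expansion."""
    return [
        scale * sum(q * x for q, x in zip(qrow, token))
        for scale, qrow in quantized
    ]


def matmul_per_token(quantized, prompt):
    """Approach 2: iterate over the prompt tokens and reuse the matvec path."""
    return [matvec_quantized(quantized, token) for token in prompt]
```

Both functions should return the same `n_tokens x out_dim` result up to floating-point rounding. Comparing them over a range of prompt lengths is one possible way to narrow down a length-dependent bug, and timing them over the same range would show where the cutoff sits.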