srush / llama2.rs

A fast llama2 decoder in pure Rust.
MIT License
1.01k stars, 56 forks

Attempt at a Prefill by expanding matrix expansion #15

Closed srush closed 1 year ago

srush commented 1 year ago

Ideally we shouldn't have to pay as much for prompt tokens.

This PR tries two different approaches: one expands the quantized matrix and does a single matmul over the prompt, and the other just iterates over the prompt tokens one at a time. I'm guessing there is a prompt-length cutoff above which one approach beats the other. Currently it seems to have a bug on some prompt lengths.
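To make the two approaches concrete, here is a minimal sketch. It is not the code from this PR: `QMatrix`, its int8-plus-per-row-scale layout, and the function names `prefill_per_token` and `prefill_expanded` are all assumptions made for illustration, and the real quantization format in the repo may differ.

```rust
/// Hypothetical quantized weight matrix (an assumption, not the repo's format):
/// int8 values stored row-major, with one f32 scale per row.
struct QMatrix {
    rows: usize,
    cols: usize,
    q: Vec<i8>,
    scale: Vec<f32>,
}

impl QMatrix {
    /// Expand the quantized weights into a dense row-major f32 matrix.
    fn expand(&self) -> Vec<f32> {
        let mut w = Vec::with_capacity(self.rows * self.cols);
        for r in 0..self.rows {
            for c in 0..self.cols {
                w.push(self.q[r * self.cols + c] as f32 * self.scale[r]);
            }
        }
        w
    }

    /// Matrix-vector product that dequantizes on the fly.
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        (0..self.rows)
            .map(|r| {
                let row = &self.q[r * self.cols..(r + 1) * self.cols];
                let dot: f32 = row.iter().zip(x).map(|(&q, &v)| q as f32 * v).sum();
                dot * self.scale[r]
            })
            .collect()
    }
}

/// Approach 1: iterate over the prompt tokens, one quantized matvec each.
fn prefill_per_token(w: &QMatrix, prompt: &[Vec<f32>]) -> Vec<Vec<f32>> {
    prompt.iter().map(|x| w.matvec(x)).collect()
}

/// Approach 2: expand the matrix once, then multiply it against every token.
fn prefill_expanded(w: &QMatrix, prompt: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let dense = w.expand(); // one-time cost, amortized over the whole prompt
    prompt
        .iter()
        .map(|x| {
            dense
                .chunks(w.cols)
                .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
                .collect()
        })
        .collect()
}
```

In this sketch both functions should return the same activations up to floating-point rounding. The expanded version pays the expansion cost once per call, so it would only be expected to win when the prompt is long enough to amortize that cost, which is the cutoff guessed at above.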