turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Progress on the rewrite for older cards (like the P40) #279

TimyIsCool commented 1 year ago

Was wondering what the current progress is on the rewrite, and whether this issue could be turned into some sort of tracker for it? Optimizations for the P40 seem to be something many people would like.

Ph0rk0z commented 1 year ago

I think V2 is in the works. Not sure if it will have support for the P40, but then again, there's llama.cpp, which runs everything in FP32, and I can run Q5_K_M and Q6 quants on it. If you apply the peer access patch, it even does direct transfers on Linux; with NVLink it's faster than exllama. It has some downsides in how it processes prompts and in memory efficiency, but other than that, you can use it today.
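For anyone who wants to try that route in the meantime, here's a minimal sketch of running a quantized GGUF model through llama.cpp's Python bindings (llama-cpp-python). The model path and prompt are placeholders, and this assumes the bindings were built with CUDA support so the GPU offload actually takes effect:

```python
# Minimal sketch: run a Q5_K_M-quantized model via llama-cpp-python.
# Assumes llama-cpp-python was installed with CUDA support;
# the model path below is a hypothetical placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=2048,       # context window size
)

output = llm(
    "Q: What runs well on a P40? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```

Whether this outperforms exllama on a given setup will depend on the card and interconnect, per the NVLink caveat above.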