TimyIsCool opened 1 year ago
I think V2 is in the works. Not sure if it will have support for the P40, but then again, there's llama.cpp, which computes everything in FP32, and I can run Q5_K_M and Q6_K quants on it. If you apply the peer-access patch, it even does direct GPU-to-GPU transfers on Linux. Over NVLink it's faster than exllama. It has some downsides in how it processes prompts and in memory efficiency, but other than that, you can use it today.
I was wondering what the current progress is on the rewrite, and whether this issue could be turned into some sort of tracker for it? Optimizations for the P40 seem to be something many people would like.