turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

DeepSeek V2 support #443

Open SinanAkkoyun opened 1 month ago

SinanAkkoyun commented 1 month ago

Hey, it would be awesome if https://github.com/deepseek-ai/DeepSeek-V2 could be supported, if it's not too much work; I'd really like to quantize and publish it. (Also, is multi-GPU quantization possible? If not, I don't know if the 236B MoE parameters will even fit to quantize on an 80 GB GPU, given Mixtral's VRAM requirement.)

turboderp commented 1 month ago

It would be a considerable amount of work. I don't think the model is too large to quantize (just slow), but the architecture would require a fair amount of extra code to support, and I'm not sure it's worth it, since inference afterwards would still require a huge amount of VRAM.

You'd need 2x80 GB to run it at 4-bit precision, probably, and while the speed might be impressive compared to a hypothetical dense 236B model, I don't know if there's a lot of demand for that to justify all the effort to get it working. I couldn't even test it locally.
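As a rough back-of-the-envelope check of that estimate (the ~15% overhead figure below is an assumption for illustration, not a measurement):

```python
# Rough VRAM estimate for a 236B-parameter MoE at ~4 bits per weight.
# All experts stay resident for inference, so the full parameter count applies.

params = 236e9            # total parameters
bits_per_weight = 4.0     # target quantization width

weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")              # ~118 GB

# KV cache, activations and quantization metadata come on top; a loose
# ~15% overhead pushes the total well past a single 80 GB card.
print(f"with ~15% overhead: ~{weights_gb * 1.15:.0f} GB")  # ~136 GB
```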

SinanAkkoyun commented 1 month ago

I totally understand, thank you for your assessment.

SinanAkkoyun commented 1 month ago

I just found this Lite version: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite. Of course, only if it's interesting :)

laoda513 commented 3 weeks ago

I can help test if you need. I have 4x 2080 Ti and 4x 3090.
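For reference, if architecture support were added, a multi-GPU test run would follow exllamav2's usual loading path with automatic layer splitting; a minimal sketch, assuming an already-converted model in a hypothetical local directory:

```python
# Minimal sketch of exllamav2's standard autosplit loading path.
# "DeepSeek-V2-Lite-exl2" is a hypothetical directory holding a converted model;
# DeepSeek V2 is not actually supported yet, so this is illustrative only.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "DeepSeek-V2-Lite-exl2"   # hypothetical local path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)     # allocate the cache as layers load
model.load_autosplit(cache)                  # spread layers across all visible GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
print(generator.generate_simple("Hello,", settings, 32))
```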