Open SinanAkkoyun opened 1 month ago
It would be a considerable amount of work. I don't think the model is too large to quantize (just slow), but the architecture would require a bunch more code to support and I'm not sure it's worth it since inference afterwards would still require a huge amount of VRAM.
You'd need 2x80 GB to run it at 4-bit precision, probably, and while the speed might be impressive compared to a hypothetical dense 236B model, I don't know if there's a lot of demand for that to justify all the effort to get it working. I couldn't even test it locally.
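For context, the 2x 80 GB figure follows from simple weights-plus-overhead arithmetic. Here's a back-of-the-envelope sketch; the 1.2 overhead factor (activations, KV cache, buffers) is an assumption, not a measured value:

```python
# Back-of-the-envelope VRAM estimate for serving a model at a given bit width.
# The 1.2 overhead factor is an assumption, not a measurement.
def vram_gb(n_params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"DeepSeek-V2 236B @ 4-bit:      ~{vram_gb(236, 4):.0f} GB")  # ~142 GB -> 2x 80 GB
print(f"DeepSeek-V2-Lite 16B @ 4-bit:  ~{vram_gb(16, 4):.0f} GB")   # ~10 GB
```

By the same arithmetic, the Lite version mentioned below is a different story entirely: it fits comfortably on a single consumer GPU.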
I totally see and understand, thank you for your assessment.
I just found this Lite version: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite (of course, only if it's interesting :))
I can help test if you need. I have 4x 2080 Ti and 4x 3090.
Hey, it would be awesome if https://github.com/deepseek-ai/DeepSeek-V2 were supported, if it's not too much work; I'd really like to quantize and publish it. (Also, is multi-GPU quantization possible? If not, I don't know whether the 236B MoE parameters will even fit on an 80 GB GPU for quantization, given Mixtral's VRAM requirements.)
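To get a feel for why a single 80 GB GPU is tight, here's a weights-only size calculation (a rough sketch; real quantizers also need room for calibration activations on top of this):

```python
# Why quantizing 236B on one 80 GB GPU is tight: the unquantized FP16
# checkpoint alone is several times larger than VRAM, so a quantizer
# would have to stream weights layer by layer rather than hold the
# whole model resident. Sizes below are weights only.
PARAMS = 236e9
fp16_gb = PARAMS * 2 / 1e9    # 2 bytes per weight in FP16
int4_gb = PARAMS * 0.5 / 1e9  # 0.5 bytes per weight at 4-bit
print(fp16_gb, int4_gb)       # 472.0 118.0 -> even the 4-bit output exceeds 80 GB
```

So regardless of the quantizer, the source weights cannot sit entirely in VRAM on one card; they'd have to be processed in chunks.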