turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Support for architecture DeepseekV2ForCausalLM #512

Open RodriMora opened 3 months ago

RodriMora commented 3 months ago

Hi!

When trying to quantize the new DeepSeek Coder V2 https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct I got the following error:

 !! Warning, unknown architecture: DeepseekV2ForCausalLM
 !! Loading as LlamaForCausalLM
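
For reference, the conversion run looked roughly like this (paths and bitrate are placeholders, and the flags are from memory - check `python convert.py --help`):

    # Assumed exllamav2 conversion invocation (illustrative paths):
    #   -i   input HF model directory
    #   -o   working directory for intermediate files
    #   -cf  output directory for the finished quantized model
    #   -b   target bits per weight
    python convert.py -i /models/DeepSeek-Coder-V2-Instruct -o /tmp/exl2_work \
        -cf /models/DeepSeek-Coder-V2-Instruct-4.0bpw-exl2 -b 4.0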

Would it be possible to add support?

turboderp commented 3 months ago

This is the same issue as #443.

The model uses shared experts, which would have to be added to the implementation. Since I don't have the hardware to actually run the model (quantized or otherwise), I'd be developing remotely, which is slow and awkward compared to local development where I have a debugger, profiler and all the other tools available. Not to mention, it would be expensive for the kind of server I would need: something like $100/day, with some big changes (i.e. quite a few days) needed due to the difficulty of calibrating a model with 162 experts per layer.
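
To illustrate what that means, here's a rough PyTorch-style sketch of an MoE block with shared experts (purely illustrative - the names, sizes and simplified expert MLPs are mine, not actual exllamav2 or DeepSeek code):

    # Rough sketch of a shared-experts MoE block (illustrative, not exllamav2/DeepSeek code)
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedExpertMoE(nn.Module):
        def __init__(self, hidden=4096, ffn=1536, n_routed=160, n_shared=2, top_k=6):
            super().__init__()
            # Routed experts: only top_k of them run for any given token
            self.routed = nn.ModuleList(
                nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
                for _ in range(n_routed)
            )
            # Shared experts: a wider MLP applied to every token on top of the routed output
            self.shared = nn.Sequential(
                nn.Linear(hidden, n_shared * ffn), nn.SiLU(), nn.Linear(n_shared * ffn, hidden)
            )
            self.gate = nn.Linear(hidden, n_routed, bias=False)
            self.top_k = top_k

        def forward(self, x):                       # x: (num_tokens, hidden)
            scores = F.softmax(self.gate(x), dim=-1)
            weights, experts = scores.topk(self.top_k, dim=-1)
            out = self.shared(x)                    # shared path runs for every token
            for t in range(x.size(0)):              # naive per-token loop, for clarity only
                for w, e in zip(weights[t], experts[t]):
                    out[t] = out[t] + w * self.routed[int(e)](x[t])
            return out

Every one of those routed experts has to be measured and calibrated, even though each one only ever sees a small fraction of the calibration tokens, which is part of what makes the 162-experts-per-layer case so expensive.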

In the end I don't think there are that many people who could even run the model anyway, or afford to host it, so it seems like a waste of my time. Especially as it's the kind of model that really screams for CPU inference (using llama.cpp or whatever). You could build a fairly cheap CPU server with 256 GB of RAM and probably get quite reasonable speeds that way, since it's a sparse model.
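
For example, something along these lines would run a GGUF quant on CPU only (binary name and flags vary with llama.cpp version - older builds ship `main`, newer ones `llama-cli` - and the model filename here is just a placeholder):

    # Hypothetical CPU-only llama.cpp run of a GGUF quant (adjust model path, threads, context)
    ./llama-cli -m DeepSeek-Coder-V2-Instruct-Q4_K_M.gguf -t 32 -c 4096 -p "Write a quicksort in Python"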

RodriMora commented 3 months ago

Thanks a lot for taking the time to explain.

How about the "Lite" version?

https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct

It's a 15B-parameter model with the same architecture.

turboderp commented 3 months ago

Oh, I didn't know there was a smaller version. That does look more realistic. It would still need a lot of new code, so I'm not sure when exactly I can get to it. But definitely doable.

nktice commented 3 months ago

There are two small ones: one is the aforementioned Instruct, the other is Base - https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Base

nktice commented 3 months ago

@turboderp - Regarding hardware, do you know of Matt Berman? He vlogs on AI - https://www.youtube.com/@matthew_berman - I emailed with him, and it sounds like he knows folks who could help you out (cloud providers, some hardware companies) and would love to help. His contact info is on his YouTube page, feel free to reach out.

matbee-eth commented 3 months ago

Welp, guess I have to stick with some slow-mo GGUF.

sammcj commented 3 months ago

FYI - The Lite Instruct model is amazingly good, easily the best coding model I've used: better than Codestral and much faster (even as a GGUF), but it would really benefit from exllamav2's long-context support and KV cache efficiency.

I have 1x 3090 (24GB) and 2x A4000 (2x16GB); if you need me to test anything or run some builds, feel free to @ me or contact me via my profile.

RodriMora commented 3 months ago

I have a 4x3090, 48-core EPYC, 512GB RAM system and could provide access too if needed.

turboderp commented 3 months ago

Ultimately it's not hardware I need, it's time. I have no doubt that it's the best-model-ever, but so were Yi, Orion, Gemma, Starcoder, GemMoE, Cohere, DBRX, Phi... Granite? All the time I spent on those architectures may or may not have been worth it, but it definitely took time away from other improvements I'd like to make, and there are core aspects of the library that really need attention as well. And with every new architecture I implement just in time for everyone to become disenchanted with it, I also add technical debt.

So yeah, I'm hesitant. :shrug: Maybe. Just not right now, unless someone wants to contribute some code.

sammcj commented 3 months ago

Totally understandable. Thank you for the response. There will be other great models in the future :)