Open RodriMora opened 3 months ago
This is the same issue as #443.
The implementation uses shared experts, which would have to be added to the implementation. Since I don't have the hardware to actually run the model (quantized or otherwise), I'd be developing remotely, which is slow and awkward compared to local development where I have a debugger, profiler and all these other tools available. Not to mention, it would be expensive for the kind of server that I would need. Something like $100/day, with some big changes (i.e. quite a few days) needed due to the difficulty of calibrating a model with 162 experts per layer.
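For anyone curious what "shared experts" means here: in DeepSeek-V2's MoE layers, a few experts are always active for every token, and their output is added to that of the top-k experts picked by the router. This is just a toy NumPy sketch of that idea, with hypothetical names and single-matrix "experts" standing in for real FFN blocks; it is not exllamav2 or DeepSeek code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the real model uses far more routed experts).
d_model, n_shared, n_routed, top_k = 8, 2, 6, 2

# Each "expert" is reduced to a single linear map for illustration.
shared_w = rng.standard_normal((n_shared, d_model, d_model)) * 0.1
routed_w = rng.standard_normal((n_routed, d_model, d_model)) * 0.1
gate_w = rng.standard_normal((d_model, n_routed)) * 0.1

def moe_forward(x):
    # Shared experts run unconditionally, regardless of the router.
    out = sum(x @ shared_w[i] for i in range(n_shared))
    # The router scores the routed experts and keeps the top-k.
    logits = x @ gate_w
    topk = np.argsort(logits)[-top_k:]
    # Softmax over the selected experts' logits gives mixing weights.
    probs = np.exp(logits[topk] - logits[topk].max())
    probs /= probs.sum()
    for w, i in zip(probs, topk):
        out = out + w * (x @ routed_w[i])
    return out

x = rng.standard_normal(d_model)
y = moe_forward(x)
print(y.shape)  # (8,)
```

The practical consequence for quantization is that the shared experts see every token while each routed expert sees only a fraction, which is part of what makes calibration across 162 experts per layer awkward.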
In the end I don't think there are that many people who could even run the model anyway, or afford to host it, so it seems like a waste of my time. Especially as it's the kind of model that really screams for CPU inference (using llama.cpp or whatever). You could build a fairly cheap CPU server with 256 GB of RAM and probably get quite reasonable speeds that way, since it's a sparse model.
Thanks a lot for taking the time to explain.
How about the "Lite" version?
https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
It's a 15B-parameter model with the same architecture.
Oh, I didn't know there was a smaller version. That does look more realistic. It would still need a lot of new code, so I'm not sure when exactly I can get to it. But definitely doable.
There are two small ones: one is the aforementioned Instruct, the other is the Base - https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Base
@turboderp - Regarding hardware, do you know of Matt Berman? He vlogs on AI - https://www.youtube.com/@matthew_berman - I emailed with him, it sounds like he may know folks who can help you out, like cloud providers, and some hardware companies, and would love to help. His contact info is there on the YouTube page, feel free to reach out.
Welp, guess I have to stick with some slow-mo GGUF.
FYI - The lite instruct model is amazingly good, easily the best coding model I've used, better than codestral and much faster (even on GGUF) but would really benefit from exllamav2's long-context and KV efficiency.
I have 1x 3090 (24GB) and 2x A4000 (2x16GB) if you need me to test anything / run some builds feel free to @ me or contact via profile.
I have 4x3090, EPYC 48core and 512GB ram system and could provide access too if needed.
Ultimately it's not hardware I need, it's time. I have no doubt that it's the best-model-ever, but so were Yi, Orion, Gemma, Starcoder, GemMoE, Cohere, DBRX, Phi... Granite? All the time I spent on those architectures may or may not have been worth it, but it definitely took time away from other improvements I'd like to make, and there are core aspects of the library that really need attention as well. And with every new architecture I implement just in time for everyone to become disenchanted with it, I also add technical debt.
So yeah, I'm hesitant. :shrug: Maybe. Just not right now, unless someone wants to contribute some code.
Totally understandable. Thank you for the response. There will be other great models in the future :)
Hi!
When trying to quantize the new DeepSeek Coder V2 https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct I got the following error:
Would it be possible to add support?