mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Model] Add support for DeepSeek-V2 Model #2972

Closed · rickzx closed 1 month ago

rickzx commented 1 month ago

This PR implements the DeepSeek-V2 Model architecture: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite/blob/main/modeling_deepseek.py.

The notable changes from the common LLM architecture include (a minimal sketch of both follows below):

- Multi-head Latent Attention (MLA), which caches a compressed low-rank latent per token instead of full per-head keys and values
- DeepSeekMoE feed-forward layers, which combine always-active shared experts with top-k routed experts
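For intuition, here is a minimal, self-contained sketch of those two pieces in plain PyTorch. It is not mlc-llm's TVM-based implementation: the dimensions are made up, and plain linear layers stand in for the real SwiGLU experts.

```python
# Illustrative sketch only: made-up dimensions, Linear layers standing in
# for the real expert MLPs, no RoPE or attention math.
import torch

d_model, n_heads, d_head, d_latent = 1024, 8, 64, 256

# --- Multi-head Latent Attention: cache one small latent per token ---
w_dkv = torch.randn(d_latent, d_model) * 0.02          # down-projection
w_uk = torch.randn(n_heads * d_head, d_latent) * 0.02  # latent -> keys
w_uv = torch.randn(n_heads * d_head, d_latent) * 0.02  # latent -> values

def compress_kv(h):        # h: [seq, d_model]
    # Only this latent enters the KV cache, so per-token cache cost drops
    # from 2 * n_heads * d_head values to d_latent values.
    return h @ w_dkv.T     # [seq, d_latent]

def expand_kv(c):          # c: [seq, d_latent]
    k = (c @ w_uk.T).view(-1, n_heads, d_head)
    v = (c @ w_uv.T).view(-1, n_heads, d_head)
    return k, v

# --- DeepSeekMoE-style FFN: shared experts plus top-k routed experts ---
n_routed, n_shared, top_k = 8, 2, 2
routed = [torch.nn.Linear(d_model, d_model) for _ in range(n_routed)]
shared = [torch.nn.Linear(d_model, d_model) for _ in range(n_shared)]
gate = torch.nn.Linear(d_model, n_routed, bias=False)

def moe_ffn(h):            # h: [seq, d_model]
    base = sum(e(h) for e in shared)           # shared experts see every token
    probs = gate(h).softmax(dim=-1)            # routing probabilities
    weight, idx = probs.topk(top_k, dim=-1)    # pick top-k routed experts
    rows = []
    for t in range(h.shape[0]):                # naive per-token dispatch
        extra = sum(w * routed[int(i)](h[t]) for w, i in zip(weight[t], idx[t]))
        rows.append(base[t] + extra)
    return torch.stack(rows)

h = torch.randn(4, d_model)
k, v = expand_kv(compress_kv(h))
print(k.shape, moe_ffn(h).shape)  # torch.Size([4, 8, 64]) torch.Size([4, 1024])
```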

Example execution on M2 Ultra:

% mlc_llm chat ../models/DeepSeek-V2-Lite-Chat-MLC-q0f16 --model-lib ../models/DeepSeek-V2-Lite-Chat-MLC-q0f16/model.dylib
>>> who are you?
 I am an AI assistant created by DeepSeek to be helpful and harmless.
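The same session can also be driven from Python via MLC's OpenAI-style engine API. A rough equivalent is below; the paths are copied from the command above, and the exact MLCEngine keyword arguments are an assumption that may vary across MLC versions.

```python
# Rough Python-API equivalent of the CLI chat session above; model and
# model_lib paths are taken from that command, and the keyword arguments
# are assumptions based on MLC's OpenAI-style MLCEngine API.
from mlc_llm import MLCEngine

model = "../models/DeepSeek-V2-Lite-Chat-MLC-q0f16"
engine = MLCEngine(
    model,
    model_lib="../models/DeepSeek-V2-Lite-Chat-MLC-q0f16/model.dylib",
)
for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": "who are you?"}],
    model=model,
    stream=True,
):
    for choice in chunk.choices:
        if choice.delta.content:
            print(choice.delta.content, end="", flush=True)
print()
engine.terminate()
```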

TODO:

fengyang95 commented 1 month ago

Hello @rickzx, I noticed that the PR has been merged and wanted to ask whether the TODO items mentioned have been completed. Is support now available for DeepSeek-V2 (not the Lite version)?

rickzx commented 1 month ago

> Hello @rickzx, I noticed that the PR has been merged and wanted to ask whether the TODO items mentioned have been completed. Is support now available for DeepSeek-V2 (not the Lite version)?

Hi @fengyang95, the TODO items mentioned above haven't been completed yet. I plan to finish them ASAP, hopefully by the end of the week. The non-Lite version is not supported in the current version.

fengyang95 commented 1 month ago

>> Hello @rickzx, I noticed that the PR has been merged and wanted to ask whether the TODO items mentioned have been completed. Is support now available for DeepSeek-V2 (not the Lite version)?
>
> Hi @fengyang95, the TODO items mentioned above haven't been completed yet. I plan to finish them ASAP, hopefully by the end of the week. The non-Lite version is not supported in the current version.

Hi @rickzx, is there any update? Looking forward to using the non-Lite version soon.

dylanlanigansmith commented 1 month ago

Hey! Really excited to try this model out (DeepSeek-V2-Lite), but on the two CUDA machines I have tried it on, I get this error as soon as I try to run inference:

terminate called after throwing an instance of N3tvm7runtime5ErrorE
  what(): TVMError: parallel_for failed: cudaErrorStreamCaptureImplicit: operation would make the legacy stream depend on a capturing blocking stream
Stack trace:
  [bt] (0) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff8ec90830]
  [bt] (1) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(TVMThrowLastError+0x4ac) [0xffff8ec2c00c]
  [bt] (2) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(+0x29dad8) [0xffff8ec8dad8]
  [bt] (3) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x84) [0xffff8ed31e14]
  [bt] (4) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x794) [0xffff8ed350a4]
  [bt] (5) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x238) [0xffff8ed358e8]
  [bt] (6) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x1b8) [0xffff8ed35e78]
  [bt] (7) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(+0x346684) [0xffff8ed36684]
  [bt] (8) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x2e8) [0xffff8ed32078]

For reference, it is compiled with q0f16 and I am using the latest MLC. The model loads fine, but as soon as I send a generation request it consistently crashes. Let me know if I can provide more info.
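For context on the error itself: cudaErrorStreamCaptureImplicit is what CUDA returns when work is issued on the legacy default stream while another stream is being captured into a CUDA graph. A hypothetical standalone repro in plain PyTorch (unrelated to mlc-llm's actual code path, and requiring a CUDA GPU) looks roughly like this:

```python
# Hypothetical illustration (plain PyTorch, not mlc-llm) of the error:
# launching work on the legacy default stream while another stream is
# being captured into a CUDA graph invalidates the capture.
import torch

assert torch.cuda.is_available()
x = torch.ones(4, device="cuda")
graph = torch.cuda.CUDAGraph()
side_stream = torch.cuda.Stream()

try:
    with torch.cuda.stream(side_stream):
        graph.capture_begin()
        y = x * 2                      # fine: issued on the capturing stream
        with torch.cuda.stream(torch.cuda.default_stream()):
            z = x + 1                  # legacy default stream during capture
        graph.capture_end()
except RuntimeError as err:
    # Expect the same complaint as in the report above: "operation would
    # make the legacy stream depend on a capturing blocking stream"
    print(err)
```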