Closed rickzx closed 1 month ago
Hello @rickzx , I noticed that the PR has been merged, and I wanted to ask if the todo items mentioned have been completed. Is support now available for deepseek-v2 (not the lite version)?
Hi @fengyang95, the TODO items mentioned above haven't been completed yet. I plan to finish them as soon as possible, hopefully by the end of the week. The non-lite version is not supported in the current version.
Hi @rickzx, is there any update on this? Looking forward to using the non-lite version soon.
Hey! Really excited to try this model out (deepseek-v2-lite), but on the two CUDA machines I have tried it on, I get this error as soon as I try to run inference:
```
terminate called after throwing an instance of N3tvm7runtime5ErrorE
what(): TVMError: parallel_for failed: cudaErrorStreamCaptureImplicit: operation would make the legacy stream depend on a capturing blocking stream
Stack trace:
[bt] (0) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff8ec90830]
[bt] (1) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(TVMThrowLastError+0x4ac) [0xffff8ec2c00c]
[bt] (2) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(+0x29dad8) [0xffff8ec8dad8]
[bt] (3) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x84) [0xffff8ed31e14]
[bt] (4) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x794) [0xffff8ed350a4]
[bt] (5) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x238) [0xffff8ed358e8]
[bt] (6) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x1b8) [0xffff8ed35e78]
[bt] (7) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(+0x346684) [0xffff8ed36684]
[bt] (8) /home/dylan/mlc-llm/build/tvm/libtvm_runtime.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x2e8) [0xffff8ed32078]
```
For reference, the model is compiled with q0f16 and I am using the latest MLC. It loads fine, but it consistently crashes as soon as I send a generation request. Let me know if I can provide more info.
This PR implements the DeepSeek-V2 model architecture: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite/blob/main/modeling_deepseek.py.
The notable changes from the common LLM architecture include:
Example execution on M2 Ultra:
TODO:
- group_limited_greedy strategy
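For context, the group_limited_greedy item in the TODO refers to DeepSeek-V2's group-limited top-k expert routing: the routed experts are partitioned into groups, only the topk_group groups with the highest per-group maximum affinity score are kept, and the final top-k experts are then selected from those surviving groups. A minimal NumPy sketch of that selection step (the function name, argument names, and shapes here are illustrative, not the actual MLC or HF implementation):

```python
import numpy as np

def group_limited_greedy(scores, n_group, topk_group, top_k):
    """Sketch of group-limited top-k routing.

    scores: (num_tokens, num_experts) router affinities.
    Experts are split into n_group contiguous groups; only the
    topk_group groups with the highest per-group max score survive,
    and top_k experts are picked from those groups.
    """
    num_tokens, num_experts = scores.shape
    group_size = num_experts // n_group

    # Score each group by its best expert.
    group_scores = scores.reshape(num_tokens, n_group, group_size).max(axis=-1)

    # Keep only the topk_group best groups per token.
    best_groups = np.argsort(group_scores, axis=-1)[:, -topk_group:]
    group_mask = np.zeros_like(group_scores)
    np.put_along_axis(group_mask, best_groups, 1.0, axis=-1)

    # Expand the group mask back to per-expert granularity and mask
    # out experts in discarded groups.
    expert_mask = np.repeat(group_mask, group_size, axis=-1)
    masked_scores = np.where(expert_mask > 0, scores, -np.inf)

    # Final top-k expert indices among the surviving groups.
    return np.argsort(masked_scores, axis=-1)[:, -top_k:]
```

For example, with 8 experts in 4 groups and topk_group=2, experts in the two weakest groups can never be selected even if an individual expert there outscores one in a kept group.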