🐛 Bug
When loading DeepSeek V2 Lite, weight conversion and compilation succeed, but the first attempt at inference always fails with the same error. My suspicion is that there is an issue somewhere in the Relax implementation of the model.
To Reproduce
Steps to reproduce the behavior:
1. Download DeepSeek-V2-Lite from Hugging Face
2. Use the MLC CLI to convert the weights, generate the config, and compile the model library
3. Attempt inference via `mlc_llm chat`, the `MLCEngine` Python API, or another entry point (see the sketches after this list)
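For concreteness, the commands behind steps 1–3 looked like this — a sketch assuming the standard MLC-LLM convert/compile workflow, where the local paths, the `q4f16_1` quantization, and the `deepseek_v2` conversation-template name are stand-ins rather than confirmed specifics:

```bash
# Fetch the model from Hugging Face (requires git-lfs).
git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite ./dist/models/DeepSeek-V2-Lite

# Convert the weights to MLC format (quantization is an assumption).
mlc_llm convert_weight ./dist/models/DeepSeek-V2-Lite \
    --quantization q4f16_1 \
    -o ./dist/DeepSeek-V2-Lite-q4f16_1-MLC

# Generate mlc-chat-config.json (template name is an assumption).
mlc_llm gen_config ./dist/models/DeepSeek-V2-Lite \
    --quantization q4f16_1 \
    --conv-template deepseek_v2 \
    -o ./dist/DeepSeek-V2-Lite-q4f16_1-MLC

# Compile the model library for CUDA.
mlc_llm compile ./dist/DeepSeek-V2-Lite-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda \
    -o ./dist/libs/DeepSeek-V2-Lite-q4f16_1-cuda.so

# First inference attempt -- this is where it fails.
mlc_llm chat ./dist/DeepSeek-V2-Lite-q4f16_1-MLC \
    --model-lib ./dist/libs/DeepSeek-V2-Lite-q4f16_1-cuda.so
```

The `MLCEngine` path fails the same way; a minimal Python sketch using the paths assumed above:

```python
# Minimal sketch of the failing inference attempt via the Python API.
from mlc_llm import MLCEngine

model = "./dist/DeepSeek-V2-Lite-q4f16_1-MLC"
engine = MLCEngine(
    model=model,
    model_lib="./dist/libs/DeepSeek-V2-Lite-q4f16_1-cuda.so",
)

# Engine creation succeeds; the error surfaces on the first generation request.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()
```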
Expected behavior
I load the model and it gives me tokens back
Environment
- Platform (e.g. WebGPU/Vulkan/iOS/Android/CUDA): CUDA
- Operating system (e.g. Ubuntu/Windows/macOS/...): Debian and Ubuntu
- How you installed MLC-LLM (conda, source): Tried both pip and source
- How you installed TVM-Unity (pip, source): Tried both pip and source
- Python version (e.g. 3.10): 3.10
- CUDA/cuDNN version (if applicable): Tried both 12.4 and 12.6
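For the pip route, the installs followed the documented nightly wheels — a sketch, assuming the CUDA 12.4 packages (the `cu124` suffix would change to match the 12.6 toolkit tested):

```bash
# Nightly wheels from the MLC wheel index; package names assume CUDA 12.4.
python -m pip install --pre -U -f https://mlc.ai/wheels \
    mlc-ai-nightly-cu124 mlc-llm-nightly-cu124
```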