Closed: dengzheng-cloud closed this issue 1 month ago.
We did not support quantization of DeepSeek on vLLM v0.5.4. Please update your version of vLLM.
It did help: vLLM 0.6.3 does work. CUDA graph capture causes an OOM error on 8xA100, so the default GPU memory utilization may need to be lowered; with enforce-eager mode the model generates correctly.
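For anyone hitting the same OOM, a minimal sketch of that workaround using vLLM's offline LLM API; the model path, tensor-parallel size, and 0.85 memory fraction are placeholders for this report's setup, not verified values:

```python
from vllm import LLM, SamplingParams

# Sketch of the workaround above: disable CUDA graph capture with
# enforce_eager and lower gpu_memory_utilization from its 0.90 default.
# /model_path and 0.85 are placeholder values.
llm = LLM(
    model="/model_path",          # quantized DeepSeek-V2.5 checkpoint
    trust_remote_code=True,
    tensor_parallel_size=8,       # 8xA100 in the comment above
    enforce_eager=True,           # skip CUDA graphs to avoid the OOM
    gpu_memory_utilization=0.85,  # leave headroom vs. the default
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

The equivalent flags on vllm serve are --enforce-eager and --gpu-memory-utilization.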
Describe the bug
I used examples/quantizing_moe/deepseek_moe_w4a16.py to quantize DeepSeek-V2.5 (1011 GB, 55 tensor shards) down to 112 GB (24 tensor shards), then ran it with vllm==0.5.4 and got an error like:

```
File "/data/miniconda3/envs/vllm0.5.4/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 186, in __init__
    assert self.quant_method is not None
AssertionError
/data/miniconda3/envs/vllm0.5.4/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

But after checking config.json, quant_method is compressed-tensors, the same as in https://huggingface.co/nm-testing/DeepSeek-V2.5-W4A16/blob/main/config.json.
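A minimal sketch of that config check, assuming the quantized checkpoint lives at /model_path (a placeholder):

```python
import json

# Sketch of the config.json check above; /model_path is a placeholder
# for the quantized checkpoint directory.
with open("/model_path/config.json") as f:
    cfg = json.load(f)

# vLLM picks its quantization backend from this field; for llm-compressor
# output it should read "compressed-tensors".
print(cfg["quantization_config"]["quant_method"])
```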
Expected behavior
vLLM should load the quantized model; instead it fails with the error above.
Environment
- commit: f7245c8
- version: 0.2.0

To Reproduce
Exact steps to reproduce the behavior:
1. python examples/quantizing_moe/deepseek_moe_w4a16.py, with the dataset changed to load locally (see the sketch below).
2. vllm serve /model_path --trust-remote-code --tensor-parallel-size 2
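Step 1's dataset change was roughly of this shape: a sketch assuming a local JSON calibration file; the path and split are placeholders, not the script's actual values:

```python
from datasets import load_dataset

# Sketch of replacing the example's remote calibration dataset with a
# local file; /data/calibration.json is a placeholder path.
ds = load_dataset("json", data_files="/data/calibration.json", split="train")
print(ds[0])  # inspect one calibration sample
```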
Errors

```
File "/data/miniconda3/envs/vllm0.5.4/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 186, in __init__
    assert self.quant_method is not None
AssertionError
/data/miniconda3/envs/vllm0.5.4/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
Additional context
I swapped in a different dataset during quantization; oneshot still saved the model, so I assumed that was fine. If it's not, please let me know. Thanks!
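For reference, the oneshot call in question is of roughly this shape: a rough paraphrase of the llm-compressor 0.2.x example flow, where the dataset path, recipe file, output directory, and sample counts are all placeholders, not the script's actual values:

```python
from datasets import load_dataset
from llmcompressor.transformers import oneshot

# Rough paraphrase of the example script's flow (llm-compressor 0.2.x API);
# the dataset path, recipe file, and output dir below are placeholders.
ds = load_dataset("json", data_files="/data/calibration.json", split="train")

oneshot(
    model="deepseek-ai/DeepSeek-V2.5",
    dataset=ds,                           # the locally loaded calibration set
    recipe="deepseek_w4a16_recipe.yaml",  # hypothetical W4A16 recipe file
    output_dir="/model_path",             # oneshot saves the checkpoint here
    max_seq_length=2048,
    num_calibration_samples=512,
)
```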
I've attached a screenshot of the quantization result; I removed the vLLM verification step from the script.