vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

deepseek v2.5 quantization example can't run with vllm==0.5.4 #857

Closed dengzheng-cloud closed 1 month ago

dengzheng-cloud commented 1 month ago

Describe the bug
I used examples/quantizing_moe/deepseek_moe_w4a16.py to quantize DeepSeek-V2.5 (1011 GB, 55 tensor files) down to 112 GB (24 tensor files), then tried to run it with vllm==0.5.4 and got:

```
File "/data/miniconda3/envs/vllm0.5.4/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 186, in __init__
    assert self.quant_method is not None
AssertionError
/data/miniconda3/envs/vllm0.5.4/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

However, after checking config.json, quant_method is compressed-tensors, which is the same as https://huggingface.co/nm-testing/DeepSeek-V2.5-W4A16/blob/main/config.json.

Expected behavior
vLLM should load the quantized model; instead it fails with the error above.

Environment Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]:
  2. Python version [e.g. 3.7]: 3.10
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: 0.2.0
  4. ML framework version(s) [e.g. torch 2.3.1]: torch 2.4.1
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]: vLLM 0.5.4, compressed-tensors 0.7.0
  6. Other relevant environment information [e.g. hardware, CUDA version]: CUDA 12.1

To Reproduce
Exact steps to reproduce the behavior:
1. python examples/quantizing_moe/deepseek_moe_w4a16.py (with the dataset changed to a local load)
2. vllm serve /model_path --trust-remote-code --tensor-parallel-size 2
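For the "local load" change in step 1, a minimal stdlib-only sketch of preparing a local calibration file is below. The file name, record shape, and sample texts are all assumptions for illustration, not taken from the example script:

```python
import json
import os
import tempfile

# Hypothetical calibration samples; real calibration text would come from
# the user's own corpus rather than these placeholders.
samples = [{"text": "Example calibration prompt %d" % i} for i in range(4)]

# Write them as JSON lines, a format local dataset loaders commonly accept.
path = os.path.join(tempfile.gettempdir(), "calibration.jsonl")
with open(path, "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

# Read the records back the way a local-load pipeline would consume them.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # number of calibration records written and re-read
```

With the `datasets` library installed, a file like this could then be loaded via `load_dataset("json", data_files=path, split="train")` and passed to the example's oneshot call in place of the Hub dataset (parameter names should be verified against your llm-compressor version).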

Errors

```
File "/data/miniconda3/envs/vllm0.5.4/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 186, in __init__
    assert self.quant_method is not None
AssertionError
/data/miniconda3/envs/vllm0.5.4/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

Additional context
I swapped in a different dataset during quantization. Since oneshot still saved the model, I assumed that was fine; if it isn't, please let me know. Thanks!

Adding a screenshot of the quantization result (I removed the vLLM verification step): [screenshot]

robertgshaw2-neuralmagic commented 1 month ago

We did not support DeepSeek quantization in vLLM v0.5.4. Please update your version of vLLM.

dengzheng-cloud commented 4 weeks ago

That helped: vLLM 0.6.3 does work. CUDA graph capture causes an OOM error on 8x A100, so the default GPU memory utilization may need to be lowered; with enforce-eager enabled, generation works correctly.
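For reference, the workaround described above might look like the following serve invocation. The flags are standard vLLM CLI options, but the model path, tensor-parallel size, and the 0.85 value are placeholders to adjust for your setup:

```shell
# Sketch of a serve command for the quantized model on 8x A100.
# --enforce-eager disables CUDA graph capture (sidesteps the OOM reported
# above); alternatively, lowering --gpu-memory-utilization below its 0.9
# default leaves headroom for graph capture without eager mode.
vllm serve /model_path \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager
```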