sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[BUG] Marlin model quantized with AutoGPTQ is not loadable #289

Closed · Qubitium closed this issue 7 months ago

Qubitium commented 7 months ago

@qeternity The Marlin kernel was merged in PR #286, but when is it actually used?

I have tested a Marlin-quantized Llama 2 model that works on vLLM, but it fails to load on the latest sglang tip:

Traceback (most recent call last):
  File "/root/miniconda3/envs/marlin/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/marlin/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 619, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/root/miniconda3/envs/marlin/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 70, in exposed_init_model
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/root/miniconda3/envs/marlin/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 280, in __init__
    self.load_model()
  File "/root/miniconda3/envs/marlin/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 313, in load_model
    model.load_weights(
  File "/root/miniconda3/envs/marlin/lib/python3.11/site-packages/sglang/srt/models/llama2.py", line 318, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'model.layers.0.mlp.down_proj.B'
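
For context, this is my reading of the traceback rather than SGLang's actual code: the loader in llama2.py builds a dict of the model's own parameter names and then indexes it with each tensor name found in the checkpoint, and the Marlin-specific ".B" tensors written by AutoGPTQ have no matching entry. A minimal sketch of that lookup pattern, with illustrative names only:

from typing import Dict
import torch

def load_weights_sketch(model: torch.nn.Module,
                        checkpoint: Dict[str, torch.Tensor]) -> None:
    # Table of the model's own parameters, keyed by fully qualified name.
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in checkpoint.items():
        # An AutoGPTQ Marlin checkpoint stores packed weights under names such
        # as "model.layers.0.mlp.down_proj.B"; a GPTQ-oriented mapping never
        # registers a parameter with that suffix, so this lookup raises KeyError.
        param = params_dict[name]
        param.data.copy_(loaded_weight)
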
qeternity commented 7 months ago

Which model did you test? I've been running an SGLang Marlin branch since the kernels were merged into vLLM.

Can you try one of my Marlin models, for example: https://huggingface.co/qeternity/Nous-Hermes-2-Mistral-7B-DPO-GPTQ-4bit-128g-actorder_False-Marlin

Qubitium commented 7 months ago

@qeternity Confirmed: this is a compatibility bug that hits Marlin models quantized with AutoGPTQ. @liurl21 will push a PR within the hour to fix loading of AutoGPTQ Marlin quants. There appear to be two different conventions for how quant_method is set (roughly as sketched below); I'm not sure which one is the "standard", but we will push the PR soon.
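
To illustrate the two conventions (the checkpoint_format field name below is an assumption based on AutoGPTQ's config layout, not something confirmed in this thread): some checkpoints declare quant_method as "marlin" directly, while AutoGPTQ keeps quant_method as "gptq" and marks the Marlin serialization separately. A hypothetical check that accepts both might look like:

def is_marlin_checkpoint(quant_config: dict) -> bool:
    # Convention A: the config declares quant_method = "marlin" outright.
    # Convention B (assumed AutoGPTQ style): quant_method stays "gptq" and a
    # separate field such as checkpoint_format = "marlin" flags the layout.
    method = quant_config.get("quant_method", "")
    fmt = quant_config.get("checkpoint_format", "")
    return method == "marlin" or (method == "gptq" and fmt == "marlin")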

Qubitium commented 7 months ago

PR #290 has been created, which fixes this compat issue.

AutoGPTQ's direct Marlin quantization support is actually broken. Use my pending PR https://github.com/AutoGPTQ/AutoGPTQ/pull/586 to quantize with it.
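
For anyone reproducing this, the standard AutoGPTQ quantization flow is roughly the sketch below. The model id, output directory, and calibration text are placeholders; how the pending PR enables Marlin output (presumably an extra option on the quantize config) is not shown, since that flag is specific to the PR.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Placeholder paths; substitute the model you want to quantize.
pretrained_model_dir = "meta-llama/Llama-2-7b-hf"
quantized_model_dir = "llama-2-7b-gptq-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# 4-bit, group size 128, act-order disabled: the Marlin-compatible settings
# used by the models discussed in this thread.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# A single short calibration example keeps the sketch minimal; real runs
# should use a proper calibration set.
examples = [tokenizer("SGLang is a fast serving framework for LLMs.")]
model.quantize(examples)
model.save_quantized(quantized_model_dir)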