sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

How to launch server with quantized model? #302

Closed: Gintasz closed this issue 3 months ago

Gintasz commented 8 months ago

Sorry for a newb question; I couldn't find an answer. I succeeded in launching the server with the unquantised Mistral-7B:

python3 -m sglang.launch_server --model-path mistralai/Mistral-7B-Instruct-v0.2 --port 42069 --host 0.0.0.0

I'm trying to launch a quantised model like this:

python3 -m sglang.launch_server --model-path TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True --port 42069 --host 0.0.0.0

I get this error:

root@C.10082074:~$ python3 -m sglang.launch_server --model-path TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True --port 42069 --host 0.0.0.0
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 164, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 11, in <module>
    launch_server(server_args, None)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/server.py", line 430, in launch_server
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 93, in __init__
    self.hf_config = get_config(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 33, in get_config
    config = AutoConfig.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1111, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 633, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 688, in _get_config_dict
    resolved_config_file = cached_file(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

I tried to download the repository locally and then specify the directory as the path:

git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ
python3 -m sglang.launch_server --model-path Mistral-7B-v0.1-GPTQ --port 42069 --host 0.0.0.0

But then I get:

Rank 0: load weight begin.
quant_config: GPTQConfig(weight_bits=4, group_size=32, desc_act=True)
Process Process-1:
Traceback (most recent call last):
router init state: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 606, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 62, in exposed_init_model
    self.model_runner = ModelRunner(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 275, in __init__
    self.load_model()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 308, in load_model
    model.load_weights(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 290, in load_weights
    for name, loaded_weight in hf_model_weights_iterator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 251, in hf_model_weights_iterator
    with safe_open(st_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

detoken init state: init ok
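Side note: a git clone made without git-lfs installed leaves small pointer stubs in place of the real safetensors files, which is one known cause of the HeaderTooLarge error above. An alternative that avoids git-lfs entirely is to fetch the revision with huggingface_hub and pass the local path to --model-path (a sketch, assuming huggingface_hub is installed):

# Sketch: download a specific Hub revision without git/git-lfs,
# then point --model-path at the returned directory.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/Mistral-7B-v0.1-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
)
print(local_dir)  # use this path as --model-path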
KMouratidis commented 8 months ago

I also cannot use Mixtral with GPTQ and AWQ. The errors I typically get are (respectively):

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

and

/home/runner/work/vllm/vllm/csrc/quantization/awq/gemm_kernels.cu:46: void vllm::awq::gemm_forward_4bit_cuda_m16nXk32(int, int, __half *, int *, __half *, int *, int, int, int, __half *) [with int N = 128]: block: [25,0,0], thread: [18,1,0] Assertion `false` failed.

Here are the quantized models I tried:

Gintasz commented 8 months ago

@KMouratidis can you share the full list of commands showing how you pulled the weights of the Mixtral model and passed them to sglang?

KMouratidis commented 8 months ago

@KMouratidis can you share the full list of commands showing how you pulled the weights of the Mixtral model and passed them to sglang?

I didn't; the server automatically downloaded the model when launched. The command I used was:

python3 -m sglang.launch_server --model-path $MODEL --tp-size 8 --mem-fraction-static 0.85
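For context, --tp-size 8 shards the model across 8 GPUs with tensor parallelism, and --mem-fraction-static sets the fraction of GPU memory reserved for the model weights and KV cache pool; lowering it can help if you hit out-of-memory errors.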

Gintasz commented 8 months ago

@KMouratidis yeah, and what was your $MODEL value? I tried the command below and got the model path name error from the original post:

python3 -m sglang.launch_server --model-path TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True --port 42069 --host 0.0.0.0

KMouratidis commented 8 months ago

Any one of the 3 I mentioned in my first comment.

bradfox2 commented 7 months ago

Mixtral AWQ works with https://huggingface.co/casperhansen/mixtral-instruct-awq on an A100 80G.
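A launch command along the lines of the others in this thread would be (untested sketch; port is arbitrary):

python3 -m sglang.launch_server --model-path casperhansen/mixtral-instruct-awq --port 30000 --host 0.0.0.0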

ValeKnappich commented 7 months ago

Quoting the original post:

python3 -m sglang.launch_server --model-path TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True --port 42069 --host 0.0.0.0

huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'TheBloke/Mistral-7B-v0.1-GPTQ:gptq-4bit-32g-actorder_True'.
Specifying the revision like this won't work: the revision is hard-coded to None in sglang's model loading code.
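For reference, transformers itself accepts a revision argument, so the limitation is only in how sglang invokes it. A minimal illustration of the transformers API (not sglang's actual code):

from transformers import AutoConfig

# transformers can pin a specific Hub revision directly; sglang's
# get_config currently does not pass one through.
config = AutoConfig.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
)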

And quoting the local-clone attempt:

git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ
python3 -m sglang.launch_server --model-path Mistral-7B-v0.1-GPTQ --port 42069 --host 0.0.0.0

safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

Cloning the model repo with the desired revision and loading the model from the local path works fine on my end with the exact same command you provided (sglang 0.1.13).

davideuler commented 7 months ago

sglang works with Mixtral-8x7B-Instruct-v0.1-GPTQ on an A100 in my test.

python -m sglang.launch_server --model-path ./Mixtral-8x7B-Instruct-v0.1-GPTQ/ --mem-fraction-static 0.9 --port 8501 --host 0.0.0.0 --context-length 8192 --trust-remote-code
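Once the server is up, a quick sanity check against sglang's /generate endpoint (payload format as in the sglang README; port matching the command above):

import requests

# Minimal smoke test: ask the server for a short greedy completion.
resp = requests.post(
    "http://localhost:8501/generate",
    json={
        "text": "Once upon a time,",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(resp.json())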
github-actions[bot] commented 3 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.