sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Feature] GGUF support #1616

Open · remixer-dec opened 1 month ago

remixer-dec commented 1 month ago

Motivation

Hi! Since the .gguf format is already supported by vLLM, would it be possible to add support for it in the SGLang server?

Related resources

No response

merrymercy commented 1 month ago

It should be easy. Could you give us an example of the command you would like us to support?

remixer-dec commented 1 month ago

python -m sglang.launch_server --model-path /path/to/model.gguf

hahmad2008 commented 1 week ago

@remixer-dec @merrymercy
How do I serve a GGUF model? vLLM provides a tutorial for this: https://docs.vllm.ai/en/latest/getting_started/examples/gguf_inference.html

But how do I run inference with SGLang? @merrymercy, could you please provide a command for that? For example, for this model: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
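
For reference, the vLLM tutorial linked above boils down to roughly the following (a minimal sketch, not an SGLang command; the GGUF filename and sampling settings are illustrative, and vLLM takes the tokenizer from the original non-GGUF repo):

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

# Download a single quantization file from the GGUF repo.
model_file = hf_hub_download(
    "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
)

# vLLM loads the .gguf file directly, but needs the tokenizer from the
# original (non-quantized) model repo.
llm = LLM(model=model_file, tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, max_tokens=32))
print(outputs[0].outputs[0].text)

Something equivalent for SGLang is what this issue is asking for.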

XYZliang commented 1 week ago

> python -m sglang.launch_server --model-path /path/to/model.gguf

If you haven't tried it, please don't reply. This doesn't work at all.

python -m sglang.launch_server --model-path /home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf --port 30000 --mem-fraction-static 0.8
WARNING 11-08 15:09:00 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
[2024-11-08 15:09:08] server_args=ServerArgs(model_path='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', tokenizer_path='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=879353602, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
WARNING 11-08 15:09:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 11-08 15:09:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
[2024-11-08 15:09:22] Traceback (most recent call last):
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/utils/hub.py", line 403, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/managers/detokenizer_manager.py", line 216, in run_detokenizer_process
    manager = DetokenizerManager(server_args, port_args)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/managers/detokenizer_manager.py", line 72, in __init__
    self.tokenizer = get_tokenizer(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 129, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 844, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 676, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/utils/hub.py", line 469, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

remixer-dec commented 1 week ago

@XYZliang maybe read the previous comment? It's not supposed to be working; I replied with the command I was asked for.

XYZliang commented 1 week ago

> @XYZliang maybe read the previous comment? It's not supposed to be working; I replied with the command I was asked for.

Sorry, I didn't pay attention to which user had commented...

whk6688 commented 2 days ago

It doesn't work. The error is:

(python311) whk@VM-2-13-ubuntu:~/code/qwen25-3b$ python -m sglang.launch_server --model-path Qwen2.5-3B-Instruct-q5_k_m.gguf --port 8075 --host 0.0.0.0 --mem-fraction-static 0.2 --chat-template template.json
[2024-11-14 11:42:24] server_args=ServerArgs(model_path='Qwen2.5-3B-Instruct-q5_k_m.gguf', tokenizer_path='Qwen2.5-3B-Instruct-q5_k_m.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='Qwen2.5-3B-Instruct-q5_k_m.gguf', chat_template='template.json', is_embedding=False, host='0.0.0.0', port=8075, mem_fraction_static=0.2, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=995653076, constrained_json_whitespace_pattern=None, watchdog_timeout=300, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Traceback (most recent call last):
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 668, in _get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 771, in _dict_from_json_file
    text = reader.read()
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 8: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/launch_server.py", line 16, in <module>
    raise e
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/server.py", line 457, in launch_server
    launch_engine(server_args=server_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/server.py", line 429, in launch_engine
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 103, in __init__
    self.model_config = ModelConfig(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/configs/model_config.py", line 46, in __init__
    self.hf_config = get_config(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 66, in get_config
    config = AutoConfig.from_pretrained(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1017, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
    raise EnvironmentError(
OSError: It looks like the config file at 'Qwen2.5-3B-Instruct-q5_k_m.gguf' is not a valid JSON file.

merrymercy commented 1 day ago

It should be easy to support. Contributions are welcome! Or you can convert the model to the HF format.
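
For the conversion route, recent transformers versions can de-quantize a GGUF file into a regular HF checkpoint via the gguf_file argument (a rough sketch, assuming a transformers version with GGUF support for the model's architecture; the paths here are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

gguf_dir = "."                                  # directory containing the .gguf file
gguf_file = "Qwen2.5-3B-Instruct-q5_k_m.gguf"   # filename from the log above

# De-quantizes the GGUF weights into a regular transformers model.
tokenizer = AutoTokenizer.from_pretrained(gguf_dir, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(gguf_dir, gguf_file=gguf_file)

# Save as a plain HF checkpoint, then point SGLang at the directory:
#   python -m sglang.launch_server --model-path ./qwen2.5-3b-instruct-hf
out_dir = "./qwen2.5-3b-instruct-hf"
tokenizer.save_pretrained(out_dir)
model.save_pretrained(out_dir)

Note that the de-quantized checkpoint is full precision, so it will be considerably larger on disk than the .gguf file.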