Open remixer-dec opened 1 month ago
It should be easy. Could you give us an example of the command you would want us to support?
python -m sglang.launch_server --model-path /path/to/model.gguf
@remixer-dec @merrymercy
How do you serve a GGUF model? vLLM provides a tutorial for this here: https://docs.vllm.ai/en/latest/getting_started/examples/gguf_inference.html
But how do you run inference with SGLang? @merrymercy could you please provide a command for that?
For example, with this model: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
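For reference, the vLLM tutorial linked above boils down to roughly the following sketch (the exact .gguf filename inside the repo is illustrative and should be checked against the repo listing):

from vllm import LLM, SamplingParams
from huggingface_hub import hf_hub_download

# Download one quantized file from the GGUF repo (filename is illustrative).
gguf_path = hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
)

# vLLM takes the .gguf file as the model and the original HF repo for the tokenizer.
llm = LLM(model=gguf_path, tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0, max_tokens=32))
print(outputs[0].outputs[0].text)

The open question in this issue is whether SGLang's launch_server can do the equivalent.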
python -m sglang.launch_server --model-path /path/to/model.gguf
If you haven't tried it, please don't reply. This doesn't work at all.
python -m sglang.launch_server --model-path /home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf --port 30000 --mem-fraction-static 0.8
WARNING 11-08 15:09:00 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
[2024-11-08 15:09:08] server_args=ServerArgs(model_path='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', tokenizer_path='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=879353602, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
WARNING 11-08 15:09:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 11-08 15:09:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
[2024-11-08 15:09:22] Traceback (most recent call last):
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/utils/hub.py", line 403, in cached_file
resolved_file = hf_hub_download(
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
return f(*args, **kwargs)
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf'. Use `repo_type` argument if needed.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/managers/detokenizer_manager.py", line 216, in run_detokenizer_process
manager = DetokenizerManager(server_args, port_args)
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/managers/detokenizer_manager.py", line 72, in __init__
self.tokenizer = get_tokenizer(
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 129, in get_tokenizer
tokenizer = AutoTokenizer.from_pretrained(
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 844, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 676, in get_tokenizer_config
resolved_config_file = cached_file(
File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/utils/hub.py", line 469, in cached_file
raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
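For anyone hitting the same error: the crash happens while loading the tokenizer, before any GGUF-specific weight loading could run. SGLang's get_tokenizer passes the raw .gguf path to AutoTokenizer.from_pretrained, which only accepts a local folder or a Hub repo id. A minimal reproduction outside SGLang (the path is a placeholder):

from transformers import AutoTokenizer

# A bare .gguf file is neither a model folder nor a repo id, so this raises the same
# HFValidationError / OSError seen in the server log above.
AutoTokenizer.from_pretrained("/path/to/v3_merge_Q4_K_M.gguf")

# transformers' own GGUF loader expects a repo plus a gguf_file argument instead,
# e.g. (repo id and filename illustrative):
# AutoTokenizer.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
#                               gguf_file="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")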
@XYZliang maybe read the previous comment? It's not supposed to be working; I replied with the command that I was asked for.
Sorry, I didn't pay attention to which user had posted that comment...
It doesn't work. The error is:
(python311) whk@VM-2-13-ubuntu:~/code/qwen25-3b$ python -m sglang.launch_server --model-path Qwen2.5-3B-Instruct-q5_k_m.gguf --port 8075 --host 0.0.0.0 --mem-fraction-static 0.2 --chat-template template.json
[2024-11-14 11:42:24] server_args=ServerArgs(model_path='Qwen2.5-3B-Instruct-q5_k_m.gguf', tokenizer_path='Qwen2.5-3B-Instruct-q5_k_m.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='Qwen2.5-3B-Instruct-q5_k_m.gguf', chat_template='template.json', is_embedding=False, host='0.0.0.0', port=8075, mem_fraction_static=0.2, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=995653076, constrained_json_whitespace_pattern=None, watchdog_timeout=300, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Traceback (most recent call last):
File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 668, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 771, in _dict_from_json_file
text = reader.read()
File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 8: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/launch_server.py", line 16, in
It should be easy to support. Contributions are welcome! Or you can convert that to HF format.
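On the "convert that to HF format" route, one possible approach is the GGUF de-quantization support in recent transformers releases; a minimal sketch, assuming the gguf_file= loader supports the model architecture (repo id, filename, and output directory are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # GGUF repo, or a local folder containing the file
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"   # illustrative filename, check the repo

# transformers de-quantizes the GGUF weights into regular torch tensors on load.
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

# Save a normal HF checkpoint that SGLang can serve directly.
model.save_pretrained("tinyllama-1.1b-chat-hf")
tokenizer.save_pretrained("tinyllama-1.1b-chat-hf")

After that, python -m sglang.launch_server --model-path ./tinyllama-1.1b-chat-hf should behave like any other HF checkpoint, at the cost of the de-quantized model size.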
Motivation
Hi! Since the .gguf format is already supported by vLLM, would it be possible to add support for it in the SGLang server?
Related resources
No response