Thanks for reporting the errors. Can you try adding --max-prefill 8192 when you launch the server? i.e.
python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --context-length 4096 --dtype bfloat16 --max-prefill 8192
If it still OOMs, try --max-prefill 4096 or --mem-fraction-static 0.83 instead.
@merrymercy thanks for the prompt response. It works with --max-prefill 4096. By the way, is the backend vLLM? What are the available backends?
For the tokenizer, how should it behave if I don't pass a chat template? The LLM is loaded without a chat template, as shown below:
chat_template=None,
server_args=ServerArgs(model_path='NousResearch/Meta-Llama-3-8B-Instruct', tokenizer_path='NousResearch/Meta-Llama-3-8B-Instruct', tokenizer_mode='auto', load_format='auto', dtype='bfloat16', trust_remote_code=False, context_length=4096, quantization=None, served_model_name='NousResearch/Meta-Llama-3-8B-Instruct', chat_template=None, host='0.0.0.0', port=8000, additional_ports=[8001, 8002, 8003, 8004], mem_fraction_static=0.88, max_prefill_tokens=4096, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=746231706, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', file_storage_pth='SGlang_storage', dp_size=1, load_balance_method='round_robin', chunked_prefill_size=None, disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
Also, are there any other factors that influence concurrent requests that I should check?
We use the chat template in the Hugging Face tokenizer by default if it is available (see https://github.com/sgl-project/sglang/blob/8207637029082563cab74951fe8d5f86b574b85e/python/sglang/srt/openai_api/adapter.py#L701-L70), so you do not need to specify the chat template.
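For illustration, here is a minimal sketch of what falling back to the tokenizer's bundled chat template looks like. This is plain transformers usage, not SGLang's internal code path, and the message content is made up:

```python
# Sketch: render a conversation with the chat template stored alongside the
# Hugging Face tokenizer (tokenizer_config.json). This mirrors what happens
# when no --chat-template is passed; it is not SGLang's internal code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")

messages = [{"role": "user", "content": "Hello, who are you?"}]  # placeholder message

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,             # return the formatted string instead of token ids
    add_generation_prompt=True, # append the assistant header so the model completes it
)
print(prompt)
```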
SGLang is a standalone server. It uses attention CUDA kernels from FlashInfer, linear layers from vLLM, and Triton operators generated by torch.compile. For your 8B model, you can try adding --enable-torch-compile, which will be faster. It is not very precise to say that "the backend is vLLM", since SGLang integrates many high-performance components.
See this guide on tuning the parameters for high throughput https://github.com/sgl-project/sglang/blob/main/docs/en/hyperparameter_tuning.md
There is definitely a significant regression around OOM in recent versions. Workloads and configurations that previously ran at high concurrency for long periods without issues are now OOM'ing, even with a lower static memory pool fraction and reduced concurrency.
Hello, is there any solution for this now?
This commit seems to have improved things (too early to say if resolved, but early testing is working better): https://github.com/sgl-project/sglang/commit/df191254abc002b3284560d9c4b94214a4656265
Describe the bug
I am trying to benchmark inference of Llama-3-8B with long requests. I send 20 concurrent requests, each about 1k tokens long, with stream set to True and max_tokens set to 1024.
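A rough sketch of this kind of benchmark client against the OpenAI-compatible endpoint is below; the prompt text, port, and exact payloads are illustrative placeholders, not the real benchmark data:

```python
# Sketch of the benchmark client: 20 concurrent streaming chat requests with
# roughly 1k-token prompts and max_tokens=1024 against the OpenAI-compatible
# server started below. The prompt is a placeholder, not real benchmark data.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "lorem ipsum " * 500  # stand-in for a roughly 1k-token request

def one_request(i: int) -> int:
    stream = client.chat.completions.create(
        model="NousResearch/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=1024,
        stream=True,
    )
    # Drain the stream; return the number of chunks received.
    return sum(1 for _ in stream)

with ThreadPoolExecutor(max_workers=20) as pool:
    print(list(pool.map(one_request, range(20))))
```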
This is how I start the server:
python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --context-length 4096 --dtype bfloat16 --chat-template llama-3
I added the llama-3 template to conversation.py, as it is present in the conversation file of FastChat.
Error:
Reproduction
Same as the "Describe the bug" section.
Environment