lxww302 opened 1 month ago
When the workload is changed to "input2048output1" or "input16output2048", no errors occur.
The stack trace is as follows:
Exception in ControllerSingle:
Traceback (most recent call last):
File "/opt/tiger/sglang_src/python/sglang/srt/managers/controller_single.py", line 166, in start_controller_process
controller.loop_for_forward()
File "/opt/tiger/sglang_src/python/sglang/srt/managers/controller_single.py", line 103, in loop_for_forward
out_pyobjs = self.tp_server.exposed_step(recv_reqs)
File "/opt/tiger/sglang_src/python/sglang/srt/managers/tp_worker.py", line 222, in exposed_step
self.forward_step()
File "/home/tiger/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/opt/tiger/sglang_src/python/sglang/srt/managers/tp_worker.py", line 251, in forward_step
self.forward_decode_batch(self.running_batch)
File "/opt/tiger/sglang_src/python/sglang/srt/managers/tp_worker.py", line 612, in forward_decode_batch
next_token_ids = batch.sample(output.next_token_logits)
File "/opt/tiger/sglang_src/python/sglang/srt/managers/schedule_batch.py", line 760, in sample
batch_next_token_ids, success = top_k_top_p_sampling_from_probs(
File "/home/tiger/.local/lib/python3.9/site-packages/flashinfer/sampling.py", line 483, in top_k_top_p_sampling_from_probs
renorm_probs = top_k_renorm_prob(probs, top_k, **kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/flashinfer/sampling.py", line 557, in top_k_renorm_prob
return _kernels.top_k_renorm_prob(probs, *_to_tensor_scalar_tuple(top_k), eps)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 79.11 GiB of which 862.56 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 74.94 GiB is allocated by PyTorch, with 318.62 MiB allocated in private pools (e.g., CUDA Graphs), and 1.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
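Incidentally, the allocator message above already suggests one mitigation; a minimal sketch of relaunching the server with expandable segments enabled (reusing the launch command from the Reproduction section below, not a confirmed fix) would be:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 -m sglang.launch_server --model-path /models/dummy --disable-radix-cache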
Describe the bug
We pretrained a 7B model with MQA (num_key_value_heads=1). To benchmark throughput, I modified the meta-llama-3 config and set num_key_value_heads=1. The service crashes when receiving workloads with input_len=output_len=1024.
Reproduction
serving:
python3 -m sglang.launch_server --model-path /models/dummy --disable-radix-cache
where /models/dummy is simply copied from mistral-7b-instruct-v0.2, with num_key_value_heads=1 set in config.json (see the sketch below).
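For completeness, a minimal sketch of that config patch in Python (assuming the mistral-7b-instruct-v0.2 files have already been copied into /models/dummy; the path is taken from the launch command above):

import json

config_path = "/models/dummy/config.json"

# Load the config copied from mistral-7b-instruct-v0.2.
with open(config_path) as f:
    config = json.load(f)

# MQA: all query heads share a single key/value head.
config["num_key_value_heads"] = 1

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)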
benchmarking:
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 1024 --random-output 1024 --output-file offline.jsonl
Environment