sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/

[Bug] cuda out of memory when using MQA and input_len=output_len=1024 #1087

Open lxww302 opened 1 month ago

lxww302 commented 1 month ago

Describe the bug

We have pretrained a 7B model with MQA (num_key_value_heads=1). To benchmark throughput, I modified the config of meta-llama-3 and set num_key_value_heads=1. The server crashes with a CUDA out-of-memory error when it receives a workload with input_len=output_len=1024.
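
For scale, here is a rough back-of-envelope sketch (my own arithmetic, not taken from the report) of what num_key_value_heads=1 means for the KV cache of a 7B model; the layer count, head dimension, and fp16 cache dtype are assumptions based on the public mistral-7b-instruct-v0.2 config:

# Per-token KV cache size; the architecture numbers are assumptions from the
# public mistral-7b-instruct-v0.2 config, not values read from this server.
NUM_LAYERS = 32
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16 KV cache

def kv_bytes_per_token(num_key_value_heads: int) -> int:
    # one K vector and one V vector of size HEAD_DIM per KV head, per layer
    return 2 * NUM_LAYERS * num_key_value_heads * HEAD_DIM * BYTES_PER_ELEM

print(kv_bytes_per_token(8))  # stock GQA config: 131072 bytes (~128 KiB) per token
print(kv_bytes_per_token(1))  # MQA: 16384 bytes (16 KiB) per token, 8x smaller

Since the KV pool is sized from whatever memory remains after loading the weights, an 8x smaller per-token cost should let the server admit roughly 8x more token slots, and therefore far more concurrent requests, than the stock config would.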

Reproduction

serving: python3 -m sglang.launch_server --model-path /models/dummy --disable-radix-cache, where /models/dummy is a copy of mistral-7b-instruct-v0.2 with num_key_value_heads=1 set in config.json
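
For completeness, a minimal sketch of how such a dummy config can be produced; the source path is hypothetical, and /models/dummy matches the serving command above:

import json
import shutil

# Copy the base checkpoint to the dummy path used by the serving command.
shutil.copytree("/models/mistral-7b-instruct-v0.2", "/models/dummy")

cfg_path = "/models/dummy/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["num_key_value_heads"] = 1  # switch from GQA (8 KV heads) to MQA
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)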

benchmarking: python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 1024 --random-output 1024 --output-file offline.jsonl
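
As a rough sanity check on this workload (the pool sizes below are illustrative assumptions, not values reported by this server): each request needs up to input_len + output_len = 2048 KV slots, and with the 8x smaller MQA cache the token pool holds far more concurrent requests than the stock config would.

# Back-of-envelope concurrency estimate; both pool sizes are assumed round
# numbers for illustration only.
TOKENS_PER_REQUEST = 1024 + 1024           # random-input + random-output
ASSUMED_POOL_TOKENS_MQA = 4_000_000        # hypothetical max_total_num_tokens at 16 KiB/token
ASSUMED_POOL_TOKENS_STOCK = 500_000        # hypothetical value at 128 KiB/token

print(ASSUMED_POOL_TOKENS_MQA // TOKENS_PER_REQUEST)    # ~1950 concurrent requests
print(ASSUMED_POOL_TOKENS_STOCK // TOKENS_PER_REQUEST)  # ~240 concurrent requests

A much larger running batch also inflates any per-request temporaries allocated during decoding, which is consistent with (though not proof of) the sampling-time OOM in the stack trace below.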

Environment

Python: 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 535.129.03
PyTorch: 2.4.0+cu121
sglang: 0.2.12
flashinfer: 0.1.4+cu121torch2.4
triton: 3.0.0
transformers: 4.45.0.dev0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.112.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 24.0
PIL: 10.3.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.5
uvloop: 0.19.0
zmq: 26.1.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.5
anthropic: 0.32.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-51,104-155    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-51,104-155    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-51,104-155    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-51,104-155    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    52-103,156-207  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    52-103,156-207  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    52-103,156-207  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      52-103,156-207  1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024768
lxww302 commented 1 month ago

When the workload was changed to "input 2048 / output 1" or "input 16 / output 2048", no errors occurred.

lxww302 commented 1 month ago

The stack trace is as follows:

Exception in ControllerSingle:
Traceback (most recent call last):
  File "/opt/tiger/sglang_src/python/sglang/srt/managers/controller_single.py", line 166, in start_controller_process
    controller.loop_for_forward()
  File "/opt/tiger/sglang_src/python/sglang/srt/managers/controller_single.py", line 103, in loop_for_forward
    out_pyobjs = self.tp_server.exposed_step(recv_reqs)
  File "/opt/tiger/sglang_src/python/sglang/srt/managers/tp_worker.py", line 222, in exposed_step
    self.forward_step()
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tiger/sglang_src/python/sglang/srt/managers/tp_worker.py", line 251, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/opt/tiger/sglang_src/python/sglang/srt/managers/tp_worker.py", line 612, in forward_decode_batch
    next_token_ids = batch.sample(output.next_token_logits)
  File "/opt/tiger/sglang_src/python/sglang/srt/managers/schedule_batch.py", line 760, in sample
    batch_next_token_ids, success = top_k_top_p_sampling_from_probs(
  File "/home/tiger/.local/lib/python3.9/site-packages/flashinfer/sampling.py", line 483, in top_k_top_p_sampling_from_probs
    renorm_probs = top_k_renorm_prob(probs, top_k, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/flashinfer/sampling.py", line 557, in top_k_renorm_prob
    return _kernels.top_k_renorm_prob(probs, *_to_tensor_scalar_tuple(top_k), eps)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 79.11 GiB of which 862.56 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 74.94 GiB is allocated by PyTorch, with 318.62 MiB allocated in private pools (e.g., CUDA Graphs), and 1.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
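
For scale: per the stack trace, the failure is in flashinfer's top_k_renorm_prob, which operates on a probability tensor of shape [batch_size, vocab_size], so this temporary grows with the running batch. A back-of-envelope estimate (the batch size, vocabulary sizes, and float32 dtype below are illustrative assumptions, not values read from this run):

# Size of a [batch, vocab] float32 probability buffer; the inputs are
# illustrative assumptions, not values read from this server.
def probs_buffer_gib(batch_size: int, vocab_size: int, bytes_per_elem: int = 4) -> float:
    return batch_size * vocab_size * bytes_per_elem / 1024**3

print(probs_buffer_gib(4000, 32000))   # ~0.48 GiB with a Mistral-sized vocab
print(probs_buffer_gib(4000, 128256))  # ~1.91 GiB with a Llama-3-sized vocab

The reported 1.96 GiB request is in this range and far exceeds the ~860 MiB that was still free, so the decode step fails at the sampling kernel rather than in attention itself.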
zhyncs commented 4 weeks ago

ref https://github.com/sgl-project/sglang/blob/main/docs/en/hyperparameter_tuning.md
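
Following that doc, a first thing to try (an illustrative command, not a verified fix for this report; flag names per the linked doc) is to leave more headroom for intermediate tensors by lowering --mem-fraction-static and capping the running batch, e.g.: python3 -m sglang.launch_server --model-path /models/dummy --disable-radix-cache --mem-fraction-static 0.8 --max-running-requests 512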