sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/
Apache License 2.0

[Bug] pt_main_thread uses 100% cpu all the time #955

Open wizd opened 1 month ago

wizd commented 1 month ago


Describe the bug

The output of the top command:

top - 10:27:25 up 33 min,  1 user,  load average: 1.22, 1.24, 1.16
Tasks: 730 total,   2 running, 728 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.2 us,  0.2 sy,  0.0 ni, 96.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 192964.5 total, 144154.6 free,  13737.5 used,  35072.4 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 177252.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  61758 root      20   0   34.5g   1.5g 382668 R 101.0   0.8   8:53.40 pt_main_thread
   1500 root      20   0 7779860 232140  66836 S   1.0   0.1   0:59.39 dockerd
  16283 super     20   0   54.1g 256304  55712 S   1.0   0.1   0:36.18 node
  16943 super     20   0   54.2g 286160  56024 S   1.0   0.1   0:31.39 node
  58325 7474      20   0   45.9g 956696  25720 S   1.0   0.5   0:24.76 java
  60533 super     20   0   71.6g 363292  43680 S   0.7   0.2   0:10.00 node
      1 root      20   0  167160  12316   8132 S   0.3   0.0   0:04.91 systemd
     14 root      20   0       0      0      0 I   0.3   0.0   0:01.87 rcu_sched
   1259 systemd+  20   0   26068  14044   9412 S   0.3   0.0   0:07.33 systemd-resolve
...

(base) super@hot:~/apps/sglang$ ps -auex | grep 61758
root       61758  100  0.7 36159428 1575172 ?    Rl   10:18   9:27 python3 -m sglang.launch_server --model-path /models/Llama3.1-8B-Chinese-Chat --host 0.0.0.0 --port 5011 --quantization fp8 --context-length 48000 --tokenizer-mode auto --mem-fraction-static 0.8 --disable-radix-cache

Reproduction

  sglang:
    image: sglang
    runtime: nvidia
    ports:
      - '5011:5011'
    volumes:
      - $DATA_ROOT/huggingface_cache:/root/.cache/huggingface
      - $DATA_ROOT/models:/models
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - HF_TOKEN=hf_kcyu...
    command: >
      python3 -m sglang.launch_server
      --model-path /models/Llama3.1-8B-Chinese-Chat
      --host 0.0.0.0
      --port 5011
      --quantization fp8
      --context-length 48000
      --tokenizer-mode auto
      --mem-fraction-static 0.8
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ipc: host
    restart: unless-stopped

Environment

Ubuntu 22.04 AMD64
merrymercy commented 1 month ago

It seems to be expected, because the server uses a busy-waiting loop to wait for new requests. Does this create any serious trouble on your side?
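For illustration, here is a minimal sketch of what such a busy-waiting receive loop looks like (this is not sglang's actual code; the zmq endpoint and message handling are made up). Because the non-blocking receive returns immediately when nothing is queued, the loop spins continuously, which is why top shows the thread at ~100% CPU even when the server is idle:

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)
sock.bind("tcp://127.0.0.1:5555")  # hypothetical endpoint

while True:
    try:
        # Non-blocking receive: returns at once, raising zmq.Again
        # if no message is waiting.
        req = sock.recv_pyobj(flags=zmq.NOBLOCK)
        print("got request:", req)
    except zmq.Again:
        # Nothing arrived, but the loop immediately polls again,
        # keeping one CPU core fully busy.
        pass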

wizd commented 1 month ago

I agree. In today's world, a single CPU core is not particularly significant. However, for those who are unfortunate enough to have an Intel chip, it's a completely different story: https://www.reddit.com/r/intel/comments/1egthzw/megathread_for_intel_core_13th_14th_gen_cpu/

drtonyr commented 4 weeks ago

Unfortunately it's not just sglang; I'm seeing this with my own PyTorch code.

wjtgoo commented 3 weeks ago

> Unfortunately it's not just sglang; I'm seeing this with my own PyTorch code.

+1

xbl916 commented 1 week ago

> It seems to be expected, because the server uses a busy-waiting loop to wait for new requests. Does this create any serious trouble on your side?

Given that I still use an i5-8400, which has only six cores, and I'm running an LLM on dual GPUs, two of those cores are often pinned at 100%, which does have a noticeable impact.

elliotgao commented 6 days ago

Design-wise, I think adding a sleep of 0.001 s or 0.01 s should help. Otherwise there are tens of thousands of polling attempts per second, which is needlessly wasteful.
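To put rough numbers on that, here is a quick micro-benchmark (illustrative only; iterations_per_second is a made-up helper, and exact counts vary by machine). An unthrottled loop typically runs millions of iterations per second, while a 1 ms sleep caps it around one thousand:

import time
from typing import Optional

def iterations_per_second(sleep_s: Optional[float]) -> int:
    """Count how many loop iterations fit into one second."""
    deadline = time.monotonic() + 1.0
    count = 0
    while time.monotonic() < deadline:
        if sleep_s is not None:
            time.sleep(sleep_s)
        count += 1
    return count

print("busy-wait: ", iterations_per_second(None))   # typically millions
print("sleep 1 ms:", iterations_per_second(0.001))  # roughly 1000 at most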

xbl916 commented 5 days ago

Adding time.sleep(0.001) around line 910 in /opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py significantly reduces CPU usage. Note that the sleep duration should not be too long, as it can slow down inference; I set it to 0.001, and the impact on inference speed is barely noticeable. Refer to this example:

import time  # required for the sleep added below

def run_tp_server(
    gpu_id: int,
    tp_rank: int,
    server_args: ServerArgs,
    nccl_port: int,
    model_override_args: dict,
):
    """Run a tensor parallel model server."""
    configure_logger(server_args, prefix=f" TP{tp_rank}")

    try:
        model_server = ModelTpServer(
            gpu_id,
            tp_rank,
            server_args,
            nccl_port,
            model_override_args,
        )
        tp_cpu_group = model_server.model_runner.tp_group.cpu_group

        while True:
            recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
            model_server.exposed_step(recv_reqs)
            time.sleep(0.001)  # added: yield the CPU briefly between polls
    except Exception:
        logger.error("Exception in run_tp_server:\n" + get_exception_traceback())
        raise
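A possible refinement of the loop above (a sketch only, not tested against sglang internals): sleep only when the previous poll returned no new requests, so a busy server pays no extra latency per step. This assumes recv_reqs is an empty list when nothing arrived; note that exposed_step may still be decoding an in-flight batch, so whether "no new requests" really means "idle" depends on ModelTpServer's internals:

while True:
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
    model_server.exposed_step(recv_reqs)
    # Assumption: recv_reqs is falsy (e.g. []) when no requests arrived
    # this iteration. Yield the CPU only then, so the sleep never slows
    # down a server that is actively receiving work.
    if not recv_reqs:
        time.sleep(0.001)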