wizd opened this issue 1 month ago
It seems to be expected because the server uses a busy waiting loop to wait for new requests. Does this create any serious trouble on your side?
I agree. In today's world, a single CPU core is not particularly significant. However, for those who are unfortunate enough to have an Intel chip, it's a completely different story: https://www.reddit.com/r/intel/comments/1egthzw/megathread_for_intel_core_13th_14th_gen_cpu/
Unfortunately it's not just sglang; I'm seeing this with my own PyTorch code.
+1
> It seems to be expected because the server uses a busy waiting loop to wait for new requests. Does this create any serious trouble on your side?
Given that I still use an i5-8400, which only has six cores, and I'm running an LLM with dual GPUs, two of those cores are often pinned at 100%, which has a noticeable impact.
Design-wise, I think adding a sleep of 0.001 or 0.01 seconds should help. Otherwise the loop makes tens of thousands of polling attempts per second, which is unnecessarily wasteful.
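To illustrate the trade-off, here is a minimal standalone sketch (not sglang's actual code) comparing a tight polling loop with one that sleeps between polls. The poll-count ratio is a rough proxy for wasted CPU work:

```python
import time

def poll_busy(duration: float) -> int:
    """Tight loop: burns a full core while waiting."""
    deadline = time.monotonic() + duration
    n = 0
    while time.monotonic() < deadline:
        n += 1  # stands in for a non-blocking "check for new requests"
    return n

def poll_with_sleep(duration: float, interval: float = 0.001) -> int:
    """Same loop, but yields the CPU between polls."""
    deadline = time.monotonic() + duration
    n = 0
    while time.monotonic() < deadline:
        n += 1
        time.sleep(interval)  # caps the poll rate at roughly 1000/sec
    return n

busy = poll_busy(0.2)
slept = poll_with_sleep(0.2)
print(busy, slept)  # busy is orders of magnitude larger
```

With `interval=0.001` the worst-case extra latency per request is about a millisecond, which matches the observation below that the impact on inference speed is barely noticeable.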
Adding time.sleep(0.001) around line 910 in /opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py significantly reduces CPU usage. Note that the sleep duration should not be too long, as it may affect inference speed. I set it to 0.001, and the impact on inference is barely noticeable. Refer to this example:
```python
def run_tp_server(
    gpu_id: int,
    tp_rank: int,
    server_args: ServerArgs,
    nccl_port: int,
    model_override_args: dict,
):
    """Run a tensor parallel model server."""
    configure_logger(server_args, prefix=f" TP{tp_rank}")

    try:
        model_server = ModelTpServer(
            gpu_id,
            tp_rank,
            server_args,
            nccl_port,
            model_override_args,
        )
        tp_cpu_group = model_server.model_runner.tp_group.cpu_group

        while True:
            recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
            model_server.exposed_step(recv_reqs)
            time.sleep(0.001)  # yield the CPU instead of busy-waiting
    except Exception:
        logger.error("Exception in run_tp_server:\n" + get_exception_traceback())
        raise
```
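A longer-term fix, assuming the request transport could be switched to a blocking primitive, would be to wait with a timeout instead of polling at all. This is a hypothetical sketch using the standard-library `queue.Queue` (not sglang's actual transport); the worker thread consumes no CPU while idle because `get(timeout=...)` blocks:

```python
import queue
import threading
import time

requests: "queue.Queue[str]" = queue.Queue()

def worker(stop: threading.Event, handled: list) -> None:
    """Block until a request arrives; wake briefly to check the stop flag."""
    while not stop.is_set():
        try:
            req = requests.get(timeout=0.1)  # sleeps in the kernel, not a busy loop
        except queue.Empty:
            continue
        handled.append(req)

stop = threading.Event()
handled: list = []
t = threading.Thread(target=worker, args=(stop, handled))
t.start()

requests.put("req-1")
time.sleep(0.2)  # give the worker time to pick it up
stop.set()
t.join()
print(handled)  # ['req-1']
```

The trade-off is that a blocking design requires the receive path (here `broadcast_recv_input`) to support blocking or timed waits, which a fixed `time.sleep` workaround does not.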
Describe the bug
The output of the `top` command (screenshot not preserved in this export):