sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
Apache License 2.0

Seems only GPU 0 is being used even when in tensor parallel across 2 GPUs #536

Closed aflah02 closed 2 weeks ago

aflah02 commented 2 weeks ago

Hi, I was recently running Llama3-70B on a 2xH100 server. I noticed that all the log messages only mention [gpu_id=0] and was wondering whether this means GPU 1 isn't being used to serve requests at all. When I check GPU memory usage, both GPUs are filled to the brim and show high utilization, which implies both are in use, but then I'm not sure why there are no lines in the log for GPU 1.


Qubitium commented 2 weeks ago

@aflah02 I can check on this later, but maybe the format is just wrong for these log entries. The decode/tokenizer manager runs in a process separate from GPU inference, so it has no tie to a GPU. Perhaps the code is forcing a gpu_id into the output even when the log comes from a CPU-only task.

aflah02 commented 2 weeks ago

Thanks @Qubitium

Qubitium commented 2 weeks ago

@aflah02 The code is actually in the tp worker process where GPU work is done, but the log print is gated by tp_rank == 0, so even when you have 2 GPUs and tp == 2, the log only prints the id of the first rank. Effectively, for these stat logs, you should not use gpu_id to judge whether multi-GPU tensor parallelism is working.

https://github.com/sgl-project/sglang/blob/40e53d65cbb8b609a6ff8e977d2318044d0f0ee0/python/sglang/srt/managers/controller/tp_worker.py#L232-L250
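The gating described above can be sketched as follows. This is a minimal illustration of the pattern, not SGLang's actual code; the function and field names here are hypothetical:

```python
# Sketch of rank-gated stat logging: every tensor-parallel rank does GPU
# work, but only rank 0 emits the stat line, so the printed gpu_id is
# always the first rank's GPU. (Hypothetical names, for illustration.)

def log_stats(tp_rank: int, gpu_id: int, stats: dict) -> list[str]:
    """Return the log lines this rank would emit."""
    if tp_rank != 0:
        return []  # ranks 1..N-1 stay silent for these stat logs
    line = f"[gpu_id={gpu_id}] " + ", ".join(f"{k}={v}" for k, v in stats.items())
    return [line]

# Simulate tp == 2: both GPUs run, but only gpu_id=0 appears in the log.
lines = []
for rank, gpu in [(0, 0), (1, 1)]:
    lines.extend(log_stats(rank, gpu, {"running_req": 12}))
print(lines)  # only one line, tagged gpu_id=0
```

So an absent `[gpu_id=1]` line says nothing about whether GPU 1 is doing work; the memory and utilization figures are the better signal.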

aflah02 commented 2 weeks ago

Thank you! This clears it up.