Eutenacity opened 7 months ago
When I dig deeper into vLLM, I find that the blank mainly comes from the Ray worker in vllm/engine/llm_engine.py.
So I want to know: what is the Ray worker used for?
The Ray worker is used to host workers on different GPUs for tensor-parallel inference. The gap you are observing is probably kernel launch overhead. Which version are you using here? Newer vLLM versions have CUDA graph capture enabled, which reduces this overhead.
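For reference, a configuration sketch (not runnable without vLLM and multiple GPUs; the model name and GPU count are illustrative placeholders) of the `enforce_eager` flag that controls CUDA graph capture in recent vLLM versions:

```python
# Sketch only: requires vLLM installed and a multi-GPU machine.
# The checkpoint name and tensor_parallel_size are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder checkpoint
    tensor_parallel_size=2,   # one Ray-hosted worker per GPU shard
    enforce_eager=False,      # False lets vLLM capture CUDA graphs
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
```

With `enforce_eager=True` every decode step launches its kernels one by one, so the per-step launch overhead you see as blanks in the profile would remain.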
0.3.0
I used torch.profiler.profile() to profile Mixtral running on vLLM, and I found lots of blank time before each running step.
When I compared the time cost of vLLM with that of TensorRT-LLM, I found that TensorRT-LLM is 1.5x faster than vLLM. But comparing the time cost of each component, including the attention, the experts, and the all-reduce, vLLM and TensorRT-LLM perform nearly the same.
So I suspect that the blank before each running step in vLLM causes the slower performance, but I can find nothing that explains where the blank comes from.
Can you give me some help?
Here is the code I used to profile Mixtral:
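The attached code did not come through; below is a minimal, self-contained sketch of the kind of torch.profiler loop described above. A small matmul stands in for a Mixtral forward step, and everything runs on CPU here; on GPU you would add `ProfilerActivity.CUDA` to `activities` and replace `run_step` with the vLLM generate call.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def run_step(x, w):
    # Placeholder for one model forward step.
    return x @ w

x = torch.randn(64, 64)
w = torch.randn(64, 64)

# Profile a few "steps"; gaps between steps show up as blank
# regions when the trace is exported and viewed in a timeline.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(3):
        run_step(x, w)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
# prof.export_chrome_trace("trace.json")  # open in chrome://tracing to see gaps
```

Exporting a Chrome trace and inspecting the timeline is what makes the per-step blanks (kernel launch or scheduling overhead) visible, rather than the aggregate table alone.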