vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Simultaneous mm calls lead to permanently degraded performance. #9283

Open SeanIsYoung opened 1 month ago

SeanIsYoung commented 1 month ago

Your current environment

The output of `python collect_env.py` ```text Collecting environment information... PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.4 LTS (x86_64) - WSL GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.35 Python version: 3.11.10 (main, Oct 3 2024, 07:29:13) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 11.5.119 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080 SUPER Nvidia driver version: 555.99 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 28 On-line CPU(s) list: 0-27 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i7-14700KF CPU family: 6 Model: 183 Thread(s) per core: 2 Core(s) per socket: 14 Socket(s): 1 Stepping: 1 BogoMIPS: 6835.20 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities Virtualization: VT-x Hypervisor vendor: Microsoft Virtualization type: full L1d cache: 672 KiB (14 instances) L1i cache: 448 KiB (14 instances) L2 cache: 28 MiB (14 instances) L3 cache: 33 MiB (1 instance) Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Mitigation; Enhanced IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.6.77 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.45.1 [pip3] triton==3.0.0 [conda] numpy 1.26.4 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 
pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.560.30 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pyzmq 26.2.0 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] transformers 4.45.1 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.1.dev238+ge2c6e0a82 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Model Input Dumps

No response

🐛 Describe the bug

I don't know whether this is user error or a bug, but I'm doing multimodal inference and noticed that I could have multiple instances of my script making calls to the model and each would get the same throughput, leading to a higher average generation throughput. Having noticed this, I modified my script to be asynchronous. But when I run it, the average generation throughput is now worse than if I were just running one instance. And when I go back to running a single synchronous instance, I'm now getting even worse throughput. The only way to get it back to its original performance is to restart the model. I have, however, noticed that if the calls are still asynchronous but staggered with a delay, I have no problems.

To clarify with some rough numbers, the below are the results of running the different scripts against a single vLLM server.

| script setup | avg gen throughput (tokens/s) |
| --- | --- |
| A single synchronous instance | ~50 |
| 3 manually started synchronous instances (running at the same time in different terminals) | ~150 |
| 1 asynchronous instance making 3 calls at a time | ~45 |
| A single synchronous instance after running async | ~15 |
| Restarted vLLM server | N/A |
| 1 asynchronous instance making 3 staggered calls | ~150 |

Looking at these, it appears that something about the way I'm running the asynchronous version hits some sort of "resource limit", so the amount of resources each call is assigned is capped (15 * 3 = 45). But when running again with a smaller load, this "call limit" is still in place. Maybe?

Possibly supporting this theory is the fact that when I run the staggered version the VRAM usage doesn't get as high. But I don't know why the VRAM usage is lower, because in the end it's still running 3 requests at once, and I don't have enough headroom to truly test it.

You can reproduce this with the code below. Just update the OpenAI connection to point at your own model. You can achieve each of the different behaviours by adjusting the num_parallel and delay parameters.

Serving Command:

vllm serve $MODELS/openbmb/MiniCPM-V-2_6 --enforce-eager --max-model-len 4096 --max-seq-len 2048 --served-model-name hello --trust-remote-code --limit-mm-per-prompt image=2 --gpu-memory-utilization 0.95 --quantization fp8

Code:

import asyncio

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage
from langchain_core.runnables import Runnable

from langchain_openai import ChatOpenAI

def get_transcribe_chain() -> Runnable:
    chat = ChatOpenAI(model="hello", base_url="http://localhost:8000/v1", api_key="EMPTY", temperature=0.0)
    prompt_template = ChatPromptTemplate.from_messages([
        SystemMessage(content="You are an expert long format writer."),
        ( "user", [ 
            { "type": "image_url", "image_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } },
            { "type": "text", "text": "Write me a really really long and beautiful description of the image, it should be a least 4000 words." }
        ]),
    ])
    return prompt_template | chat

async def _transcribe_page( semaphore: asyncio.Semaphore, transcribe_chain: Runnable, delay: float) -> None:
    async with semaphore:
        # VLLM seems to complain if you hit it with too many at once.
        # So each of the pages is staggered.
        await asyncio.sleep(delay)
        await transcribe_chain.ainvoke(input={"input": "hi"})

async def _transcribe_pages_async( transcribe_chain: Runnable, num_parallel: int, delay: int) -> None:
    semaphore = asyncio.Semaphore(num_parallel)

    tasks = []
    for page_number in range(1, 101):
        tasks.append(_transcribe_page(
            semaphore=semaphore,
            transcribe_chain=transcribe_chain,
            delay=delay * (page_number % num_parallel),
        ))

    await asyncio.gather(*tasks)

def transcribe_pages( transcribe_chain: Runnable, num_parallel: int=2, delay=2) -> None:
    try:
        loop = asyncio.get_event_loop()
        if loop.is_closed():
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
    except RuntimeError:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)

    loop.run_until_complete(_transcribe_pages_async(transcribe_chain, num_parallel, delay))

if __name__ == "__main__":
    # Set this to 1 for a synchronous solution and higher for async.
    num_parallel = 3

    # Set this to 0 for true async or provide a float for n seconds staggering.
    delay = 2

    while True:
        transcribe_pages(get_transcribe_chain(), num_parallel, delay)


DarkLight1337 commented 1 month ago

Thanks for reporting this! Can you profile the code and see where the bottleneck is occurring?

DarkLight1337 commented 1 month ago

cc @ywang96

SeanIsYoung commented 1 month ago

I think I've run the profiler. At least I followed the documentation here and found the start_profile and stop_profile routes.
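For reference, the calls look roughly like this (a minimal sketch; it assumes the server was launched with the VLLM_TORCH_PROFILER_DIR environment variable set so that the profiler routes are registered):

```python
import requests

BASE_URL = "http://localhost:8000"

# Start collecting a torch profiler trace. These routes are only available when
# the server was launched with VLLM_TORCH_PROFILER_DIR pointing at a writable
# directory, so treat this as a sketch of the calls rather than a guaranteed recipe.
requests.post(f"{BASE_URL}/start_profile")

# ... run the reproduction script from above while the profiler is recording ...

# Stop recording and flush the trace files to the configured directory.
requests.post(f"{BASE_URL}/stop_profile")
```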

Unfortunately I haven't been able to view the logs. I've tried both Perfetto and TensorBoard: Perfetto keeps throwing a segmentation fault when I try to load the log file, and TensorBoard was throwing the error TypeError: MessageToJson() got an unexpected keyword argument 'including_default_value_fields'. I tried again after installing TensorBoard from source and it just says No dashboards are active for the current data set.

I don't know if this is something I'm doing wrong, a bug with vllm or a bug with the analysis engines. Either way I can't actually view the traces.

If you want to have a look I've uploaded the traces here

DarkLight1337 commented 1 month ago

I'm able to read the logs, but since the logfile is so huge it takes quite a while to load, and it outright crashes when I try to view the details from the stack trace. After installing via pip install torch_tb_profiler, I ran the command tensorboard --logdir=<directory_containing_the_json_file>.

SeanIsYoung commented 1 month ago

That worked, thanks.

You've probably already seen this then, but in case it's useful to anyone else: [screenshot]

Staggered / Baseline: [screenshot]

Async / Experimental: [screenshot]

DarkLight1337 commented 1 month ago

We also need the stack trace info, since those graphs only keep track of PyTorch operations. I suspect the overhead is in the vLLM engine itself.

SeanIsYoung commented 1 month ago

Assuming I've done the right thing using VLLM_TRACE_FUNCTION=1, the stack traces have been uploaded as well, though I'm afraid I have no idea how to pull anything useful from them.

I've also had a play around with v0.6.3 and the problem is still largely present (with both openbmb/MiniCPM-V-2_6 and Qwen/Qwen2-VL-7B-Instruct). However, this version seems to have lower memory usage (at least in this use case), and under the original conditions with 3 async calls the slow-down is no longer present, but after increasing the number of async calls it eventually returns. Looking at the memory usage, it appears that the slow-down only appears once vLLM starts making use of the shared GPU memory (system RAM pretending it's VRAM). But as before, the curious thing is that staggering the calls uses significantly less memory, to the point that I actually run out of KV cache before it can start using the shared memory.

DarkLight1337 commented 1 month ago

Also cc @youkaichao regarding this, I don't think this has anything to do with broadcasting though since TP/PP isn't being used here.

DarkLight1337 commented 1 month ago

> Assuming I've done the right thing using VLLM_TRACE_FUNCTION=1, the stack traces have been uploaded as well, though I'm afraid I have no idea how to pull anything useful from them.
>
> I've also had a play around with v0.6.3 and the problem is still largely present (with both openbmb/MiniCPM-V-2_6 and Qwen/Qwen2-VL-7B-Instruct). However, this version seems to have lower memory usage (at least in this use case), and under the original conditions with 3 async calls the slow-down is no longer present, but after increasing the number of async calls it eventually returns. Looking at the memory usage, it appears that the slow-down only appears once vLLM starts making use of the shared GPU memory (system RAM pretending it's VRAM). But as before, the curious thing is that staggering the calls uses significantly less memory, to the point that I actually run out of KV cache before it can start using the shared memory.

Sorry, I meant the trace information in the original logs collected via start_profile and stop_profile. Let me try extracting a subset of the logs and see if they can load in TensorBoard. (It seems that the "Trace" view is empty; I'm also quite new to this, so I'm not sure why that is.)

It seems that even the trace information only includes PyTorch op/module calls, so perhaps we should run a regular Python profiler instead.

SeanIsYoung commented 1 month ago

Had a go using py-spy; I think that might be more what you were after. I've uploaded the results with the rest. The actual SVG files are interactive, so they provide a bit more information than what's shown below, but the speedscope files are probably more informative.

To look at the speedscope files you'll either need to install speedscope or use the web app

Staggered: [screenshot]

Async: [screenshot]

DarkLight1337 commented 1 month ago

Thanks for providing the info! This shows that sampling is taking much longer in the async case. Perhaps there is an issue in the multi-step scheduler? @comaniac @afeldman-nm are you available to look into this?

DarkLight1337 commented 1 month ago

@SeanIsYoung do you get a similar issue if you don't input any multimodal data? Our multimodal models should accept text-only input so you just have to leave out the images.
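For example, a text-only variant of the get_transcribe_chain from the reproduction script would simply drop the image_url entry (a sketch; get_text_only_chain is just an illustrative name):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage
from langchain_core.runnables import Runnable
from langchain_openai import ChatOpenAI

def get_text_only_chain() -> Runnable:
    # Same model and endpoint as get_transcribe_chain above, but the user
    # message carries no image_url block, so the request is text-only.
    chat = ChatOpenAI(model="hello", base_url="http://localhost:8000/v1",
                      api_key="EMPTY", temperature=0.0)
    prompt_template = ChatPromptTemplate.from_messages([
        SystemMessage(content="You are an expert long format writer."),
        ("user", "Write me a really really long and beautiful description of a lakeside boardwalk, it should be at least 4000 words."),
    ])
    return prompt_template | chat
```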

SeanIsYoung commented 1 month ago

@DarkLight1337 Using the same model and text-only input I can't get it to do this. The async and staggered versions both appear to have similar memory usage.

I'm also unable to push it to the point that the GPU starts using its shared memory; the KV cache runs out of space first, limiting the number of active calls, and I have that set as high as it will allow on my system.

comaniac commented 1 month ago

Based on the command, I don't think multi-step scheduling is enabled, and AFAIK the async output processor is disabled when enforcing eager mode. The huge sampling time in the profile may not be accurate, because we launch CUDA kernels in a non-blocking fashion and the sampler has a sync point (aka a barrier).

For "3 manually started synchronous instances", did you launch 3 endpoints and adjust the GPU memory utilization to about 30% per instance, or you have 3 GPUs?

In the case where the staggered version works: I have the impression that the vLLM request queue is not that efficient (probably due to the single-threaded asyncio event loop), so I implemented another queue outside vLLM to prevent my script from sending too many requests at once.
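As an illustration only (the names here are made up, not from an actual implementation), that kind of client-side throttle can be a small pool of workers pulling from an asyncio.Queue, so the server never sees more than num_workers requests in flight at once:

```python
import asyncio

async def _worker(queue: asyncio.Queue, chain) -> None:
    # Each worker handles one request at a time, so at most `num_workers`
    # requests are ever in flight against the vLLM server.
    while True:
        payload = await queue.get()
        try:
            await chain.ainvoke(input=payload)
        finally:
            queue.task_done()

async def run_throttled(chain, payloads, num_workers: int = 3) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for payload in payloads:
        queue.put_nowait(payload)
    workers = [asyncio.create_task(_worker(queue, chain)) for _ in range(num_workers)]
    await queue.join()   # wait until every payload has been processed
    for w in workers:
        w.cancel()       # workers loop forever, so cancel them explicitly
```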

SeanIsYoung commented 1 month ago

For "3 manually started synchronous instances", did you launch 3 endpoints and adjust the GPU memory utilization to about 30% per instance, or you have 3 GPUs?

No, sorry if I didn't make that clear. All of them were run with only a single instance of the vLLM server running.

What I meant here was that I had a script that would iterate through a directory, making calls to vLLM one at a time for each image (the script itself having no async or anything, just a for loop). I then opened another two terminal windows and started another two instances of the script pointing at different directories.

comaniac commented 1 month ago

I see, so the 3 sync instances are actually 3 processes sending requests, each of them sending requests sequentially. A more common term for this use case is "concurrency". Then it's hard for me to compare your async and sync setups, because from the engine's point of view the only difference is the QPS (queries per second). I suspect your issue comes from the overhead caused by too high a QPS. One thing you can try:

  1. Launch a server. Have 3 processes sending requests sequentially. Monitor the server log and observe Running: xx reqs and Pending: yy reqs.
  2. Re-launch the server. Have 3 processes sending requests asynchronously. Monitor the server log and observe Running: aa reqs and Pending: bb reqs.

Then you can compare these values over time, where the number of running requests is the batch size and the number of pending requests is the queue length. Basically, we need to know which number changes significantly when moving from sync to async.
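If reading the counters off the log by eye is awkward, one alternative is to poll the server's Prometheus /metrics endpoint while the tests run; a rough sketch (it assumes the gauge names vllm:num_requests_running and vllm:num_requests_waiting, which may vary between versions):

```python
import re
import time
import requests

METRICS_URL = "http://localhost:8000/metrics"
# Gauge names assumed from vLLM's Prometheus metrics; labels (if any) are ignored.
PATTERN = re.compile(
    r"^(vllm:num_requests_(?:running|waiting))(?:\{[^}]*\})?\s+([0-9.eE+-]+)", re.M
)

while True:
    body = requests.get(METRICS_URL).text
    values = {name: float(value) for name, value in PATTERN.findall(body)}
    # running ~ batch size, waiting ~ queue length
    print(time.strftime("%H:%M:%S"), values)
    time.sleep(1)
```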

SeanIsYoung commented 1 month ago

Yeah, you're right, in the end there's not much difference between the 3 sync, staggered async and async setups. As far as the server is concerned it still ends up running 3 requests concurrently; it's just a question of whether it is passed the 3 requests all at the same time or staggered (as an aside, the 3 sync setup is essentially the same as the staggered one, it's just that the delay is the time it takes me to swap to a different terminal and start up the next instance).

Consequently, the only difference in the running group is that you might see it at 1 or 2 initially when the calls are staggered, but it will still eventually end up running the 3 requests concurrently; they just won't have started at the same time. From what I've seen, the only time anything is put into the pending queue is when the KV cache is full; otherwise requests go straight to the running group.

I don't know enough about how vLLM works under the hood to be certain, but I'm guessing there are a couple of things combining here. Firstly, when I send the requests asynchronously / all at the same time, I'm guessing that all the preprocessing is done at the same time, using the same shared resource space, and because it happens at the same time this resource space grows larger than when the requests are staggered. This guess comes from two observations: for the same number of requests, the VRAM usage is higher when the calls are completely async than when they are staggered; and in speedscope it looks as though all the preprocessing is done in one go in the async case, whereas the staggered case seems to have a separate preprocessing call for each request (this could be wrong due to monitoring an async process, but I don't know).

Staggered: [screenshot]

Async: [screenshot]

Having run it with 0.6.3, where memory usage is a bit lower, I saw that I could run the requests completely asynchronously with no slow-down, so long as there were few enough requests to keep the memory requirements within VRAM. But when those requests push the memory usage into the shared memory it starts to slow down (which seems reasonable / plausible given that the RAM is slower and may require copying data between it and the GPU).

But when you go back to making fewer requests the slow-down is still present, indicating that either some flag has been set, or maybe that now the cache has expanded into the shared memory it won't shrink back out of it.

But I could also be entirely wrong.
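As a rough way to log that VRAM growth over time rather than watching it by eye, here is a sketch using the nvidia-ml-py package already present in the environment dump; note it only reports dedicated device memory, not the WSL "shared GPU memory" spillover:

```python
import time
import pynvml  # provided by the nvidia-ml-py package listed in the environment above

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Print dedicated VRAM usage once per second while the test scripts run.
while True:
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"{time.strftime('%H:%M:%S')} "
          f"used={mem.used / 2**20:.0f} MiB / total={mem.total / 2**20:.0f} MiB")
    time.sleep(1)
```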

comaniac commented 1 month ago

So your "sync" is not really "sync"...it's really confusing.

Then what I can think of, in summary: batch size 3 has lower throughput than batch size 1. When you send requests with some delay (you called it "staggered"), vLLM prefills (and encodes images) one request at a time, while when you send requests all at once (you called it "async"), vLLM processes them together (batch size = 3). If your requests have a long prompt length, this may be the case. For text models we solved this problem with chunked prefill, but that is not merged yet for MM models.
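For reference, on the text-model side chunked prefill is just an engine option; a minimal sketch with the offline LLM API (the model name is purely illustrative, and as noted above this does not yet cover the MM models in this issue):

```python
from vllm import LLM, SamplingParams

# enable_chunked_prefill splits long prompts into chunks so prefill and decode
# can be batched together; at this point it applies to text models only.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_chunked_prefill=True)
outputs = llm.generate(["A very long prompt ..."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```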

SeanIsYoung commented 1 month ago

So your "sync" is not really "sync"...it's really confusing.

Sorry about that, I was talking in terms of the individual scripts.

> Then what I can think of, in summary: batch size 3 has lower throughput than batch size 1. When you send requests with some delay (you called it "staggered"), vLLM prefills (and encodes images) one request at a time, while when you send requests all at once (you called it "async"), vLLM processes them together (batch size = 3). If your requests have a long prompt length, this may be the case. For text models we solved this problem with chunked prefill, but that is not merged yet for MM models.

I think this makes sense.