vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: phi 3.5 vision model consuming high CPU RAM and the process getting killed #9190

Open kuladeephx opened 2 days ago

kuladeephx commented 2 days ago

Proposal to improve performance

I am trying to run the Phi-3.5-vision-instruct model with around 10k prompts. What I noticed is that as the number of prompts increases, my CPU RAM consumption keeps growing and eventually the process gets killed. It runs fine for a small sample, say 1000 prompts. My system configuration is 48 GB VRAM and 64 GB CPU RAM. I noticed a similar pattern with Pixtral-12B-2409. Has anyone faced this issue?

I have also tried passing the prompts to llm.generate in batches of 1000, but the CPU RAM still keeps increasing. Below is my implementation; I am using two images per prompt:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    gpu_memory_utilization=0.7,
    trust_remote_code=True,
    max_model_len=4096,
    limit_mm_per_prompt={"image": 4},
)
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
outputs = llm.generate(prompt_list, sampling_params=sampling_params)
```
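The batching loop itself is not shown in the snippet; below is a sketch of the batched submission described above, reusing llm, sampling_params, and prompt_list from the snippet (the loop is a paraphrase, not the original code):

```python
# Sketch of the batched submission described above: 1000 prompts per call.
batch_size = 1000
outputs = []
for start in range(0, len(prompt_list), batch_size):
    batch = prompt_list[start:start + batch_size]
    outputs.extend(llm.generate(batch, sampling_params=sampling_params))
```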

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)


DarkLight1337 commented 2 days ago

It was fixed just earlier today by #9160. Can you try pulling the latest main branch?

kuladeephx commented 2 days ago

I have tried with the latest main branch; it's still the same.

DarkLight1337 commented 2 days ago

cc @Isotr0py

Isotr0py commented 2 days ago

@kuladeephx I noticed you didn't set max_num_seqs when initializing LLM. Can you check whether setting max_num_seqs=2 solves this?
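A sketch of the suggested change, reusing the constructor from the original snippet (only the max_num_seqs argument is new):

```python
from vllm import LLM

# Same setup as before, with the suggested cap on concurrently scheduled sequences.
llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    gpu_memory_utilization=0.7,
    trust_remote_code=True,
    max_model_len=4096,
    limit_mm_per_prompt={"image": 4},
    max_num_seqs=2,  # limit the number of sequences processed concurrently
)
```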

kuladeephx commented 2 days ago

I have now tried with max_num_seqs=2; it's still the same.

Isotr0py commented 2 days ago

Does this issue only occur with Phi-3.5-vision, or does it also occur with other VLMs? I'm reproducing this issue with the command below, but it will take some time for me to run 10k prompts on a T4:

```
python examples/offline_inference_vision_language.py -m phi3_v --num-prompts 10000
```

kuladeephx commented 2 days ago

Thanks for trying. For the trial above, does it use 10000 prompts with no repeated images? And yes, I noticed a similar pattern with mistralai/Pixtral-12B-2409.

Isotr0py commented 2 days ago

Can you check if other models like Qwen/Qwen2-VL-2B-Instruct and Qwen/Qwen2-VL-7B-Instruct also have this issue?

I'm not sure whether this issue is related to composite weight loading, because we previously fixed a RAM leak there. We can check with the Qwen2-VL models, since they don't use composite weight loading.

kuladeephx commented 2 days ago

Sure, I will try. Is there any prior reference of people running the above-mentioned models at this scale, say 10000 prompts or more? Just curious whether anyone has tried something similar.

kuladeephx commented 1 day ago

@Isotr0py, can you please let me know whether you were able to reproduce the issue with 10000 prompts?

Isotr0py commented 1 day ago

No, I can't reproduce this issue with 10k prompts. I didn't observe a significant CPU RAM usage increase during inference; in fact, my CPU RAM usage stays at about 25% (about 4 GB on my 16 GB RAM device) the whole time after loading the model.

Can you provide a full script that can reproduce this issue?

Isotr0py commented 1 day ago

This memory leak issue is somewhat similar to #8629, since that one also occurred after long-running inference.

Can you check whether the solution provided in https://github.com/vllm-project/vllm/issues/8629#issuecomment-2363312483 works?
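For reference, a minimal sketch of the kind of per-batch cleanup that workaround amounts to, as described later in this thread (forcing Python garbage collection and releasing cached CUDA memory); this is a paraphrase, not the exact code from that comment:

```python
import gc

import torch

def cleanup_between_batches():
    # Drop unreferenced Python objects, then return cached CUDA blocks to
    # the driver so the allocator's pools don't keep growing.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```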

kuladeephx commented 1 day ago

@Isotr0py, I tried a similar approach of clearing the GPU cache and running garbage collection earlier, but it doesn't solve the issue. I am attaching the script and sample data (dummy_image_data.zip); you can run the experiment with total_no_of_prompts = 300 vs. total_no_of_prompts = 600 and see the difference. Additional packages that might be required are requests and memory_profiler. I am also attaching memory-profile screenshots for the total_no_of_prompts = 300 vs. 600 cases. The experiment passes 150 prompts at a time to the LLM class.

Link for script: https://github.com/kuladeephx/sample/blob/main/multimodal_sample.py
Link for data: https://github.com/kuladeephx/sample/blob/main/dummy_image_data.zip
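For reference, a minimal sketch of how memory_profiler can be attached to a per-batch call (assumed usage; the actual linked script may differ):

```python
# pip install memory_profiler
from memory_profiler import profile

@profile  # prints a line-by-line CPU RSS report when the function returns
def run_batch(llm, prompts, sampling_params):
    return llm.generate(prompts, sampling_params=sampling_params)
```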

[Memory profile screenshots: 300_prompts, 600_prompts]

Isotr0py commented 1 day ago

Thanks, I can reproduce this issue with the provided script. Let me use vLLM's profiler to see what's wrong.

kuladeephx commented 21 hours ago

@Isotr0py, any lead on the issue?

Isotr0py commented 21 hours ago

I guess it's Phi-3-Vision's image embedding causing the memory leak. I observed high CPU memory allocation around aten::cat (which is widely used in the phi3v image embedding as well):

[Profiler screenshot: 20241011131111]
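A minimal sketch of how per-operator CPU memory can be inspected with torch.profiler (an assumption about the tooling; the comment above refers to vLLM's profiler, and run_inference below is a hypothetical stand-in for the real workload):

```python
import torch
from torch.profiler import ProfilerActivity, profile

def run_inference():
    # Hypothetical stand-in for one generation step; it simply exercises
    # aten::cat the way an image-embedding path might.
    chunks = [torch.randn(64, 1024) for _ in range(32)]
    return torch.cat(chunks, dim=0)

# Record per-operator CPU memory; operators such as aten::cat show up in
# the table with their self CPU memory usage.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    run_inference()

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
```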

kuladeephx commented 20 hours ago

I have noticed a similar memory increase with Pixtral-12B-2409 as well, so I'm wondering whether this has to do with vLLM itself.

Isotr0py commented 17 hours ago

I also profiled the phi3_v forward pass; it seems the memory leak is not related to model forwarding (the memory increase from model forwarding only occurs on the first prefill).

The profiling log: phi3v_forward_profile.log

I'm afraid the source of the memory leak comes from other parts of the llm_engine instead :( (I also noticed a similar memory increase with other models besides phi3_v and Pixtral).

kuladeephx commented 16 hours ago

Thanks for the analysis. In that case, does this seem to affect multimodal models in general? Do you think the issue can be identified?

Also, what's the difference between LLM.generate and LLM.chat (aside from the chat-model use case)? I tried the same set with LLM.chat and noticed that memory consumption is lower than with LLM.generate.

Isotr0py commented 11 hours ago

At least in my tests, other multimodal models like InternVL2 also encounter this issue. For example, here is the memory profile with InternVL2-1B:

[Memory profile screenshot: internvl]

Since there is no memory leak from the model forwarding, the leaking memory could come from several other places (the scheduler, block_manager, etc.). I'm afraid I can't help much with this, because I'm not very familiar with those implementations.

> Also, what's the difference between LLM.generate and LLM.chat (aside from the chat-model use case)? I tried the same set with LLM.chat and noticed that memory consumption is lower than with LLM.generate.

This is interesting, because LLM.chat just applies the chat template before calling LLM.generate.
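A hedged sketch of that relationship, reusing llm and sampling_params from the snippet earlier in this issue (the hand-written prompt string is an assumed Phi-3.5-style template, shown only to illustrate what the chat template would produce):

```python
# LLM.chat: pass an OpenAI-style conversation; vLLM applies the model's
# chat template and then runs the same generation path as LLM.generate.
messages = [{"role": "user", "content": "Describe the weather in one sentence."}]
chat_outputs = llm.chat(messages, sampling_params=sampling_params)

# LLM.generate: pass an already-formatted prompt string yourself.
prompt = "<|user|>\nDescribe the weather in one sentence.<|end|>\n<|assistant|>\n"
gen_outputs = llm.generate(prompt, sampling_params=sampling_params)
```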

kuladeephx commented 8 hours ago

Thanks. Do you think there is anyone else who could help with this issue?