kuladeephx opened this issue 2 days ago
It was fixed just earlier today by #9160. Can you try pulling the latest main
branch?
I have tried with the latest main branch; it's still the same.
cc @Isotr0py
@kuladeephx I noticed you didn't set max_num_seqs when initializing LLM. Can you check if setting max_num_seqs=2 would solve this?
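Something along these lines (a quick sketch; the other arguments are just placeholders based on your setup):

```python
from vllm import LLM

# Sketch only: cap the number of concurrently scheduled sequences
llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,  # limit concurrent sequences to reduce peak memory
)
```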
Have tried now with max_num_seqs=2, still the same
Does this issue only occur with phi-3.5-vision, or does it also occur with other VLMs? I'm reproducing this issue with the command below, but it will take me some time to run 10k prompts on a T4:
python examples/offline_inference_vision_language.py -m phi3_v --num-prompts 10000
Thanks for trying. So with the above trial, does it apply 10000 prompts with no repeated images? Yes, I noticed a similar pattern with mistralai/Pixtral-12B-2409.
Can you check if other models like Qwen/Qwen2-VL-2B-Instruct and Qwen/Qwen2-VL-7B-Instruct also have this issue?
I'm not sure if this issue relates to composite weight loading, because we fixed a RAM leak bug there previously. We can check with the Qwen2-VL models since they don't use composite weight loading.
Sure, I will try. Is there any previous reference of people running the above-mentioned models with, say, 10000 prompts or more? Just curious if anyone has tried at a similar scale.
@Isotr0py, can you please let me know whether you were able to reproduce the issue with 10000 prompts?
No, I can't reproduce this issue with 10k prompts. I didn't observe significant CPU RAM usage increase during the inference. In fact, my CPU RAM usage is mostly about 25% (about 4GB on my 16GB RAM device) all the time after loading the model.
Can you provide a full script that can reproduce this issue?
The memory leak issue is a little bit similar to #8629, since it also occurred after a long run.
Can you check if the solution provided in https://github.com/vllm-project/vllm/issues/8629#issuecomment-2363312483 works?
@Isotr0py, I have tried a similar implementation of clearing the GPU cache and running garbage collection earlier, but it doesn't solve the issue. I am attaching the script and sample data (dummy_image_data.zip); you can try the experiment with total_no_of_prompts = 300 vs. total_no_of_prompts = 600 and see the difference. Additional packages that might be required are requests and memory_profiler. I am attaching memory profile screenshots for the total_no_of_prompts = 300 vs. 600 cases. The experiment is carried out by passing 150 prompts at a time to the LLM class.
Link for script - https://github.com/kuladeephx/sample/blob/main/multimodal_sample.py
Link for data - https://github.com/kuladeephx/sample/blob/main/dummy_image_data.zip
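Roughly what I tried between batches was along these lines (a minimal sketch of the cleanup, which did not help):

```python
import gc
import torch

def clear_memory():
    # Force Python garbage collection and release cached GPU memory blocks
    gc.collect()
    torch.cuda.empty_cache()
```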
Thanks, I can reproduce this issue with the provided script. Let me use vLLM's profiler to see what's wrong.
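For reference, the kind of memory profiling I have in mind looks roughly like this (a minimal sketch using torch.profiler directly rather than anything vLLM-specific; llm, prompts, and sampling_params are assumed to come from the repro script):

```python
from torch.profiler import ProfilerActivity, profile

# Record CPU memory allocations around a single generate() call
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    llm.generate(prompts, sampling_params=sampling_params)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=20))
```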
@Isotr0py, any lead on the issue?
I guess it's Phi-3-Vision's image embedding causing the memory leak. I observed a high CPU memory allocation around aten::cat (which is widely used in the phi3v image embedding as well):
I have noticed a similar memory increase with Pixtral-12B-2409 as well, so I'm wondering if this has to do with vLLM itself.
I also profiled the phi3_v forward pass; it seems the memory leak is not related to model forwarding (the memory increase from model forwarding only occurs on the first prefill).
The profiling log: phi3v_forward_profile.log
I'm afraid the source of the memory leak comes from other parts of llm_engine instead :(
(I also noticed a similar memory increase on other models besides phi3_v and pixtral.)
Thanks for the analysis. In that case, does this seem to affect all multimodal models? Do you think the issue can be identified?
Also, what's the difference between LLM.generate and LLM.chat (excluding its use as a chat interface)? I tried the same set with LLM.chat and noticed memory consumption is lower compared to LLM.generate.
At least in my test, other multimodal models like InternVL2 also encountered this issue. For example, here is the memory profile with InternVL2-1B:
Since there is no memory leak from model forwarding, the source of the leaking memory could lie in several places (the scheduler or block_manager, etc.). I'm afraid I can't help much with this, because I'm not very familiar with those implementations.
> Also, what's the difference between LLM.generate and LLM.chat (excluding its use as a chat interface)? I tried the same set with LLM.chat and noticed memory consumption is lower compared to LLM.generate.
This is interesting, because LLM.chat just applies chat_template before calling LLM.generate.
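Roughly speaking, the two calls should be equivalent apart from the template step, something like this (a minimal sketch; the exact Phi-3 template string is only illustrative):

```python
from vllm import SamplingParams

sampling_params = SamplingParams(max_tokens=100, temperature=0.0)

# LLM.chat: pass OpenAI-style messages; the chat template is applied for you
messages = [{"role": "user", "content": "Describe the image."}]
outputs_chat = llm.chat(messages, sampling_params=sampling_params)

# LLM.generate: pass the already-templated prompt string yourself
prompt = "<|user|>\nDescribe the image.<|end|>\n<|assistant|>\n"
outputs_gen = llm.generate([prompt], sampling_params=sampling_params)
```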
Thanks. Do you think there is anyone else who can help with this issue?
Proposal to improve performance
I am trying to run the Phi-3.5-vision-instruct model with around 10k prompts. What I noticed is that as the number of prompts increases, my CPU RAM consumption keeps increasing and eventually the process gets killed. It runs fine for a small sample like 1000 prompts. My system configuration is 48 GB VRAM and 64 GB CPU RAM. I noticed a similar pattern with Pixtral-12B-2409. Has anyone faced this issue?
I have tried the implementation by passing batches of 1000 to llm.generate, but the CPU RAM still keeps increasing. Below is the code implementation; I'm using two images per prompt.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    gpu_memory_utilization=0.7,
    trust_remote_code=True,
    max_model_len=4096,
    limit_mm_per_prompt={"image": 4},
)

sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
outputs = llm.generate(prompt_list, sampling_params=sampling_params)
```
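The batched variant I tried is roughly this (a sketch; batch size 1000 as described above):

```python
# Feed prompt_list to llm.generate in chunks instead of all at once
batch_size = 1000
all_outputs = []
for i in range(0, len(prompt_list), batch_size):
    batch = prompt_list[i:i + batch_size]
    all_outputs.extend(llm.generate(batch, sampling_params=sampling_params))
```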