vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Severe computation errors when batching request for microsoft/Phi-3-mini-128k-instruct #6438

Open lance0108 opened 1 month ago

lance0108 commented 1 month ago

Your current environment

I'm not able to run collect_env.py on this workstation

vllm == 0.5.1
vllm-flash-attn == 2.5.9
torch == 2.3.0

Tested on a single A100-80GB

The following message was observed:

Cannot use flash attention-2 backend due to sliding window
Using XFormers backend

🐛 Describe the bug

Issue:

I was not able to reproduce this using single calls in Scenario 2. With a single call to the server, all responses seemed correct.

My speculation is that the batching computation for Phi-3 is incorrect.
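
To help pin this down, here is a minimal sketch (mine, not from the original report) that compares batched generation against one-prompt-at-a-time generation with vLLM's offline LLM API. The prompts are placeholders, greedy sampling is used so the two paths should produce comparable text, and max_model_len is capped only to keep memory modest.

from vllm import LLM, SamplingParams

# Assumed setup: same model as in the issue title, context capped for memory.
llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    trust_remote_code=True,
    max_model_len=8192,
)
params = SamplingParams(temperature=0.0, max_tokens=128)  # greedy, so runs are comparable

prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "What is 17 * 23?",
    "Name three noble gases.",
]

# One prompt per generate() call.
single = [llm.generate([p], params)[0].outputs[0].text for p in prompts]

# All prompts in a single batched generate() call.
batched = [out.outputs[0].text for out in llm.generate(prompts, params)]

for prompt, s, b in zip(prompts, single, batched):
    if s != b:
        print(f"Mismatch for {prompt!r}\n  single : {s}\n  batched: {b}")

Small numerical differences can show up even without a bug, since batched kernels are not bit-for-bit deterministic; the errors described here sound far more severe than that.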

AndyW-llm commented 1 month ago

Hi!

I have a related request regarding the Phi-3-128k model. Specifically, for the 128k model the sliding window is disabled. According to gugarosa from Microsoft, "sliding_window is not supported by the LongRoPE implementation according to the authors."
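
As a quick check of what the checkpoint actually declares, here is a minimal sketch (not from this thread) that prints the sliding_window and rope_scaling fields from the Hugging Face config; the attribute names follow the Phi-3 config and may vary across transformers versions, and the mini checkpoint from the issue title is used as the example.

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)
# Print the fields vLLM looks at when deciding whether a sliding window is in use.
print("sliding_window:", getattr(cfg, "sliding_window", None))
print("rope_scaling:", getattr(cfg, "rope_scaling", None))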

However, when launching Phi-3 with Docker, I observed the message that vLLM "Cannot use flash attention-2 backend due to sliding window".

Is there any way to re-enable flash attention? Thanks! [Edit: just found out that we only need to add --disable_sliding_window to re-enable flash attention!]

Here is the (edited) command I used to launch the Docker container:

sudo docker run --runtime nvidia --gpus all \
    -v ~/nlp/storage/hf_cache:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<???>" \
    -p 8001:8001 --ipc=host vllm/vllm-openai:latest \
    --model microsoft/Phi-3-vision-128k-instruct \
    --trust-remote-code \
    --max-model-len 8192 \
    --port 8001 \
    --disable_sliding_window

mgoin commented 1 month ago

@lance0108 could you please try your analysis again with --disable_sliding_window set?

lance0108 commented 1 month ago

> @lance0108 could you please try your analysis again with --disable_sliding_window set?

Thanks! Just reran the data with --disable_sliding_window on the vLLM server. Same issue.

RonanKMcGovern commented 1 month ago

@lance0108 how are you making the concurrent requests? I'm sending 64 requests at a rate of 16 per second and not seeing issues with the vLLM endpoint. I have used:

 --disable_sliding_window --trust-remote-code
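
For reference, here is a minimal sketch of one way to generate that kind of load (this is not Ronan's actual script): 64 concurrent requests fired at roughly 16 per second against the OpenAI-compatible endpoint. The base_url, port, and prompts are placeholders.

import asyncio
from openai import AsyncOpenAI

# Assumed endpoint: a vLLM OpenAI-compatible server on localhost:8001.
client = AsyncOpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="microsoft/Phi-3-mini-128k-instruct",
        messages=[{"role": "user", "content": f"Request {i}: what is {i} squared?"}],
        temperature=0.0,
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    tasks = []
    for i in range(64):
        tasks.append(asyncio.create_task(one_request(i)))
        await asyncio.sleep(1 / 16)  # throttle submission to ~16 requests per second
    for i, answer in enumerate(await asyncio.gather(*tasks)):
        print(i, answer)

asyncio.run(main())

Comparing these answers against the same prompts sent one at a time should show whether the corruption only appears under concurrent (batched) load.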