lance0108 opened 1 month ago
Hi!
I have a related request regarding the Phi-3-128k model. Specifically, the sliding window is disabled for the 128k model. According to gugarosa from Microsoft, "sliding_window is not supported by the LongRoPE implementation according to the authors."
However, when launching Phi-3 with Docker, I observed the message that vLLM "Cannot use flash attention-2 backend due to sliding window".
Is there any way to re-enable flash attention? Thanks!
[Edit: Just found out that we only need to add --disable_sliding_window to re-enable flash attention!]
Here is the (edited) command I used to launch the Docker container:

```bash
sudo docker run --runtime nvidia --gpus all \
    -v ~/nlp/storage/hf_cache:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<???>" \
    -p 8001:8001 --ipc=host vllm/vllm-openai:latest \
    --model microsoft/Phi-3-vision-128k-instruct \
    --trust-remote-code \
    --max-model-len 8192 \
    --port 8001 \
    --disable_sliding_window
```
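For reference, here is a minimal sketch of how the endpoint can then be queried with the OpenAI Python client. The base URL, API key placeholder, and prompt are my assumptions, not part of the original launch command:

```python
# Minimal request sketch against the server launched above.
# Assumes it is reachable at http://localhost:8001/v1 (per -p 8001:8001 and --port 8001)
# and that the model name matches the --model flag; the prompt is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    messages=[{"role": "user", "content": "Summarize the benefits of flash attention."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```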
@lance0108 could you please try your analysis again with --disable_sliding_window set?
Thanks!
Just reran the data with --disable_sliding_window on the vLLM server. Same issue.
@lance0108 how are you making recurrent requests? I'm sending in 64 requests at a rate of 16 per second and am not seeing issues with the vLLM endpoint. I have used:
--disable_sliding_window --trust-remote-code
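In case it helps to compare, my request pattern is roughly the following sketch (asyncio with the AsyncOpenAI client; the endpoint URL, model name, and prompts are placeholders rather than the exact script I ran):

```python
# Sketch of the load pattern described above: 64 requests at roughly 16 per second.
# Endpoint URL, model name, and prompt contents are assumptions, not from this thread.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",
        messages=[{"role": "user", "content": f"Request {i}: say hello."}],
        max_tokens=32,
    )
    return resp.choices[0].message.content

async def main() -> None:
    tasks = []
    for i in range(64):
        tasks.append(asyncio.create_task(one_request(i)))
        await asyncio.sleep(1 / 16)  # throttle submissions to ~16 requests per second
    results = await asyncio.gather(*tasks)
    print(f"Received {len(results)} responses")

asyncio.run(main())
```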
Your current environment

I'm not able to run collect_env.py on this workstation.

vllm == 0.5.1
vllm-flash-attn == 2.5.9
torch == 2.3.0

Tested on a single A100-80GB.

The following message was observed:
🐛 Describe the bug
Issue: incorrect responses from `LLM.generate` when run with `max_num_seqs=64`. I was not able to reproduce this using single calls in Scenario 2; with a single call to the server, all responses seemed correct.
My speculation is that the batching computation for Phi-3 is incorrect.
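For context, a batched offline call along the lines described would look roughly like the sketch below. Only `LLM.generate` and `max_num_seqs=64` come from this report; the model name, `trust_remote_code`, `disable_sliding_window`, prompts, and sampling parameters are assumptions based on the rest of this thread:

```python
# Sketch of a batched offline-generation call like the one described above.
# Prompts and sampling parameters are placeholders; max_num_seqs=64 is the
# batching setting under which the incorrect outputs were reported.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,         # assumed, matching the server launch flags
    max_model_len=8192,             # assumed, matching the server launch flags
    max_num_seqs=64,
    disable_sliding_window=True,    # assumed, matching the server launch flags
)

prompts = [f"Question {i}: briefly describe photosynthesis." for i in range(64)]
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```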