lance0108 opened 1 month ago
Hi!
I have a related request regarding the Phi-3-128k model. Specifically, the sliding window is disabled for the 128k model. According to gugarosa from Microsoft, "sliding_window is not supported by the LongRoPE implementation according to the authors."
However, when launching Phi-3 with Docker, I observed the message that vLLM "Cannot use flash attention-2 backend due to sliding window".
Is there any way to re-enable flash attention? Thanks!
[Edit: Just found out that we only need to add --disable_sliding_window to re-enable flash attention!]
Here is the (edited) command I used to launch the Docker container:

```bash
sudo docker run --runtime nvidia --gpus all \
    -v ~/nlp/storage/hf_cache:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<???>" \
    -p 8001:8001 --ipc=host vllm/vllm-openai:latest \
    --model microsoft/Phi-3-vision-128k-instruct \
    --trust-remote-code \
    --max-model-len 8192 \
    --port 8001 \
    --disable_sliding_window
```
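For reference, here is a minimal sketch of how the endpoint can then be queried with the OpenAI Python client. The base URL, API key placeholder, and prompt are my assumptions, not part of the original launch command:

```python
# Minimal request sketch against the server launched above.
# Assumes it is reachable at http://localhost:8001/v1 (per -p 8001:8001 and --port 8001)
# and that the model name matches the --model flag; the prompt is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    messages=[{"role": "user", "content": "Summarize the benefits of flash attention."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```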
@lance0108 could you please try your analysis again with --disable_sliding_window set?
Thanks!
Just reran the data with --disable_sliding_window on the vLLM server. Same issue.
@lance0108 how are you making recurrent requests? I'm sending in 64 requests at a rate of 16 per second and am not seeing issues with the vLLM endpoint. I have used:
--disable_sliding_window --trust-remote-code
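In case it helps to compare, my request pattern is roughly the following sketch (asyncio with the AsyncOpenAI client; the endpoint URL, model name, and prompts are placeholders rather than the exact script I ran):

```python
# Sketch of the load pattern described above: 64 requests at roughly 16 per second.
# Endpoint URL, model name, and prompt contents are assumptions, not from this thread.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",
        messages=[{"role": "user", "content": f"Request {i}: say hello."}],
        max_tokens=32,
    )
    return resp.choices[0].message.content

async def main() -> None:
    tasks = []
    for i in range(64):
        tasks.append(asyncio.create_task(one_request(i)))
        await asyncio.sleep(1 / 16)  # throttle submissions to ~16 requests per second
    results = await asyncio.gather(*tasks)
    print(f"Received {len(results)} responses")

asyncio.run(main())
```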
Your current environment

I'm not able to run collect_env.py on this workstation.

vllm == 0.5.1
vllm-flash-attn == 2.5.9
torch == 2.3.0

Tested on a single A100-80GB.

The following message was observed:
🐛 Describe the bug
Issue: incorrect responses from `LLM.generate` when run with `max_num_seqs=64`. I was not able to reproduce this using single calls in Scenario 2; with a single call to the server, all responses seemed correct.
My speculation is that the batching computation for Phi-3 is incorrect.
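For context, a batched offline call along the lines described would look roughly like the sketch below. Only `LLM.generate` and `max_num_seqs=64` come from this report; the model name, `trust_remote_code`, `disable_sliding_window`, prompts, and sampling parameters are assumptions based on the rest of this thread:

```python
# Sketch of a batched offline-generation call like the one described above.
# Prompts and sampling parameters are placeholders; max_num_seqs=64 is the
# batching setting under which the incorrect outputs were reported.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,         # assumed, matching the server launch flags
    max_model_len=8192,             # assumed, matching the server launch flags
    max_num_seqs=64,
    disable_sliding_window=True,    # assumed, matching the server launch flags
)

prompts = [f"Question {i}: briefly describe photosynthesis." for i in range(64)]
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```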