Bihan opened 5 days ago
Examining the results, I found that all of the errors are:

`Never received a valid chunk to calculate TTFT. This response will be marked as failed!`

I have attached the result. I assume the issue is related to this commit.
I did a quick test with the latest release, v0.6.3.post1, on 1x A6000 with Llama 3.1-8B, and 100% of requests succeed.
Your current environment
I am getting a very low rate of successful requests even with a low QPS of 0.2, and there are no errors in the logs.
The vllm serve and engine arguments I used:

vllm serve meta-llama/Llama-3.1-405B-FP8 --download-dir /root/.cache --tensor-parallel-size 8 --max-num-batched-tokens 1024 --max-num-seqs 1024 --max-seq-len-to-capture 8192 --num-scheduler-step 15 --max-model-len 33344
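One way to narrow this down is to check, outside the benchmark, whether the server streams any chunks at all, e.g. with a streaming request to the OpenAI-compatible endpoint (the port 8000 and `max_tokens` value are assumptions; the model name matches the serve command above):

```shell
# Sanity check: does the server emit streamed chunks at all?
# -N disables curl's output buffering so chunks print as they arrive.
curl -N http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-405B-FP8", "prompt": "Hello", "max_tokens": 16, "stream": true}'
```

If this prints no `data:` chunks before the connection closes, the problem is on the server side rather than in the benchmark client.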
How would you like to use vllm
I want 100% successful requests.