Open · WoosukKwon opened 2 months ago
@jpvillam-amd Could you please take a look?
@WoosukKwon I just built a new Docker image from the latest vLLM code, and I got comparable throughput on Llama-2-70B. The tokenizer messages can be turned off:
TOKENIZERS_PARALLELISM=false python3 /app/vllm/benchmarks/benchmark_throughput.py --dataset "$dataset_path" --model "$model_path" -tp 4 --enforce-eager
After some research, here is the explanation: the first run of the vLLM benchmarking script triggers kernel autotuning and compilation, and that one-time cost is counted in the total time, which lowers the reported throughput. Subsequent benchmarking runs give good numbers.
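The effect described above is easy to reproduce in isolation. The sketch below uses a toy stand-in for a JIT-compiled kernel (the class and its costs are invented for illustration, not vLLM code): the first timed loop pays the one-time "compile" cost and the second does not, which is why only warmed-up runs reflect steady-state throughput.

```python
import time

class FakeKernel:
    """Toy stand-in for an autotuned/JIT-compiled kernel: the first
    call pays a one-time 'compilation' cost, later calls are fast."""
    def __init__(self, compile_cost=0.2, run_cost=0.001):
        self.compiled = False
        self.compile_cost = compile_cost
        self.run_cost = run_cost

    def __call__(self):
        if not self.compiled:
            time.sleep(self.compile_cost)  # autotune/compile once
            self.compiled = True
        time.sleep(self.run_cost)          # steady-state work

kernel = FakeKernel()

def timed(n):
    """Wall-clock time for n kernel invocations."""
    start = time.perf_counter()
    for _ in range(n):
        kernel()
    return time.perf_counter() - start

cold = timed(10)  # includes the one-time compile cost
warm = timed(10)  # steady-state only
print(cold > warm)  # → True
```

This is the same reason benchmark harnesses usually run a few warm-up iterations before starting the clock.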
Your current environment
🐛 Describe the bug
I ran benchmark_throughput.py and found that it printed the same tokenizer warning repeatedly. The warning did not appear when using CK FlashAttention (selected by setting VLLM_USE_TRITON_FLASH_ATTN=0). Also, throughput with Triton FA was much lower than with CK FA. I suspect Triton compiles or auto-tunes the kernel repeatedly for some reason.
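For reference, the toggle above can be mirrored in a small helper. This is a hedged sketch of the assumed semantics (any value other than an explicit opt-out enables the Triton kernel); vLLM's actual environment-variable parsing may differ, and the function name is invented for illustration.

```python
import os

def use_triton_flash_attn(env=None):
    """Return True when the Triton FlashAttention path would be used,
    assuming '0'/'false' opt out and anything else (or unset) opts in."""
    if env is None:
        env = os.environ
    value = env.get("VLLM_USE_TRITON_FLASH_ATTN", "1")
    return value.lower() not in ("0", "false")

print(use_triton_flash_attn({}))                                   # → True
print(use_triton_flash_attn({"VLLM_USE_TRITON_FLASH_ATTN": "0"}))  # → False (CK FA path)
```

Running the benchmark once with the variable set to 0 and once unset is a quick way to A/B the two attention backends.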