mlc-ai / llm-perf-bench


vLLM upgrade to CUDA 12 #35


sh1ng commented 9 months ago

vLLM now uses CUDA 12.

Also, I can't reproduce your results on an RTX 3090:

mlc-llm

```
Statistics:
----------- prefill -----------
throughput: 218.2 tok/s
total tokens: 7 tok
total time: 0.0 s
------------ decode ------------
throughput: 170.7 tok/s
total tokens: 256 tok
total time: 1.5 s
```
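
For context, these numbers have the shape of MLC's chat statistics output. A minimal sketch of how one might collect them with the `mlc_chat` Python API (the model name and prompt here are placeholders, not the exact setup used above):

```python
from mlc_chat import ChatModule

# Model name is a placeholder; point this at your locally compiled model.
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

# Run one generation; output length is governed by the model's chat config.
output = cm.generate(prompt="What is the meaning of life?")

# stats() reports prefill/decode throughput for the last generation.
print(cm.stats())
```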

vLLM (using a 4-bit AWQ model)

```
Avg latency: 1.4600699121753375 seconds
Speed: 175.33 tok/s
Speed: 0.00570 s/tok
```
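
A rough sketch of how these numbers can be reproduced with vLLM's Python API (the AWQ checkpoint name, prompt, and token count are assumptions, not the exact benchmark configuration):

```python
import time

from vllm import LLM, SamplingParams

# Checkpoint name is a placeholder for whichever 4-bit AWQ model was used.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

# Force a fixed 256-token completion to mirror the decode run above.
params = SamplingParams(max_tokens=256, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(["What is the meaning of life?"], params)
elapsed = time.perf_counter() - start

# Note: elapsed time includes the (short) prefill as well as decode.
n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"Avg latency: {elapsed} seconds")
print(f"Speed: {n_tokens / elapsed:.2f} tok/s")
print(f"Speed: {elapsed / n_tokens:.5f} s/tok")
```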