Open sh1ng opened 1 year ago
vLLM now uses CUDA 12.
Also, I can't confirm your results on an RTX 3090.
mlc-llm
```
Statistics:
----------- prefill -----------
throughput: 218.2 tok/s
total tokens: 7 tok
total time: 0.0 s
------------ decode ------------
throughput: 170.7 tok/s
total tokens: 256 tok
total time: 1.5 s
```
vLLM (when using a 4-bit AWQ model)
```
Avg latency: 1.4600699121753375 seconds
Speed: 175.33 tok/s
Speed: 0.00570 s/tok
```
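For reference, these throughput figures are just tokens divided by wall time, and the two vLLM "Speed" lines are reciprocals of each other. A quick sanity check in Python (using the values reported above) shows the numbers are internally consistent:

```python
# Decode throughput = total tokens / total time (mlc-llm decode figures above)
decode_tok_per_s = 256 / 1.5
print(f"mlc-llm decode: {decode_tok_per_s:.1f} tok/s")  # ~170.7 tok/s, matching the report

# vLLM reports speed both as tok/s and as its reciprocal, s/tok
vllm_tok_per_s = 175.33
print(f"vLLM per-token latency: {1 / vllm_tok_per_s:.5f} s/tok")  # ~0.00570 s/tok
```

So on this card the two engines are within a few percent of each other at decode time, despite the very different reporting formats.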