Zwc2003 closed this issue 6 months ago
Thanks for your inquiry! The CUDA and PyTorch versions significantly influence decoding speed.
For Speculative Decoding, we recommend focusing on '#mean accepted tokens' as a more reliable metric for comparisons across devices and environments. Tokens/s and Speedup are reference metrics, meaningful only when comparing different methods on the same device and in the same test environment.
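To illustrate the distinction, here is a minimal sketch of how the three metrics relate. It assumes '#mean accepted tokens' means the average number of tokens accepted per decoding step; the function names and the sample numbers are hypothetical and are not taken from the benchmark code.

```python
def mean_accepted_tokens(accepted_per_step):
    """Average number of tokens accepted per decoding step (assumed definition)."""
    if not accepted_per_step:
        raise ValueError("need at least one decoding step")
    return sum(accepted_per_step) / len(accepted_per_step)


def tokens_per_second(num_new_tokens, wall_time_s):
    """Throughput; depends on hardware, CUDA and PyTorch versions."""
    return num_new_tokens / wall_time_s


def speedup(method_tps, baseline_tps):
    """Ratio of two throughputs measured on the same device and environment."""
    return method_tps / baseline_tps


# Hypothetical run: tokens accepted at each of five decoding steps.
accepted = [3, 1, 4, 2, 2]

# The same decoding trace timed in two hypothetical environments.
tps_env_a = tokens_per_second(sum(accepted), 0.25)
tps_env_b = tokens_per_second(sum(accepted), 0.5)

# Throughput differs between environments, while the mean accepted
# tokens is computed from the trace alone and stays the same.
print(mean_accepted_tokens(accepted), tps_env_a, tps_env_b)
```

In this sketch the wall-clock time changes with the environment, so tokens/s changes with it, whereas the mean accepted tokens is derived only from the accepted-token counts.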
This issue was closed because it has been inactive for 7 days. If there are any other questions, please open a new issue or send me an email.
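We tested on a 4090 and an L40, and the autoregressive tokens/s exceeded 50 on both, which does not seem reasonable. Could this be caused by a difference in CUDA versions?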