Closed — YangZhou0417 closed this issue 9 months ago
With a longer input length, the prefill-phase latency would be higher. Could you share the model input token count used when obtaining the results in this post?
https://pytorch.org/blog/accelerating-generative-ai-2/?utm_content=273712248&utm_medium=social&utm_source=twitter&hss_channel=tw-776585502606721024
Low, maybe 5 tokens?
I see. It would be nice to benchmark at larger context lengths as well, since the first-token latency can increase significantly with prompt length.
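To make the point concrete, here is a minimal sketch of how one might measure time-to-first-token (TTFT) across different prompt lengths. The `fake_prefill` function is a hypothetical stand-in for a real model call (e.g. `model.generate(..., max_new_tokens=1)` in a real benchmark); only the timing harness pattern is the point.

```python
import time

def time_to_first_token(generate_first_token, prompt_tokens):
    """Time a single prefill + first-token step for one prompt."""
    start = time.perf_counter()
    generate_first_token(prompt_tokens)
    return time.perf_counter() - start

def fake_prefill(prompt_tokens):
    # Hypothetical stand-in: real prefill cost grows with prompt length
    # (attention is quadratic in sequence length, so long prompts dominate TTFT).
    time.sleep(len(prompt_tokens) * 1e-5)

# Compare a tiny prompt (like the ~5 tokens mentioned above) to longer contexts.
for n in (5, 512, 2048):
    prompt = list(range(n))
    latency = time_to_first_token(fake_prefill, prompt)
    print(f"{n:5d} tokens -> TTFT {latency * 1000:.1f} ms")
```

Swapping the stand-in for an actual model call would show how much the 5-token measurement understates first-token latency at realistic context lengths.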