Open Jester6136 opened 3 months ago
Hello @Jester6136... have you made any progress on this issue? What model are you using?
As a side note, top_k should be set to an integer, since it represents the number of top tokens to sample from.
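For example, in a typical sampling configuration (illustrative values only):

```python
# Illustrative only: top_k counts candidate tokens, so it must be an int;
# top_p is a cumulative probability and stays a float.
sampling_params = {"top_k": 50, "top_p": 0.95, "temperature": 0.8}
```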
@Jester6136 As far as I understand, your method of timing is going to be inaccurate, because GPU inference runs asynchronously from the CPU-executed code you are sharing. This doesn't preclude the possibility that there is a batching bug, but your measurements may be inaccurate as you are currently taking them.

One decent explanation of this: https://towardsdatascience.com/the-correct-way-to-measure-inference-time-of-deep-neural-networks-304a54e5187f
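Roughly, the methodology from that article looks like this when adapted to an ONNX Runtime session (the model path, input name, and shapes below are just placeholders): warm up first, then average over many runs. If you are timing raw PyTorch/CUDA code instead, you also need torch.cuda.synchronize() before reading the clock.

```python
import time

import numpy as np
import onnxruntime as ort

# Placeholder model and inputs -- substitute your own.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
feed = {"input_ids": np.random.randint(0, 1000, size=(8, 128), dtype=np.int64)}

# 1) Warm-up: the first runs pay for CUDA context setup, memory allocation and
#    graph optimization, so keep them out of the measurement.
for _ in range(10):
    session.run(None, feed)

# 2) Time many iterations and report the average, not a single call.
n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    session.run(None, feed)
avg_s = (time.perf_counter() - start) / n_runs
print(f"avg latency: {avg_s * 1000:.2f} ms per batch")
```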
I have implemented an inference API using ONNX Runtime and FastAPI to process multiple prompts in batches, with the goal of improving efficiency. However, I've observed that performance is significantly slower with batching compared to processing each prompt individually. When I set the batch_size back to 1, the API performs optimally.
Here is my code:
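In outline, the endpoint does the following (a trimmed-down sketch; the model path, tokenizer, request schema, and the final decoding step are placeholders rather than the exact code):

```python
from typing import List

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("my-model")  # placeholder model name
session = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider"]
)


class PromptBatch(BaseModel):
    prompts: List[str]
    batch_size: int = 8


@app.post("/generate")
def generate(req: PromptBatch):
    outputs = []
    # Split the incoming prompts into chunks of batch_size and run each chunk
    # through ONNX Runtime in a single call.
    for i in range(0, len(req.prompts), req.batch_size):
        chunk = req.prompts[i : i + req.batch_size]
        # padding=True pads every prompt in the chunk to its longest prompt.
        enc = tokenizer(chunk, return_tensors="np", padding=True)
        feed = {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
        }
        # Assumes the first model output is the logits tensor; the real code
        # does token generation/sampling here instead of a single argmax.
        logits = session.run(None, feed)[0]
        outputs.extend(logits.argmax(-1).tolist())
    return {"outputs": outputs}
```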