Closed MichaelMcCulloch closed 4 months ago
Makes sense - thanks for the input and for pointing this out.
Thank you. Can you explain why there is no speedup beyond a batch size of 128?
You could check out e.g. https://github.com/michaelfeil/infinity/tree/main/docs/benchmarks - for long enough requests, a batch_size of 32 already saturates the GPU to a large extent, so adding more items along the batch dimension brings little further benefit from vectorization. On CPU, even with AVX-512 instructions, you are unlikely to see a meaningful speedup beyond a batch size of 4 or 8.
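You can get a feel for this saturation effect with a quick sketch. The snippet below is not infinity's benchmark code - it uses a numpy matmul as a hypothetical stand-in for a model forward pass and reports per-item latency at a few batch sizes; on most hardware the per-item cost drops sharply at first and then flattens out.

```python
import time
import numpy as np

def per_item_latency(batch_size: int, dim: int = 1024, iters: int = 20) -> float:
    """Time a dummy 'forward pass' (a matmul) and return seconds per item.

    Illustrative stand-in for an embedding model, not infinity's real benchmark.
    """
    x = np.random.rand(batch_size, dim).astype(np.float32)
    w = np.random.rand(dim, dim).astype(np.float32)
    start = time.perf_counter()
    for _ in range(iters):
        _ = x @ w  # the 'model' call whose cost we amortize over the batch
    elapsed = time.perf_counter() - start
    return elapsed / (iters * batch_size)

if __name__ == "__main__":
    for bs in (1, 8, 32, 128, 256):
        print(f"batch_size={bs:4d}  per-item latency: {per_item_latency(bs):.2e} s")
```

The exact knee of the curve depends on your hardware; the real benchmark scripts in the repo measure the actual model, which is what matters here.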
Feel free to benchmark it (with the included benchmark scripts) and report the results here - I would love to learn from them as well!
@MichaelMcCulloch Do you want to PR a fix?
Reproduction:
Expected: a list of model dictionaries:
Actual: a single dictionary:
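To make the mismatch concrete, here is a sketch of the two shapes. The expected shape follows the OpenAI-compatible `GET /v1/models` response (an `{"object": "list", "data": [...]}` wrapper); the model id and `owned_by` value below are hypothetical placeholders, not taken from the issue.

```python
# Expected: OpenAI-compatible list-of-models response.
# The model id and owned_by values are illustrative placeholders.
expected = {
    "object": "list",
    "data": [
        {"id": "BAAI/bge-small-en-v1.5", "object": "model", "owned_by": "infinity"},
    ],
}

# Actual: a bare model dictionary, not wrapped in a list,
# which breaks clients that iterate over response["data"].
actual = {"id": "BAAI/bge-small-en-v1.5", "object": "model", "owned_by": "infinity"}

assert isinstance(expected["data"], list)
assert "data" not in actual
```

Clients written against the OpenAI spec loop over `data`, so the bare dictionary fails even when only one model is served.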
I have worked around this here, but for this commit to be accepted upstream it would need to adhere to the expected list-of-models format.