System Info
transformers version: 4.42.4
Hardware: a single instance with 4x A100 GPUs
Who can help?
@Gante
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Hi, I am noticing that when running batch inference with Mixtral-8x7B-Instruct-v0.1, the model seems to scale nicely (sublinearly) with batch size when the input is small, but once the input gets large (more than 400 tokens), inference time starts to scale roughly linearly with batch size.
Some sample code to reproduce what I am seeing:
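(The original sample code did not survive in this thread. Below is a minimal sketch of a benchmark matching the description: the model id and the 4x A100 setup come from the issue, while the base prompt, the `input_repeat` knob (interpreting "input size" as a prompt repetition factor), and the generation settings are assumptions for illustration.)

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Mixtral has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard the model across the 4 A100s
)

# Hypothetical base prompt; repeating it controls the input length.
base_prompt = "Explain the theory of relativity in simple terms. "

for input_repeat in (1, 10, 20, 50):  # assumed "input size" values
    prompt = base_prompt * input_repeat
    for batch_size in (1, 2, 4, 8):
        inputs = tokenizer(
            [prompt] * batch_size, return_tensors="pt", padding=True
        ).to(model.device)

        # Synchronize around generate() so wall-clock time reflects GPU work.
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
        )
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        print(
            f"input_repeat={input_repeat:>3} "
            f"(prompt tokens={inputs['input_ids'].shape[1]}) "
            f"batch_size={batch_size:>2} time={elapsed:.2f}s"
        )
```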
You can see from the output that inference time scales sublinearly with batch size when the input size is less than 10. Once the input size increases beyond 20, inference time starts to scale roughly linearly with batch size.
Expected behavior
I was expecting batch inference running time to scale sublinearly with batch size regardless of the input size, at least to some extent.