huggingface / optimum-tpu

Google TPU optimizations for transformers models
Apache License 2.0

feat: use dynamic batching when generating #9

Closed tengomucho closed 6 months ago

tengomucho commented 6 months ago

What does this PR do?

This PR increments the batch size dynamically when prefill is called, and only reduces it the next time prefill is called. The intention is to avoid useless recompilation by keeping the batch size the same for as long as possible. A minimal sketch of the idea is shown below.
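For context, here is a minimal sketch of the idea, with illustrative names rather than the actual optimum-tpu API: the padded batch size only changes at prefill time, so decode steps keep a stable tensor shape and avoid triggering XLA recompilation.

```python
# Hypothetical sketch of the dynamic batching idea (names are illustrative,
# not the actual optimum-tpu code).

class DynamicBatcher:
    """Grow or shrink the padded batch size only at prefill time, so the
    compiled graph shape stays stable during decode steps."""

    def __init__(self):
        self.batch_size = 0  # padded batch size the model graph was compiled for

    def prefill(self, new_requests, active_requests):
        needed = len(active_requests) + len(new_requests)
        if needed != self.batch_size:
            # Changing the batch size changes tensor shapes, which forces an
            # XLA recompilation. That cost is only accepted here, at prefill:
            # grow if more slots are needed, shrink if fewer are.
            self.batch_size = needed
        return self.batch_size

    def decode(self, active_requests):
        # During decode the batch size is left unchanged, even if some
        # requests have finished, to avoid recompiling between steps.
        return self.batch_size


batcher = DynamicBatcher()
batcher.prefill(new_requests=["req-1", "req-2"], active_requests=[])    # -> 2 (compile once)
batcher.decode(active_requests=["req-1"])                               # -> 2 (no recompile)
batcher.prefill(new_requests=["req-3"], active_requests=["req-1"])      # -> 2 (unchanged, no recompile)
```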

mfuntowicz commented 6 months ago

Just to be sure: in this PR we are not limiting the maximum batch size the server can handle? If so, can we implement this in a follow-up PR?
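Not an implementation, just a hypothetical sketch of what such a cap might look like in a follow-up, assuming a configurable max batch size (not part of this PR):

```python
# Hypothetical: clamp the dynamically requested batch size to a server-level cap.
MAX_BATCH_SIZE = 8  # illustrative value, would come from server configuration

def clamp_batch_size(requested: int) -> int:
    # Requests beyond the cap would have to be queued or rejected upstream.
    return min(requested, MAX_BATCH_SIZE)
```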