huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

TGI: optimize continuous batching and improve export #506

Closed · dacorvo closed this 7 months ago

dacorvo commented 7 months ago

What does this PR do?

  1. This PR first modifies the TGI continuous batching implementation to take advantage of the native continuous batching support in transformers-neuronx.

Instead of dropping the KV cache whenever new requests are added and rebuilding it from the cached prompt texts, requests that are already decoding are simply omitted from the model.forward call: only the indices of the new requests are passed for prefill (see the first sketch after this list).

A Llama TGI unit test is added specifically to verify that the results are still correct after this change (for Llama and Mistral, transformers-neuronx continuous batching is always on).

  2. For SageMaker deployment, disk usage logs are added when fetching/exporting a model (second sketch below).

  3. During export, the model generation config is fetched to provide default generation values (third sketch below).
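
To make the continuous batching change concrete, here is a minimal sketch of the scheduling idea. It is not the actual optimum-neuron code: the class name, the slot bookkeeping, and the `seq_ids` keyword of the forward call are illustrative assumptions about the transformers-neuronx continuous batching interface.

```python
# Illustrative sketch only: ContinuousBatchingGenerator, the slot bookkeeping
# and the `seq_ids` keyword are assumptions, not optimum-neuron's real classes.
import torch


class ContinuousBatchingGenerator:
    """Keeps one KV-cache slot per active request."""

    def __init__(self, model, max_batch_size: int):
        self.model = model
        self.free_slots = list(range(max_batch_size))
        self.active_slots = {}  # request id -> KV-cache slot

    def prefill(self, new_requests):
        """Prefill only the new requests.

        The KV cache of requests that are already decoding is left in place:
        we pass the slot indices of the new requests instead of dropping the
        whole cache and rebuilding it from the cached prompt texts.
        """
        slots = [self.free_slots.pop() for _ in new_requests]
        for request_id, slot in zip(new_requests, slots):
            self.active_slots[request_id] = slot
        # new_requests maps request id -> prompt ids (padded to equal length)
        input_ids = torch.stack(list(new_requests.values()))
        # Assumed call shape: seq_ids tells the model which KV-cache rows to
        # fill, so rows belonging to in-flight requests are simply not touched.
        return self.model.forward(input_ids, seq_ids=torch.tensor(slots))
```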
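The disk usage logs could look like the following stdlib-only helper; the function name and the places it is called from are assumptions, not the PR's actual code.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def log_disk_usage(path: str, label: str) -> None:
    """Log total/used/free space so out-of-disk failures during model
    fetch/export are easier to diagnose on SageMaker hosts."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    logger.info(
        "%s disk usage for %s: total=%.1f GiB, used=%.1f GiB, free=%.1f GiB",
        label, path, usage.total / gib, usage.used / gib, usage.free / gib,
    )


# e.g. around the fetch/export steps:
# log_disk_usage("/tmp", "before export")
# ... fetch or export the model ...
# log_disk_usage("/tmp", "after export")
```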
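Fetching generation defaults at export time can be done with the standard transformers API; the model id and the fallback branch below are assumptions about how a missing generation_config.json might be handled.

```python
from transformers import AutoConfig, GenerationConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint, not from the PR

try:
    # Pull the checkpoint's generation_config.json so the exported model
    # keeps the same defaults (eos_token_id, max_length, sampling flags, ...).
    generation_config = GenerationConfig.from_pretrained(model_id)
except OSError:
    # Assumed fallback: derive defaults from the model config when the
    # checkpoint ships no generation_config.json.
    generation_config = GenerationConfig.from_model_config(
        AutoConfig.from_pretrained(model_id)
    )
```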