What does this PR do?
This PR modifies the TGI continuous batching implementation to take advantage of the transformers-neuronx implementation. Instead of dropping the KV cache when adding new requests and rebuilding it from cached texts, we simply omit the pending requests when calling model.forward, specifying only the indices of the new requests to prefill.

A Llama TGI unit test is added specifically to verify that the results are still correct after this change (for Llama and Mistral, transformers-neuronx continuous batching is always on).

For SageMaker deployment, disk usage logs are added when fetching/exporting a model. During export, the model generation config is fetched to provide default values.
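The scheduling idea behind the change can be sketched as follows. This is a minimal, hypothetical illustration (the class and method names are invented for clarity, not the actual TGI or transformers-neuronx API): a batch keeps one KV-cache slot per request, and adding new requests only prefills the newly assigned slots, leaving the existing slots and their cache untouched.

```python
# Hypothetical sketch of continuous batching with per-slot KV caches.
# Names (ContinuousBatch, prefill, ...) are illustrative, not the real API.

class ContinuousBatch:
    def __init__(self, max_slots):
        self.kv_cache = [None] * max_slots  # per-slot KV cache entries
        self.active = set()                 # slot indices currently decoding

    def add_requests(self, prompts):
        """Assign free slots to new prompts; return only the new indices."""
        new_indices = []
        for prompt in prompts:
            slot = next(i for i in range(len(self.kv_cache))
                        if i not in self.active)
            self.active.add(slot)
            self.kv_cache[slot] = {"prompt": prompt}
            new_indices.append(slot)
        return new_indices

    def prefill(self, new_indices):
        """Run the model only over the new slots.

        Stand-in for a model.forward call restricted to the new request
        indices; existing KV entries are never dropped or rebuilt.
        """
        return {i: self.kv_cache[i]["prompt"] for i in new_indices}

batch = ContinuousBatch(max_slots=4)
first = batch.add_requests(["hello", "world"])
batch.prefill(first)
# A later request only prefills its own slot; slots 0 and 1 keep their cache.
later = batch.add_requests(["again"])
print(later)  # [2]
```

The previous behavior corresponded to dropping every entry of `kv_cache` and re-running prefill over all active slots whenever a request arrived; restricting prefill to `new_indices` avoids that redundant recomputation.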