Closed tengomucho closed 1 month ago
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
For info, nightly tests have already been run on this branch: https://github.com/huggingface/optimum-tpu/actions/runs/11494628686/job/31992471743
LGTM!
What does this PR do?
This integrates `int8` quantization on TGI as supported on Jetstream Pytorch. This makes it possible to fit larger models for serving, such as `mistralai/Mixtral-8x7B-v0.1`. Note that some unexpected behaviour has been observed on some prompts when using other models, such as `Llama-3-70B`, so a test has been added but the model is not considered ready for deployment with the current implementation.
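For context, weight-only int8 quantization roughly halves the weight memory footprint compared to bf16 (and quarters it compared to fp32), which is what lets larger checkpoints such as Mixtral-8x7B fit on a given TPU slice. The snippet below is only a generic per-channel illustration of the idea, not the actual Jetstream Pytorch implementation; the function names and shapes are made up for the example.

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Per-channel symmetric int8 quantization of a 2D weight matrix (illustrative only)."""
    # One scale per output channel (row), chosen so the largest value maps to 127.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantized_matmul(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """Matmul against int8 weights, dequantized on the fly."""
    return x @ (q.to(x.dtype) * scale).T

if __name__ == "__main__":
    w = torch.randn(4096, 4096)          # fp32 weights: ~64 MiB
    q, scale = quantize_int8(w)          # int8 weights: ~16 MiB plus per-channel scales
    x = torch.randn(2, 4096)
    err = (x @ w.T - dequantized_matmul(x, q, scale)).abs().max()
    print(f"max abs error after quantization: {err:.4f}")
```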
Before submitting