huggingface / optimum-tpu

Google TPU optimizations for transformers models
Apache License 2.0

Quantization Jetstream Pytorch #111

Closed tengomucho closed 1 month ago

tengomucho commented 1 month ago

What does this PR do?

This integrates the int8 quantization supported by Jetstream Pytorch into TGI. It allows fitting larger models for serving, such as mistralai/Mixtral-8x7B-v0.1. Note that some unexpected behaviour has been observed on some prompts when using other models, such as Llama-3-70B, so a test has been added, but that model is not considered ready for deployment with the current implementation.
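For context, the idea behind int8 weight quantization is to store each weight matrix as 8-bit integers plus a per-channel float scale, roughly halving memory versus bf16/fp16 and making models like Mixtral-8x7B fit on the same hardware. A minimal sketch of the technique (an illustrative helper, not the actual optimum-tpu / Jetstream Pytorch implementation):

```python
# Hypothetical sketch of symmetric per-channel int8 weight quantization.
# Names below (quantize_int8, int8_matmul) are illustrative, not from the PR.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Quantize a 2D weight matrix to int8 with one scale per output row.

    Afterwards, weights ~= q.astype(np.float32) * scale.
    """
    # A per-output-channel scale keeps rows with small weights accurate.
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)
    scale = max_abs / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_matmul(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Dequantize on the fly and compute x @ W.T from the int8 weights.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
x = rng.normal(size=(2, 8)).astype(np.float32)
# The quantized matmul closely tracks the full-precision one.
print(np.max(np.abs(int8_matmul(x, q, scale) - x @ w.T)))
```

In a real serving stack the dequantize-and-matmul step is fused into the kernel rather than materializing the float weights, which is where the memory savings come from.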

HuggingFaceDocBuilderDev commented 1 month ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

tengomucho commented 1 month ago

For info, nightly tests have already been run on this branch: https://github.com/huggingface/optimum-tpu/actions/runs/11494628686/job/31992471743

baptistecolle commented 1 month ago

LGTM!