huggingface / optimum-tpu

Google TPU optimizations for transformers models
Apache License 2.0

Quantization Jetstream Pytorch #111

Closed tengomucho closed 1 month ago

tengomucho commented 1 month ago

What does this PR do?

This integrates the int8 quantization supported by Jetstream Pytorch into TGI. It allows fitting larger models for serving, such as mistralai/Mixtral-8x7B-v0.1. Note that some unexpected behaviour has been observed on some prompts when using other models, such as Llama-3-70B, so a test has been added, but that model is not considered ready for deployment with the current implementation.
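For context, the idea behind int8 weight quantization is to store each weight matrix as 8-bit integers plus a per-channel float scale, roughly halving memory versus bf16/fp16 and making models like Mixtral-8x7B fit on the same hardware. A minimal sketch of the technique (an illustrative helper, not the actual optimum-tpu / Jetstream Pytorch implementation):

```python
# Hypothetical sketch of symmetric per-channel int8 weight quantization.
# Names below (quantize_int8, int8_matmul) are illustrative, not from the PR.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Quantize a 2D weight matrix to int8 with one scale per output row.

    Afterwards, weights ~= q.astype(np.float32) * scale.
    """
    # A per-output-channel scale keeps rows with small weights accurate.
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)
    scale = max_abs / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_matmul(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Dequantize on the fly and compute x @ W.T from the int8 weights.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
x = rng.normal(size=(2, 8)).astype(np.float32)
# The quantized matmul closely tracks the full-precision one.
print(np.max(np.abs(int8_matmul(x, q, scale) - x @ w.T)))
```

In a real serving stack the dequantize-and-matmul step is fused into the kernel rather than materializing the float weights, which is where the memory savings come from.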

HuggingFaceDocBuilderDev commented 1 month ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

tengomucho commented 1 month ago

For info, nightly tests have already been run on this branch: https://github.com/huggingface/optimum-tpu/actions/runs/11494628686/job/31992471743

baptistecolle commented 1 month ago

LGTM!