huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Support exllamav2 (exl2) quantized models #1694

Closed: sapountzis closed this issue 2 months ago

sapountzis commented 6 months ago

Feature request

Add support for the exl2 quantization format via a new --quantize exl2 option, which would allow loading exllamav2-quantized models with various quantization schemes (not limited to GPTQ).
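For reference, a rough sketch of what loading and running an exl2 checkpoint looks like with exllamav2's own Python API (class and method names are taken from the exllamav2 example scripts and may differ between versions; the model path and prompt are placeholders). This is the loading path TGI would need to wrap behind the new option:

```python
# Rough sketch: load and run an exl2-quantized model with the exllamav2 library
# (API per exllamav2's bundled examples; may shift between versions).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/my-model-5.0bpw-exl2"  # placeholder path to an exl2 repo
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # allocate the KV cache lazily while loading
model.load_autosplit(cache)                # split weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("Hello, my name is", settings, num_tokens=50))
```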

Motivation

There are lots of models on HF which are only offered in either F16 or exl2 format. Our use case particularly concerns role-playing fine-tuned models quantized with a domain-specific calibration set. In our testing, the difference in throughput and output quality between bitsandbytes nf4 and calibrated exllamav2 quantization is significant.
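To make the comparison concrete, the nf4 baseline above is the standard on-the-fly 4-bit path in transformers; a minimal sketch (the model id below is only a placeholder):

```python
# Minimal sketch of the bitsandbytes nf4 baseline: on-the-fly 4-bit quantization
# at load time via transformers. The model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

exl2, by contrast, is quantized ahead of time against a calibration dataset at an arbitrary target bits-per-weight, which is where the quality difference we observe comes from.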

Implementing support for exl2-quantized models would increase the adoption of TGI as a serving framework.

Your contribution

suparious commented 5 months ago

Currently, the IBM fork of this project (IBM/text-generation-inference) has enabled exl2. I wonder if that solution could be backported here.

Narsil commented 5 months ago

There are lots of models on HF which are only offered in either F16 or exl2 format

Could you point to some?

Exl2 is definitely on our to-do list, along with Marlin.

sapountzis commented 5 months ago

@Narsil for example, the models under these profiles:

https://huggingface.co/LoneStriker https://huggingface.co/zaq-hack

Additionally, some of the RP models have been quantized using an RP calibration dataset, which produces significantly better outputs for this use case than models quantized with general-purpose datasets (e.g. wikitext).

sapountzis commented 5 months ago

@Narsil Is there any way to contribute in order to accelerate progress on implementing this?

RonanKMcGovern commented 5 months ago

There are lots of models on HF which are only offered in either F16 or exl2 format

Could you point to some?

Exl2 is definitely on our to-do list, along with Marlin.

Yeah, I think Marlin makes sense because its kernels are much better than AWQ's, making it much faster (close to fp16).

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.