Currently, the IBM fork (IBM/text-generation-inference) of this project has enabled exl2. I wonder whether that solution could be backported here.
There are lots of models on HF which are only offered in either F16 or exl2 format.
Could you point to some?
Exl2 is definitely on our todo list with Marlin.
@Narsil for example these models:
https://huggingface.co/LoneStriker https://huggingface.co/zaq-hack
Additionally, some of the RP models have been quantized using an RP calibration dataset, which produces significantly better outputs (for that use case) than models quantized with general-purpose datasets (e.g. wikitext).
@Narsil Is there any way to contribute in order to accelerate progress on implementing this?
Yeah, I think Marlin makes sense because its kernels are much better than AWQ's, making it much faster (close to fp16).
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Feature request
Add support for the exl2 quantization format via the argument --quantization exl2, which will allow loading exllamav2-quantized models with various quantization schemes (not GPTQ).
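For reference, a minimal sketch of what loading and generating from an exl2 checkpoint looks like with the exllamav2 Python library, roughly the code path TGI would need to wrap behind such a flag. Class names follow exllamav2's upstream examples; the model path and sampling values are placeholders, not anything from this issue.

```python
# Minimal sketch based on exllamav2's upstream inference example.
# The model directory below is a placeholder for any exl2-quantized checkpoint.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/my-model-exl2-4.0bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so load_autosplit can size it
model.load_autosplit(cache)               # split weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

# Generate up to 64 new tokens from a prompt.
print(generator.generate_simple("Hello, my name is", settings, 64))
```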
Motivation
There are lots of models on HF which are only offered in either F16 or exl2 format. Our use case particularly concerns role-playing fine-tuned models quantized with a domain-specific calibration set. In our testing, the difference in throughput and output quality between bitsandbytes nf4 and exllamav2's calibrated quantization is significant.
Implementing support for exl2-quantized models will increase the adoption of TGI as a serving framework.
Your contribution