Open SunMarc opened 4 months ago
Hey @SunMarc - we are planning to push most of our development into llm-compressor
and compressed-tensors,
which are the successors to this mini-repo. We are already working on integrating them into transformers (https://github.com/huggingface/transformers/pull/31704)
This supports:
We also support the following algorithms, which can be applied to fp8, int8, and int4 models:
We would prefer to put efforts related to transformers
behind this framework (including a push on fp8 and int8 compute with the cutlass kernels that we use in vllm)
A couple of other notes for fp8 on various compute capabilities:
For MoEs:
Hi neuralmagic team!
Very nice work with AutoFP8! We were thinking of integrating AutoFP8 into transformers, so that users can run your checkpoints directly with transformers. We would simply replace the linear layers with their quantized versions, so we would only support inference. Let us know if you agree with this! The goal would be to expose the quantized linear layer class from this repo (I see that you have several quantized linears) and import it in transformers.
I will be leading the integration, so any help is appreciated! Also, are there any big blockers that I might not have seen?
Thanks in advance!
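To illustrate the swap being proposed, here is a minimal, hypothetical numpy sketch of replacing a full-precision linear layer with a per-tensor fp8-style quantized version. The `Linear`, `FP8Linear`, and `replace_linears` names are made up for this sketch; it only simulates the per-tensor scaling scheme (rounding onto a uniform grid rather than using true E4M3 values) and is not AutoFP8's actual implementation or transformers' API:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in fp8 E4M3

class Linear:
    """Toy stand-in for a full-precision linear layer (not the real transformers API)."""
    def __init__(self, weight):
        self.weight = weight  # shape (out_features, in_features)

    def __call__(self, x):
        return x @ self.weight.T

class FP8Linear:
    """Hypothetical quantized replacement: one fp32 per-tensor scale plus
    weights rounded onto the scaled grid (a simplification of true fp8)."""
    def __init__(self, linear):
        # choose the scale so the largest weight maps to fp8's max magnitude
        self.scale = np.abs(linear.weight).max() / FP8_E4M3_MAX
        self.qweight = np.round(linear.weight / self.scale)

    def __call__(self, x):
        # dequantize on the fly: W ~= qweight * scale
        return x @ (self.qweight * self.scale).T

def replace_linears(layers):
    """Swap every Linear in a name->layer mapping for its quantized version."""
    return {name: FP8Linear(m) if isinstance(m, Linear) else m
            for name, m in layers.items()}
```

In a real integration the swap would walk the model's submodules rather than a dict, and the forward pass would dispatch to fused fp8 kernels instead of dequantizing to full precision.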