Open SunMarc opened 4 months ago
Hey @SunMarc - we are planning to push most of our development into llm-compressor
and compressed-tensors,
which are the successors to this mini-repo. We are already working on integrating them into transformers (https://github.com/huggingface/transformers/pull/31704)
This supports:
We also support the following algorithms, which can be applied to fp8, int8, and int4 models:
We would prefer to put efforts related to transformers
behind this framework (including a push on fp8 and int8 compute with the cutlass kernels that we use in vllm)
A couple of other notes for fp8 on various compute capabilities:
For MoEs:
Hi neuralmagic team!
Very nice work with AutoFP8! We were thinking of integrating AutoFP8 into transformers, so that users can run your checkpoints directly with transformers. We would simply replace the linear layers with their quantized versions, so we would only support inference. Let us know if you agree with this! The goal would be to expose the quantized linear layer class from this repo (I see that you have several quantized linears) and import it in transformers.
I will be leading the integration, so any help is appreciated! Also, are there any big blockers that I might not have seen?
Thanks in advance!
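To illustrate the swap being proposed, here is a minimal, hypothetical numpy sketch of replacing a full-precision linear layer with a per-tensor fp8-style quantized version. The `Linear`, `FP8Linear`, and `replace_linears` names are made up for this sketch; it only simulates the per-tensor scaling scheme (rounding onto a uniform grid rather than using true E4M3 values) and is not AutoFP8's actual implementation or transformers' API:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in fp8 E4M3

class Linear:
    """Toy stand-in for a full-precision linear layer (not the real transformers API)."""
    def __init__(self, weight):
        self.weight = weight  # shape (out_features, in_features)

    def __call__(self, x):
        return x @ self.weight.T

class FP8Linear:
    """Hypothetical quantized replacement: one fp32 per-tensor scale plus
    weights rounded onto the scaled grid (a simplification of true fp8)."""
    def __init__(self, linear):
        # choose the scale so the largest weight maps to fp8's max magnitude
        self.scale = np.abs(linear.weight).max() / FP8_E4M3_MAX
        self.qweight = np.round(linear.weight / self.scale)

    def __call__(self, x):
        # dequantize on the fly: W ~= qweight * scale
        return x @ (self.qweight * self.scale).T

def replace_linears(layers):
    """Swap every Linear in a name->layer mapping for its quantized version."""
    return {name: FP8Linear(m) if isinstance(m, Linear) else m
            for name, m in layers.items()}
```

In a real integration the swap would walk the model's submodules rather than a dict, and the forward pass would dispatch to fused fp8 kernels instead of dequantizing to full precision.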