neuralmagic / AutoFP8


Integration with Hugging Face transformers library #30

Open SunMarc opened 4 months ago

SunMarc commented 4 months ago

Hi neuralmagic team!

Very nice work with AutoFP8! We are thinking of integrating AutoFP8 into transformers so that users can run your checkpoints directly with it. We would simply replace the linear layers with their quantized versions, so we would only support inference. Let us know if you agree with this! The goal would be to expose the quantized linear layer class from this repo (I see that you have several quantized linear classes) and import it in transformers.
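For context, a minimal sketch of what such a layer swap could look like. Everything here is illustrative: `FP8Linear` and `replace_linears` are hypothetical names, and AutoFP8's actual quantized linear classes may differ in storage layout and kernel dispatch.

```python
import torch
import torch.nn as nn


class FP8Linear(nn.Module):
    """Hypothetical stand-in for a quantized linear layer.

    Stores weights in float8_e4m3fn with a per-tensor scale and
    dequantizes on the fly at inference time (weight-only style).
    """

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.register_buffer(
            "weight",
            torch.empty(out_features, in_features, dtype=torch.float8_e4m3fn),
        )
        self.register_buffer("weight_scale", torch.tensor(1.0))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize to the activation dtype, then run a normal matmul.
        w = (self.weight.to(torch.float32) * self.weight_scale).to(x.dtype)
        b = self.bias.to(x.dtype) if self.bias is not None else None
        return nn.functional.linear(x, w, b)


def replace_linears(model: nn.Module, skip=("lm_head",)) -> None:
    """Recursively swap every nn.Linear for FP8Linear, except skipped names."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and name not in skip:
            qlinear = FP8Linear(
                module.in_features, module.out_features, module.bias is not None
            )
            setattr(model, name, qlinear)
        else:
            replace_linears(module, skip)
```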

I will be leading the integration, so any help is appreciated! Also, are there any big blockers that I might not have seen?

Thanks in advance!

robertgshaw2-neuralmagic commented 4 months ago

Hey @SunMarc - we are planning to push most of our development into llm-compressor and compressed-tensors, the successors to this mini-repo, and we are already working on integrating them into transformers (https://github.com/huggingface/transformers/pull/31704).

This supports:

We also support the following algorithms, which can be applied to fp8, int8, and int4 models:

We would prefer to put transformers-related efforts behind this framework (including a surge on fp8 and int8 compute with the CUTLASS kernels we use in vLLM).
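For readers unfamiliar with the underlying scheme: the basic fp8 recipe in this family of tools is symmetric per-tensor quantization to the E4M3 format, whose largest finite value is 448. A rough sketch of that recipe (not the library's actual code):

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Symmetric per-tensor FP8 quantization, as an illustrative sketch.

    The scale maps the tensor's absolute max onto the E4M3 range;
    the dequantized value is fp8_weight.float() * scale.
    """
    scale = weight.abs().max().float().clamp(min=1e-12) / FP8_E4M3_MAX
    fp8_weight = (weight / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return fp8_weight, scale


w = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8, s = quantize_fp8_per_tensor(w)
w_deq = (w_fp8.to(torch.float32) * s).to(w.dtype)  # reconstruction used at inference
```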

robertgshaw2-neuralmagic commented 4 months ago

A couple of other notes on fp8 across compute capabilities:
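One relevant distinction here: hardware FP8 tensor cores are only available from compute capability 8.9 (Ada) and 9.0 (Hopper) onward; on older GPUs an fp8 checkpoint can still save memory, but the weights have to be dequantized to fp16/bf16 before the matmul. A hedged sketch of such a capability check (`fp8_execution_mode` is an invented helper, not an API from either library):

```python
import torch


def fp8_execution_mode() -> str:
    """Pick an fp8 execution strategy based on GPU compute capability.

    - cc >= 8.9 (Ada/Hopper): native FP8 tensor cores, so both weights
      and activations can run in fp8 (W8A8).
    - older GPUs: fp8 weights are storage-only and must be dequantized
      to fp16/bf16 before the matmul (weight-only, W8A16).
    """
    if not torch.cuda.is_available():
        return "dequantize-on-cpu"
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return "fp8-native"
    return "fp8-weight-only"
```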

robertgshaw2-neuralmagic commented 4 months ago

For MoEs:
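One common approach for fp8 MoE layers is to keep a separate per-tensor scale for each expert, so that a single outlier-heavy expert does not degrade the precision of the others. A speculative sketch under that assumption, with `quantize_moe_experts` as an invented helper:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_moe_experts(expert_weights: torch.Tensor):
    """Illustrative per-expert FP8 quantization for a fused MoE weight.

    expert_weights: [num_experts, out_features, in_features].
    Each expert gets its own scale, computed over that expert's slice only.
    """
    num_experts = expert_weights.shape[0]
    scales = expert_weights.abs().amax(dim=(1, 2)).float().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (expert_weights / scales.view(num_experts, 1, 1)).to(torch.float8_e4m3fn)
    return q, scales
```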