turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Integration with Hugging Face transformers library #461

Open SunMarc opened 1 month ago

SunMarc commented 1 month ago

Hi @turboderp !

Would you be open to integrating the exllamav2 library with HF transformers? The goal would be to make exl2 quantized models compatible with HF transformers using your kernels. We would simply replace the linear layers with their quantized versions. Moreover, we would only support inference; the conversion would still be done using your scripts.

We recently created an HfQuantizer to facilitate the integration of new quantization libraries into HF transformers: https://huggingface.co/docs/transformers/main/en/hf_quantizer - the code changes should be quite easy, and the whole community would benefit from an easy API!

See an example of a recent integration here: https://github.com/huggingface/transformers/pull/30262
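
For concreteness, here is a rough sketch of the shape such a quantizer could take. The method names follow the HfQuantizer base class from the docs above; everything else (the class name, the placeholder bodies) is hypothetical, not a finished implementation:

```python
from transformers.quantizers import HfQuantizer


class Exl2HfQuantizer(HfQuantizer):
    """Hypothetical sketch of an EXL2 quantizer; all bodies are placeholders."""

    requires_calibration = False  # weights are already quantized, inference-only

    def validate_environment(self, *args, **kwargs):
        # e.g. check that the exllamav2 package and a CUDA device are available
        pass

    def _process_model_before_weight_loading(self, model, **kwargs):
        # swap nn.Linear modules for the EXL2 linear layer so the checkpoint
        # loader has named buffers to fill
        pass

    def _process_model_after_weight_loading(self, model, **kwargs):
        # any post-load fix-ups (e.g. building kernel-side state) would go here
        pass

    @property
    def is_trainable(self):
        return False

    @property
    def is_serializable(self):
        return True
```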

I can also help with the integration if needed!

Thanks in advance

turboderp commented 1 month ago

I'm definitely open to this. There are maybe a couple of complications, though.

First off, a lot of existing EXL2 models don't have a correct model.safetensors.index.json file, because they're just copies of the original HF models with just the .safetensors files replaced. ExLlamaV2 doesn't care about this index (each .safetensors file has an embedded index in its header anyway), but I understand that the HF model loader needs it. (?)
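
For reference, that embedded index can be read straight from each shard's header; the format is an 8-byte little-endian length followed by a JSON header mapping tensor names to dtype/shape/offsets. A minimal sketch (the file name is just a placeholder):

```python
import json
import struct


def read_embedded_index(shard_path):
    # Read the JSON header that every .safetensors file carries up front.
    with open(shard_path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


# e.g. read_embedded_index("output-00001-of-00003.safetensors").keys()
```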

More problematically, for a few architectures some tensors end up being renamed during quantization. Also, fused layers are split, and tied embeddings are duplicated to produce both an FP16 embedding table (meant to typically reside in system RAM) and a quantized output layer.

The idea has been to keep the forward pass as similar as possible between architectures, so architectures are defined by a few dozen switchable parameters rather than an individual Python implementation for each. This greatly simplifies quantization, which involves loading the original model one layer at a time so that it becomes feasible on consumer devices. It also simplifies inference, which re-implements a lot of functionality in the C++ extension.

But I guess the tradeoff is that some converted models won't work with their original HF implementations, even if the linear layers are replaced. There would have to be some architecture-specific tensor mappings, at least. I guess layers could also be re-fused in their quantized state since any un-fused tuples would have the same activation order.

In any case, it wouldn't be complicated to create a linear layer class deriving from nn.Module. I guess it's basically the existing ExLlamaV2Linear class but with registered buffers so Transformers has somewhere to load the weights into. Should it rely on the C++ extensions in the exllamav2 package or would you want those functions included in Transformers directly?
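
Roughly along these lines, as a sketch only; the buffer names and shapes below are placeholders rather than the actual EXL2 tensor layout, and the forward pass is stubbed out where the kernel call would go:

```python
import torch
import torch.nn as nn


class QuantLinearSketch(nn.Module):
    # Sketch: registers buffers so the Transformers loader has named tensors
    # to load the quantized weights into. Not the real EXL2 storage format.
    def __init__(self, in_features, out_features, bias=False):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Placeholder buffers; the real layer would register the EXL2 tensors
        # (packed weights, scales, group metadata, ...) with matching shapes.
        self.register_buffer("q_weight", torch.empty((in_features, out_features), dtype=torch.int32))
        self.register_buffer("q_scale", torch.empty((out_features,), dtype=torch.float16))
        self.register_buffer("bias", torch.empty((out_features,), dtype=torch.float16) if bias else None)

    def forward(self, x):
        # The real forward would hand the buffers to the exllamav2 C++/CUDA
        # kernels; there is no pure-Python fallback in this sketch.
        raise NotImplementedError("requires the exllamav2 kernels")
```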

SunMarc commented 1 month ago

Awesome, thank you @turboderp !

> First off, a lot of existing EXL2 models don't have a correct model.safetensors.index.json file, because they're just copies of the original HF models with just the .safetensors files replaced. ExLlamaV2 doesn't care about this index (each .safetensors file has an embedded index in its header anyway), but I understand that the HF model loader needs it. (?)

Yes, this index is needed by the HF model loader. Let's first try to make it work without modifying the index, since most models didn't update it. If we can't make that work, we can expose a simple conversion script for the index.
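
Something like this rough sketch could work for the conversion script; note that the total size here is approximated from the shard file sizes rather than the exact tensor byte counts the real index uses:

```python
import glob
import json
import os

from safetensors import safe_open


def build_index(model_dir):
    # Rebuild model.safetensors.index.json from the embedded shard headers.
    weight_map, total_size = {}, 0
    for shard in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        shard_name = os.path.basename(shard)
        with safe_open(shard, framework="pt") as f:
            for key in f.keys():
                weight_map[key] = shard_name
        total_size += os.path.getsize(shard)  # approximation: includes headers
    index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
    with open(os.path.join(model_dir, "model.safetensors.index.json"), "w") as out:
        json.dump(index, out, indent=2)
```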

> More problematically, for a few architectures some tensors end up being renamed during quantization. Also, fused layers are split, and tied embeddings are duplicated to produce both an FP16 embedding table (meant to typically reside in system RAM) and a quantized output layer.

> The idea has been to keep the forward pass as similar as possible between architectures, so architectures are defined by a few dozen switchable parameters rather than an individual Python implementation for each. This greatly simplifies quantization, which involves loading the original model one layer at a time so that it becomes feasible on consumer devices. It also simplifies inference, which re-implements a lot of functionality in the C++ extension.

> But I guess the tradeoff is that some converted models won't work with their original HF implementations, even if the linear layers are replaced. There would have to be some architecture-specific tensor mappings, at least. I guess layers could also be re-fused in their quantized state since any un-fused tuples would have the same activation order.

Yes, it's not a big issue if some converted models don't work on HF transformers. We can redirect users to the exllamav2 library in that case. As for additional features such as fusing, we can potentially add them just like we did for AWQ here. We can also modify more than the linear layers for some models if you think it is worth it. For example, we had to replace the scale here for some models.

> In any case, it wouldn't be complicated to create a linear layer class deriving from nn.Module. I guess it's basically the existing ExLlamaV2Linear class but with registered buffers so Transformers has somewhere to load the weights into. Should it rely on the C++ extensions in the exllamav2 package or would you want those functions included in Transformers directly?

Yes, that's right! The easiest way would be to rely on the C++ extensions in the exllamav2 package, so we would just do `from exllamav2 import Exllamav2Linear` and replace the layers.
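
The replacement itself would be roughly this kind of recursive swap (sketch only; `quant_linear_cls` stands in for the eventual exllamav2 layer class, and the skipped module names are just an example):

```python
import torch.nn as nn


def replace_linear_layers(model, quant_linear_cls, skip=("lm_head",)):
    # Recursively swap every nn.Linear (except skipped modules) for the
    # quantized layer class, keeping the same in/out features and bias flag.
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and name not in skip:
            setattr(model, name, quant_linear_cls(
                module.in_features, module.out_features, module.bias is not None))
        else:
            replace_linear_layers(module, quant_linear_cls, skip)
    return model
```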

Thanks again!

laoda513 commented 1 month ago

It would be nice if HF could support not only inference but also training...

SunMarc commented 1 month ago

Hi @laoda513, this could be added fairly easily through a PR to peft. You can already fine-tune models that are quantized with BNB, AWQ and GPTQ.
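
For example, roughly like this with an already-quantized checkpoint (the GPTQ model id is just an illustration, and it assumes the relevant backend, e.g. auto-gptq/optimum, is installed):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load an already-quantized checkpoint through Transformers, then attach
# LoRA adapters with peft so only the adapter weights are trained.
model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ", device_map="auto")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```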

laoda513 commented 1 month ago


It shouldn't be that simple. Currently, exllamav2 does not implement differentiation (there is no backward pass), so it's not possible to train the model; using peft alone probably won't change that.

Transformers is an impressive project, but to be honest, I find it a bit cumbersome and impractical, especially for quantized models.

BNB: a veteran quantization library, but its accuracy is clearly the worst of these options, and it's very slow.

GPTQ: I mainly use this format at the moment because it can be run with exllamav2 and has broad support for training techniques. However, its accuracy is not as good as AWQ or exl2.

AWQ: this technique is very good, with good accuracy; both its built-in inference (with fused layers) and the exllamav2 inference kernels give good speed. It also integrates with HF's peft library and supports training. However, when inference needs to load a LoRA file, it becomes very slow (fused layers and the exllamav2 kernels are not supported in that case).

exl2: It has the best accuracy and speed, but does not support training.

In conclusion, you will find that there is currently no perfect quantization technique that can simultaneously achieve both accuracy and speed while supporting training. If you want to balance all aspects, you need to do quite a bit of work.

In this case, although HF supports different quantization methods (AWQ, BNB, GPTQ), it hasn't introduced any additional functionality; it simply integrates the existing functionality of those formats. Those capabilities are themselves quite complex, so while it may seem like only a small code change is needed to use Transformers with an existing format, getting deeper control through Transformers requires more effort and may not bring as much benefit as using the original projects.

Of course, the above is just my personal experience.

Transformers is a great project, and I don't think supporting exl2 is a bad thing. Thanks to Transformers and everyone for their contributions to the open-source community and AI development.