huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

QLoRA Support #381

Open sam-h-bean opened 1 year ago

sam-h-bean commented 1 year ago

Feature request

Add 4-bit quantization support once bitsandbytes releases it.

Motivation

Run larger models easily and performantly

Your contribution

I could make a PR if this is a reasonably easy first task or two.
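For context, a rough sketch of what the requested 4-bit (QLoRA-style NF4) loading looks like through transformers + bitsandbytes rather than TGI itself; the model id is a placeholder and the config assumes a transformers version that ships the 4-bit `BitsAndBytesConfig` options:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # placeholder checkpoint

# 4-bit NF4 quantization as described in the QLoRA paper
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
    bnb_4bit_use_double_quant=True,         # optional nested quantization
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

The equivalent for this server would be wiring the same bitsandbytes 4-bit path into model loading, presumably behind the existing `--quantize` launcher option.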

gsaivinay commented 1 year ago

Hello,

Wondering if you have any insight into what the inference performance of bitsandbytes 4-bit would be compared to 8-bit. Will it be any better? In my experience, 8-bit is around 8x slower than fp16. I have yet to try LLaMA GPTQ (waiting for it to be available in this server).

schauppi commented 1 year ago

Hello,

> Wondering if you have any insight into what the inference performance of bitsandbytes 4-bit would be compared to 8-bit. Will it be any better? In my experience, 8-bit is around 8x slower than fp16. I have yet to try LLaMA GPTQ (waiting for it to be available in this server).

Is it planned to support GPTQ models with this server?

OlivierDehaene commented 1 year ago

> In my experience, 8-bit is around 8x slower than fp16

Yes, bitsandbytes adds a lot of CPU overhead, and its kernels are slower than the native ones. That is expected from this type of online quantization strategy.

> what would be the inference performance of bitsandbytes 4-bit compared to 8-bit

We are working with the author of bitsandbytes, but I don't have numbers ready to share at the moment.

> GPTQ (waiting for it to be available in this server)

This will be available in the future. We need to iterate on the design a bit more, but it is already powering some of our key Hugging Face Inference API models.
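To put rough numbers on the 8-bit slowdown claim, a minimal benchmark sketch (outside TGI, via plain transformers `generate`); the model id and prompt are placeholders, and results will vary a lot by hardware:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

def time_generate(model, new_tokens=64):
    """Time a greedy decode of `new_tokens` tokens."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start

# Native fp16 weights
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
print(f"fp16: {time_generate(fp16_model):.2f}s")
del fp16_model
torch.cuda.empty_cache()

# bitsandbytes 8-bit (LLM.int8) weights
int8_model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto"
)
print(f"int8 (bitsandbytes): {time_generate(int8_model):.2f}s")
```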

tienthanhdhcn commented 1 year ago

Thanks @OlivierDehaene, is there any support for LoRA?

LoopControl commented 1 year ago

Yep, 4-bit inference with bitsandbytes is super slow.

GPTQ is pretty fast, though. On my hardware it's actually faster than fp16 inference.

There's a high-level library called AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ) that makes adding GPTQ support just a couple of lines (the original GPTQ-for-LLaMa library is tougher to integrate and tends to have random breaking changes).
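For illustration, a minimal AutoGPTQ loading sketch; the model id below is a hypothetical placeholder for any GPTQ-quantized checkpoint on the Hub, and this goes through AutoGPTQ directly rather than through this server:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Hypothetical placeholder: any GPTQ-quantized checkpoint
model_id = "someone/some-7b-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,  # assumes the checkpoint ships safetensors weights
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```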

TL;DR: Would love to see GPTQ support added. It's the only way I can load larger models.

gsaivinay commented 1 year ago

> Would love to see GPTQ support added

There is a PR open for adding GPTQ support for LLaMA (#267); not sure if it will be extended to support all the other models as well. Eagerly waiting for this.

Narsil commented 1 year ago

Not in this PR. This PR is the dirty work; there's a lot of legwork, but yes, all models will be supported as much out of the box as possible.