unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

unsloth with vllm in 8/4 bits #253

Open quancore opened 3 months ago

quancore commented 3 months ago

I have trained a QLoRA model with Unsloth and I want to serve it with vLLM, but I could not find a way to serve the model in 8/4 bits?

danielhanchen commented 3 months ago

@quancore I'm not sure if vLLM allows serving in 4 or 8 bits! 16-bit yes, but unsure on 4 or 8
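
For the 16-bit path, a minimal sketch of the usual workaround - merge the QLoRA adapter back into the base model and serve the merged fp16 checkpoint - assuming a standard PEFT-format adapter; the model name and paths below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder base checkpoint
ADAPTER_DIR = "./qlora-adapter"           # placeholder adapter directory

# Load the base model in fp16 and fold the LoRA deltas into its weights.
# (Merging a QLoRA adapter into fp16 weights is a slight approximation,
# since the adapter was trained against the 4-bit base.)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
merged = model.merge_and_unload()

merged.save_pretrained("./merged-16bit")
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained("./merged-16bit")
```

Then point vLLM at the merged folder, e.g. `python -m vllm.entrypoints.openai.api_server --model ./merged-16bit`.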

quancore commented 3 months ago

@danielhanchen I think it is: https://github.com/vllm-project/vllm/issues/1155

patleeman commented 3 months ago

> @danielhanchen I think it is: vllm-project/vllm#1155

Looks like they only support AWQ quantization, not bitsandbytes.
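
A rough sketch of that AWQ path using the AutoAWQ package (`pip install autoawq`) - this is not an Unsloth API, and the paths are placeholders:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_DIR = "./merged-16bit"  # merged fp16 checkpoint (adapter already folded in)
QUANT_DIR = "./model-awq"

# Standard AutoAWQ settings: 4-bit weights, group size 128.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized(QUANT_DIR)
tokenizer.save_pretrained(QUANT_DIR)
```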

danielhanchen commented 3 months ago

@patleeman Oh ye AWQ is great - I'm assuming you want to quantize it to AWQ?

quancore commented 3 months ago

@patleeman @danielhanchen well yes, maybe Unsloth should support AWQ export so we can use QLoRA models with vLLM?

marcelodiaz558 commented 3 months ago

Hello there. I am also interested in using an 8/4-bit model trained with Unsloth in vLLM. Currently, it works fine in 16-bit but requires too much VRAM. Is there a way to quantize a model trained with Unsloth using AWQ or GPTQ?
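
For what it's worth, once a checkpoint has been converted to AWQ (e.g. with AutoAWQ as sketched above), vLLM can load it directly - a sketch, with a placeholder path:

```python
from vllm import LLM, SamplingParams

# quantization="awq" tells vLLM to load the 4-bit AWQ weights.
llm = LLM(model="./model-awq", quantization="awq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```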

danielhanchen commented 1 month ago

Whoops, this slipped past me - yep, having an option to convert it to AWQ is interesting

Louis2B2G commented 1 month ago

> Whoops, this slipped past me - yep, having an option to convert it to AWQ is interesting

That would be amazing - is this a feature you are planning on adding in the near future?

danielhanchen commented 4 weeks ago

Yep for a future release!

amir-in-a-cynch commented 2 weeks ago

I'm down to volunteer to work on this, if you're accepting community contributions. (I have to do this for my day job anyway, so it might be nice to contribute to the library.)

Serega6678 commented 1 week ago

@amir-in-a-cynch do you plan to do it?

amir-in-a-cynch commented 1 week ago

> @amir-in-a-cynch do you plan to do it?

I'll take a stab at it tomorrow and Wednesday. Not sure if it'll end up being a clean integration with this library's API (since it adds a dependency), but in the worst case we should be able to put together an example notebook for the docs on how to do it.

Serega6678 commented 1 week ago

@amir-in-a-cynch great, keep in touch - I don't mind giving you a helping hand if you get stuck at some point

danielhanchen commented 4 days ago

I think vLLM exporting to 8 bits is through AWQ - you can also enable float8 support (if your GPU supports it)
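
A sketch of the float8 route - this assumes a vLLM version that accepts `quantization="fp8"` and an fp8-capable GPU (e.g. Hopper/Ada); the path is a placeholder:

```python
from vllm import LLM

# Weights are quantized to fp8 on the fly at load time;
# no offline conversion step is needed.
llm = LLM(model="./merged-16bit", quantization="fp8")
```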