hyperonym / basaran

Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
MIT License

QLoRA support #202

Closed: bitnom closed this issue 1 year ago

bitnom commented 1 year ago

Does it support QLoRA?

peakji commented 1 year ago

Not yet. But QLoRA, GPTQ, and 4-bit quantization are on the todo list.

LoopControl commented 1 year ago

@peakji GPTQ would be fantastic. The 4-bit implementation in bitsandbytes has very slow inference speeds (roughly 8x slower).

For GPTQ integration, AutoGPTQ is ideal since it provides a higher-level abstraction than the low-level, ever-changing gptq-for-llama repo.

0xDigest commented 1 year ago

> Not yet. But QLoRA, GPTQ, and 4-bit quantization are on the todo list.

It'd be interesting if this could support multiple LoRA adapters [0] that could be swapped using the currently unused model parameter.

[0] https://github.com/huggingface/peft/blob/main/examples/multi_adapter_examples/PEFT_Multi_LoRA_Inference.ipynb

peakji commented 1 year ago

4-bit quantization with QLoRA was added in https://github.com/hyperonym/basaran/pull/209.

Feel free to open another issue for GPTQ integration.