hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

[Question] Support for quantization algorithms that are not performed on-the-fly #5423

Open wenhuach21 opened 2 months ago

wenhuach21 commented 2 months ago

Reminder

System Info

None

Reproduction

None

Expected behavior

None

Others

Hi, thank you for the fantastic work on LLaMA-Factory! I’ve noticed that the repository supports both pre-quantized models produced by various algorithms and on-the-fly quantization.

I am curious whether LLaMA-Factory is open to contributions of quantization algorithms that are not performed on the fly. We maintain the open-source AutoRound, which serves as a strong alternative to existing methods, and we would be happy to contribute it if that works for you.
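
For context, here is a minimal sketch of how AutoRound is typically invoked through its Python API, following the examples in its README; the model name is a placeholder and exact argument names may differ across versions:

```python
# Minimal AutoRound sketch (API per the AutoRound README; signatures may
# vary by version). The model name below is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tune the rounding and clipping values via signed gradient descent,
# then quantize the weights to 4 bits with group size 128.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./llama-2-7b-4bit")
```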


hiyouga commented 2 months ago

Sure! We welcome open-source innovations to be integrated into LLaMA-Factory. Currently, we use the PEFT library to support QLoRA fine-tuning. Feel free to submit a PR, and we will review it soon.
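
For reference, the QLoRA pattern via PEFT and bitsandbytes looks roughly like the sketch below; the model name and hyperparameters are illustrative examples, not LLaMA-Factory's exact internals:

```python
# QLoRA sketch with PEFT + bitsandbytes; hyperparameters and model name are
# illustrative, not LLaMA-Factory's actual defaults.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize on the fly at load time
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # trainable adapters, frozen 4-bit base
model.print_trainable_parameters()
```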

wenhuach21 commented 2 months ago

Thank you!

I’m currently studying your code and plan to implement several changes:

- Add YAML configurations: introduce new YAML options to configure the quantization process.
- Export quantized models: export the quantized model to the GPTQ or AWQ format so it can leverage the current pipeline (a rough sketch follows this list).
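
A rough sketch of that export step, assuming AutoRound's `save_quantized` accepts a target format as shown in its README (the `"auto_gptq"` value is an assumption and may differ across versions):

```python
# Continuing from the tuned `autoround` object in the earlier sketch.
# Saving in a GPTQ-compatible layout (the `format` value follows the
# AutoRound README and is an assumption here) lets LLaMA-Factory's existing
# GPTQ loading path consume the checkpoint unchanged; an AWQ-compatible
# format would work the same way.
autoround.save_quantized("./llama-2-7b-autoround-gptq", format="auto_gptq")
```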

Potential limitations:

- Model support: currently, we only support LLMs (large language models); we do not yet have a unified API for multimodal models.
- Bits support: our focus is primarily on 4-bit precision. For 2-bit precision, GPTQ asymmetric kernels have accuracy issues, and symmetric quantization has notably lower accuracy.

Potential issue with the UI: since the quantization process takes ~20 minutes for a 7B model and ~3 hours for a 70B model on CUDA, the Web UI code may need some changes, but I am not familiar with that part.

Please let me know if you have any feedback or suggestions.

hiyouga commented 2 months ago

Hi @wenhuach21, thanks for the information. It's okay to skip support for multimodal models for now. Also, since users mostly use 4-bit quantization, it is not necessary to implement 2-bit quantization. As for the Web UI, don't worry about it; we'll take care of the Web UI support so that you can focus on the algorithms.

Best.