huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Feature Request: 4-bit IA3 Support (comparison benchmarks included) #843

Closed His-Wardship closed 1 year ago

His-Wardship commented 1 year ago

Feature request

Further to a request from @pacman100 in #794, I have performed some controlled testing of IA3 fine-tuning against 8-bit LoRA and 4-bit QLoRA, and list the results below. IA3 strongly outperforms LoRA in both training speed and memory requirements. Anecdotally, IA3 also provides significantly higher quality training results (in terms of fluency, proficiency at the fine-tuned task, etc.) after merging to the base model; however, I do not have an objective benchmark by which to measure this and so will not be making further reference to this point. If anyone can suggest an appropriate benchmark and dataset, I can attempt to produce comparison data on this, time/resources permitting.

Based on the strong performance of IA3 at 8-bit, and on the assumption that 4-bit training would offer at least some reduction in memory requirements without an outsized loss in training quality, I suggest that developing 4-bit IA3 support would be of great value in lowering the hardware requirements for high-quality fine-tuning.

Motivation

With the recent addition of 8-bit support for IA3 within PEFT, I was motivated to test it as an alternative to LoRA fine-tuning. IA3 is currently relatively unknown within the broader community and is not generally supported by GUI fine-tuning tools or other related libraries, so it was necessary either to edit the source code of existing tools to re-purpose them for IA3 or to write the training script manually. It was also necessary to write patches to ensure compatibility between libraries such as Flash Attention and IA3, as Flash Attention currently supports only FP16 and BF16 training; this was addressed by recasting the relevant layers to BF16 prior to training, as sketched below.
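As a rough illustration of the kind of recast patch involved (not the exact Axolotl change; the module-name matching below is purely illustrative):

```python
import torch

def recast_layers_to_bf16(model):
    # Illustrative helper: Flash Attention only handles FP16/BF16, so any
    # modules left in FP32 after k-bit preparation (e.g. the IA3 scaling
    # vectors and the normalisation layers) are recast to bfloat16 here.
    for name, module in model.named_modules():
        if "ia3" in name.lower() or "norm" in name.lower():
            module.to(torch.bfloat16)
    return model
```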

In the course of a personal project, I have tested sets of hyperparameters for LoRA, QLoRA and IA3 fine-tuning for comparison. All of these tests were performed using a slightly modified version of Axolotl, due to its convenient integration of optimisation tools. Modifications to the Axolotl source code were made to (i) support IA3 training in the first place and (ii) support recasting of the relevant layers for Flash Attention purposes (as the current implementation did not support IA3).
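For context, since the Axolotl IA3 path is a custom modification, the 8-bit IA3 runs correspond roughly to the following plain-PEFT setup (a simplified sketch; the target module names are the usual Llama attention/feed-forward projections and may need adjusting for other models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import IA3Config, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the base model in 8-bit, then prepare it for k-bit training.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)

# IA3 targets the attention projections and the feed-forward down projection;
# feedforward_modules must be a subset of target_modules.
ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)
model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()
```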

The testing data are below:

System Specifications:
- OS: Ubuntu 22.04
- GPU: RTX 3090 Ti
- CPU: i7-13700K

Software Specifications:
- Python: 3.10.12
- PyTorch: 2.1.0.dev20230814+cu121
- CUDA: 12.1
- PEFT: 0.5.0.dev0
- Transformers: 4.32.0.dev0

Model: Llama_2-13B
Dataset: 3,000 Alpaca-format instruction-input-output sets

While IA3 does not yet support 4-bit (hence this Issue), I nevertheless also compared it against 4-bit training to highlight how competitive IA3 is.

With Batch Size: 2, Gradient Accumulation: 2, Sequence Length: 4096

| Training Details | Training Time | VRAM Consumption |
| --- | --- | --- |
| 8-bit IA3 | 47 minutes | 22.46GB |
| 4-bit QLoRA - Rank 32, Alpha 128 | 57 minutes | 17.57GB |
| 4-bit QLoRA - Rank 128, Alpha 256 | 58 minutes | 19.86GB |
| 8-bit LoRA - Rank 32, Alpha 128 | N/A | OOM |
| 8-bit LoRA - Rank 128, Alpha 256 | N/A | OOM |

With Batch Size: 1, Gradient Accumulation: 2, Sequence Length: 2048

| Training Details | Training Time | VRAM Consumption |
| --- | --- | --- |
| 8-bit IA3 | 55 minutes | 16.64GB |
| 4-bit QLoRA - Rank 32, Alpha 128 | 1 hour 8 minutes | 11.28GB |
| 4-bit QLoRA - Rank 128, Alpha 256 | 1 hour 10 minutes | 14.42GB |
| 8-bit LoRA - Rank 32, Alpha 128 | 57 minutes | 17.84GB |
| 8-bit LoRA - Rank 128, Alpha 256 | 1 hour 3 minutes | 20.53GB |

As can be seen, IA3 uses less memory and trains faster than 8-bit LoRA. I would appreciate any suggestions on how to properly benchmark training quality, as I strongly believe IA3 outperforms LoRA in this respect also. The original IA3 paper also suggests that training quality should be closer to that of full fine-tuning.

It goes without saying that an IA3 adapter is vastly smaller than a LoRA or QLoRA adapter: depending on the number of linear modules targeted, my IA3 adapters varied in size between 2MB and 4MB. Separately, it should be added that IA3 supports high-quality training at far higher learning rates than LoRA. As such, training times can be reduced significantly by using higher learning rates with fewer steps, a combination which would generally lead to overfitting if used with LoRA. Again, I cannot produce data on this, but would appreciate guidance on benchmarking so that such data could be produced.

Your contribution

I hope the above data are useful. If training quality benchmarks can be suggested that are practicable on my home device, I am happy to attempt this. However, I have no formal training or qualification in this area (or anything related to CS - I am actually a lawyer), and had not even typed a line of Python prior to ~4 weeks ago. As such, my ability to contribute may be limited.

BenjaminBossan commented 1 year ago

Hi @His-Wardship, thanks so much for posting your benchmarks. I really hope this is going to be useful for others who may want to dip their toes into IA³ but were hesitant if it's worth it.

If you have any code that you could share, that would be awesome. Maybe it could even be added to the PEFT repo.

> However, I have no formal training or qualification in this area (or anything related to CS - I am actually a lawyer), and had not even typed a line of Python prior to ~4 weeks ago

Wow, unbelievable.

His-Wardship commented 1 year ago

> Hi @His-Wardship, thanks so much for posting your benchmarks. I really hope this is going to be useful for others who may want to dip their toes into IA³ but were hesitant if it's worth it.
>
> If you have any code that you could share, that would be awesome. Maybe it could even be added to the PEFT repo.

I've actually taken a stab at implementing 4-bit IA3 myself, and it works at least on first appearances: the training loop doesn't crash, it produces a working IA3 adapter, and the adapter behaves as expected when loaded (including in 4-bit) and used for inference. It's entirely possible (and frankly, likely!) that there are some deeper problems with it which someone with more experience than I would be able to identify, but it felt useful to get the proverbial ball rolling - and I've been able to use it for my own projects anyway. The PR is #864.
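To give an idea of the intended usage, a 4-bit IA3 run looks roughly like the following (a sketch only; the quantisation settings shown are just the usual QLoRA-style defaults, and the module names are the standard Llama ones):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import IA3Config, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit via bitsandbytes, as one would for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# With 4-bit Linear4bit support in place, the same IA3Config as in the
# 8-bit case can then be applied unchanged.
ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)
model = get_peft_model(model, ia3_config)
```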

> Wow, unbelievable.

Largely, I've just read the source code and referred to the documentation. If there's one thing nearly a decade of legal education prepared me for, it's reading documents and referring to source materials! I should add that all of this is essentially just semi-researched copy-pasting or replication; the mere fact that it works doesn't at all mean it's well-written! Hopefully the PR at least prompts some more skilled individuals to refine it into something even more efficient!