tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

Successfully ran training in 4-bit mode, but the training speed is very slow #56

Open johnsmith0031 opened 1 year ago

johnsmith0031 commented 1 year ago

Here's the code needed for these adjustments: https://github.com/johnsmith0031/alpaca_lora_4bit. I don't know why the training is so slow.

gururise commented 1 year ago

How slow compared to 8-bit?

tloen commented 1 year ago

I'm not too surprised; there aren't any good PyTorch libraries for doing int4 on the tensor cores.
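
For background, here is a minimal sketch (not the repo's actual kernel, and the packing layout is an assumption) of the naive fallback such a setup is stuck with: every forward call has to unpack and dequantize the packed int4 weights into fp16 before an ordinary matmul can run, and that repeated dequantization is the bottleneck.

```python
import torch

def dequantize_int4(packed: torch.Tensor, scales: torch.Tensor,
                    zeros: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit values per byte and rescale to fp16.

    packed: (out_features, in_features // 2) uint8, two nibbles per byte
    scales/zeros: (out_features, 1) per-row quantization parameters
    """
    low = (packed & 0x0F).to(torch.float16)
    high = (packed >> 4).to(torch.float16)
    # Interleave the nibbles back into shape (out_features, in_features).
    w = torch.stack((low, high), dim=-1).flatten(start_dim=1)
    return (w - zeros) * scales

def int4_linear(x: torch.Tensor, packed, scales, zeros) -> torch.Tensor:
    # The dequantize step runs on every call -- this extra memory traffic,
    # not the matmul itself, is what makes naive 4-bit training slow.
    w = dequantize_int4(packed, scales, zeros)
    return x @ w.t()
```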

johnsmith0031 commented 1 year ago

> How slow compared to 8-bit?

I haven't tried 8-bit yet, but 4-bit couldn't complete 3 epochs on the instruction dataset in 5 hours.

devilismyfriend commented 1 year ago

But why fine-tune in 4-bit? AFAIK it's OK for inference but extremely bad for training.

Maybe we need a way to convert the LoRA to 4-bit instead?
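
One reading of "convert the LoRA to 4-bit" is: train the adapter in higher precision, fold it into the base weights, then quantize the merged model for inference. A hedged sketch of just the merge step (tensor names are illustrative, not from the repo):

```python
import torch

def merge_lora(w: torch.Tensor, lora_a: torch.Tensor,
               lora_b: torch.Tensor, scaling: float) -> torch.Tensor:
    """Fold a LoRA adapter into the frozen base weight.

    w:      (out_features, in_features) base weight
    lora_a: (r, in_features) down-projection
    lora_b: (out_features, r) up-projection
    scaling: alpha / r in the usual LoRA parameterization
    """
    # W' = W + scaling * B @ A; the merged W' can then be quantized
    # to 4-bit for inference like any other dense weight.
    return w + scaling * (lora_b @ lora_a)
```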

gururise commented 1 year ago

> But why fine-tune in 4-bit? AFAIK it's OK for inference but extremely bad for training.

Because in 4-bit you can probably fine-tune the 30B model on a single 4090.
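
Back-of-the-envelope arithmetic supports this (LLaMA-30B is roughly 33B parameters; the 4090 has 24 GB):

```python
# Rough VRAM estimate for the base weights alone; counts are approximate.
params = 33e9                        # LLaMA-30B is roughly 33B parameters
base_fp16_gib = params * 2 / 2**30   # ~61 GiB: far beyond a 24 GB 4090
base_int4_gib = params * 0.5 / 2**30 # ~15 GiB: fits, leaving headroom for
                                     # LoRA weights, gradients, optimizer
                                     # states, and activations
print(f"fp16: {base_fp16_gib:.0f} GiB, int4: {base_int4_gib:.0f} GiB")
```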

devilismyfriend commented 1 year ago

> But why fine-tune in 4-bit? AFAIK it's OK for inference but extremely bad for training.

> Because in 4-bit you can probably fine-tune the 30B model on a single 4090.

Doesn't matter if the quality isn't good.

johnsmith0031 commented 1 year ago

Re-implemented the 4-bit matmul and increased the training speed by about 20×.
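
The repo's actual kernels aren't shown in this thread, but one way to get a speedup of this order for training is a custom autograd function that routes both the forward and the backward pass through plain fp16 matmuls (which do use the tensor cores) and skips the weight gradient entirely, since the base weight is frozen. A sketch under those assumptions, reusing the earlier `dequantize_int4` helper:

```python
import torch

class Int4MatMul(torch.autograd.Function):
    """Matmul against a frozen 4-bit weight with a hand-written backward."""

    @staticmethod
    def forward(ctx, x, packed, scales, zeros):
        w = dequantize_int4(packed, scales, zeros)  # see earlier sketch
        ctx.save_for_backward(packed, scales, zeros)
        return x @ w.t()

    @staticmethod
    def backward(ctx, grad_out):
        packed, scales, zeros = ctx.saved_tensors
        w = dequantize_int4(packed, scales, zeros)
        # Gradient wrt x only; the 4-bit base weight is not trained,
        # so no gradient is materialized for it.
        return grad_out @ w, None, None, None

# Usage: y = Int4MatMul.apply(x, packed, scales, zeros)
```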

kooshi commented 1 year ago

You're a legend. I got this running when you first posted it. Tomorrow I'm going to try to train 65B with this plus #131.

gururise commented 1 year ago

> You're a legend. I got this running when you first posted it. Tomorrow I'm going to try to train 65B with this plus #131.

@kooshi If you are training on the alpaca dataset, try using the cleaned dataset and let us know if you get better results.

johnsmith0031 commented 1 year ago

Optimized VRAM usage; you can now train a LoRA on the 30B model on a single 4090.
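
The thread doesn't spell out which optimizations were made, but the usual VRAM levers in this setting are freezing everything except the LoRA parameters (so optimizer state exists only for the tiny adapters) and gradient checkpointing. A generic PyTorch sketch, not the repo's actual code:

```python
import torch
from torch.utils.checkpoint import checkpoint

def prepare_for_lora_training(model: torch.nn.Module) -> None:
    # Freeze everything except LoRA parameters: the optimizer then keeps
    # state (e.g. Adam moments) only for the small adapter tensors.
    for name, p in model.named_parameters():
        p.requires_grad = "lora_" in name

def checkpointed_block(block: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Gradient checkpointing: skip storing intermediate activations and
    # recompute them in the backward pass, trading compute for VRAM.
    return checkpoint(block, x, use_reentrant=False)
```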

SpaceCowboy850 commented 1 year ago

Just wanted to post that I was able to train 30B on a 4090 as well using your code, johnsmith0031. Thanks for the effort!

fernando-neto-ai commented 1 year ago

Was the quality of the resulting model (30B, 4-bit) good after training in 4-bit? (I'm interested because I also own a 4090.)

PeiqinSun commented 1 year ago

We have also implemented 4-bit QLoRA; thanks to an optimized kernel implementation of back-propagation, the fine-tuning speed is currently similar to 8-bit LoRA. You're welcome to try it and file issues: https://github.com/megvii-research/Sparsebit/tree/main/large_language_models/alpaca-qlora
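
Not Sparsebit's actual API, but the core idea these 4-bit LoRA variants share is a forward pass of the form y = dequant(W_q)·x + scaling·B(A(x)), with gradients flowing only through the adapter. A self-contained sketch reusing the earlier `dequantize_int4` helper:

```python
import torch
import torch.nn as nn

class QLoRALinear(nn.Module):
    """Frozen 4-bit base weight plus a trainable low-rank adapter.

    Illustrative only: packed/scales/zeros hold the quantized base
    weight; only lora_a and lora_b receive gradients.
    """

    def __init__(self, packed, scales, zeros, in_f: int, out_f: int,
                 r: int = 8, alpha: int = 16):
        super().__init__()
        # Buffers, not Parameters: the base weight stays frozen.
        self.register_buffer("packed", packed)
        self.register_buffer("scales", scales)
        self.register_buffer("zeros", zeros)
        self.lora_a = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, r))  # zero-init: the
        self.scaling = alpha / r                           # adapter starts as a no-op

    def forward(self, x):
        w = dequantize_int4(self.packed, self.scales, self.zeros)
        return x @ w.t() + self.scaling * (x @ self.lora_a.t()) @ self.lora_b.t()
```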

thusinh1969 commented 1 year ago

Is it really that bad?! I've spent so much time and money on 4-bit over the last few days; that's not good...!