johnsmith0031 / alpaca_lora_4bit


Fine-tuning 65b #75

Closed: ehartford closed this issue 1 year ago

ehartford commented 1 year ago

I want to fine-tune 65b.

I have a 4090.

Should I switch to dual 3090s with NVLink, or add a second 4090 without NVLink?

Would either of these options enable me to fine-tune 65b?

nepeee commented 1 year ago

For the 65b version you need a lot of GPU VRAM; maybe an A100 with 80GB VRAM can do the job, but I never tested it. The current code has no support for model-parallel training, only data parallel. So you need at least one GPU with enough VRAM, and it will be extremely slow on a single GPU.
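
To make the data-parallel point concrete: with DDP every GPU holds a complete replica of the model, so a second card speeds training up but does not lower the per-GPU VRAM requirement. A minimal sketch in plain PyTorch (not this repo's code; the Linear layer is just a stand-in for the quantized LLaMA model), launched with `torchrun --nproc_per_node=2`:

```python
# Minimal data-parallel (DDP) sketch. Each rank gets a full copy of the model,
# which is why data parallelism alone does not shrink the per-GPU VRAM needed
# for the 65B weights.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

# Placeholder layer standing in for the full 4-bit LLaMA model this repo loads.
model = torch.nn.Linear(4096, 4096).cuda()
ddp_model = DDP(model, device_ids=[rank])   # full replica on every GPU

x = torch.randn(4, 4096, device="cuda")
ddp_model(x).sum().backward()               # only gradients are averaged across ranks
dist.destroy_process_group()
```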

ehartford commented 1 year ago

For 4-bit LoRA I need 80GB of VRAM?

nepeee commented 1 year ago

You need 30GB of VRAM just for the weights, some to run the model itself, some for the training data, and more for the optimizer state too. Maybe it can run on a 48GB GPU with a small batch (never tested it), but it will definitely need more if you want to train with the full 2048-token context size. With gradient accumulation you can train the 13B with the full 2048-token context at ~21GB VRAM usage.
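
As a rough back-of-envelope check on those numbers (my own approximation, not from the repo; the 200M LoRA parameter count and the per-parameter costs are assumptions, and activations plus the attention/KV memory for a 2048-token context come on top):

```python
# Rough VRAM estimate for 4-bit LoRA fine-tuning of a 65B model.
# Ignores activations, CUDA context, and quantization group overhead.
GB = 1024 ** 3

def estimate_gb(n_params: float, lora_params: float) -> float:
    weights = n_params * 0.5        # 4-bit base weights: ~0.5 bytes per parameter
    lora = lora_params * 2          # LoRA adapters kept in fp16
    grads = lora_params * 2         # gradients only for the LoRA weights
    adam = lora_params * 4 * 2      # Adam keeps two fp32 moments per trainable param
    return (weights + lora + grads + adam) / GB

# 65e9 params at 4 bits is ~30 GiB of weights alone; 200e6 LoRA params is an assumed figure.
print(f"~{estimate_gb(65e9, 200e6):.1f} GB for weights + LoRA optimizer state")
```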

ehartford commented 1 year ago

On my single 24GB 4090:

- I can finetune 7b with --cutoff_len 2048
- I can finetune 13b with --cutoff_len 2048
- I can finetune 30b, but only with --cutoff_len 1024
- I can't finetune 65b

It seems like you are saying that I still can't finetune 65b even with two cards, whether they are 3090s + NVLink or 4090s. Is that the case?

nepeee commented 1 year ago

You can, but you would need to run the model across 2 GPUs (model parallel), and as far as I know there is no 4-bit code to do that. If you have the GPUs and the time to play with the code, maybe it can work with FSDP or other model-parallel methods.
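
For reference, this is the general shape of the FSDP direction mentioned above: parameters, gradients, and optimizer state get sharded across ranks so no single GPU has to hold all of the weights. It is plain PyTorch, not wired into this repo's 4-bit code, and the stack of Linear layers is only a stand-in for the LLaMA blocks. Launch with `torchrun --nproc_per_node=2`:

```python
# Sketch of sharded model-parallel training with PyTorch FSDP. Each GPU keeps
# only its shard of the parameters (layers are gathered temporarily during
# compute), which is what would let a 65B model span two 24GB cards in principle.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Placeholder blocks standing in for the LLaMA transformer layers.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks.
sharded = FSDP(model)

x = torch.randn(2, 4096, device="cuda")
sharded(x).sum().backward()
dist.destroy_process_group()
```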