princeton-nlp / MeZO

[NeurIPS 2023] MeZO: Fine-Tuning Language Models with Just Forward Passes. https://arxiv.org/abs/2305.17333
MIT License

LoRA & p-tuning with multi-GPU #22

Open haozhouamzn opened 11 months ago

haozhouamzn commented 11 months ago

Hi, Table 20 shows prefix FT results with 2 and 4 GPUs. How were those obtained? I tried

MODEL=facebook/opt-13b TASK=SST2 MODE=prefix LR=1e-5 NUM_GPU=8 bash finetune_fsdp.sh

but got the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:5! (when checking argument for argument index in method wrapper_CUDA__index_select)
gaotianyu1350 commented 11 months ago

Hi,

You do not need FSDP for MeZO multi-GPU, since MeZO only requires model inference (forward passes). You should be able to run it directly with mezo.sh, the same way the README instructs for a single GPU, without any code or script change. Just make sure there are 2 available GPUs.
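For instance, a minimal sketch of a multi-GPU MeZO prefix-tuning run, assuming mezo.sh accepts the same MODEL/TASK/MODE/LR variables as finetune_fsdp.sh in this thread plus MeZO's EPS perturbation scale; the CUDA_VISIBLE_DEVICES pinning and the specific LR/EPS values are illustrative, not the paper's tuned hyperparameters:

# MeZO prefix-tuning spread across 2 GPUs; no FSDP needed, since only forward passes run
# LR and EPS here are illustrative placeholders -- take tuned values from the paper's grids
CUDA_VISIBLE_DEVICES=0,1 MODEL=facebook/opt-13b TASK=SST2 MODE=prefix LR=1e-3 EPS=1e-1 bash mezo.sh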

haozhouamzn commented 11 months ago

Thanks, yes, MeZO works out of the box.

What about first-order prefix FT (the "Prefix FT" column in Table 20)? The results for 13B, 30B, and 66B used FSDP, right?

gaotianyu1350 commented 10 months ago

Yes, and you should be able to run them via the following command (from the README):

# Full-parameter fine-tuning using fully-sharded data parallel (FSDP, multi-GPU)
MODEL=facebook/opt-13b TASK=SST2 MODE=ft LR=1e-5 NUM_GPU=4 bash finetune_fsdp.sh

You can change MODE to prefix or lora, as in the sketch below.
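For example, a minimal sketch of the first-order prefix and LoRA variants, assuming the same environment-variable interface as the command above; the LR values are illustrative placeholders, not the paper's tuned hyperparameters:

# First-order prefix-tuning with FSDP (LR is an illustrative placeholder)
MODEL=facebook/opt-13b TASK=SST2 MODE=prefix LR=1e-2 NUM_GPU=4 bash finetune_fsdp.sh

# First-order LoRA with FSDP (LR is an illustrative placeholder)
MODEL=facebook/opt-13b TASK=SST2 MODE=lora LR=1e-4 NUM_GPU=4 bash finetune_fsdp.sh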