microsoft / LoRA

Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
https://arxiv.org/abs/2106.09685
MIT License

Different hyper-parameters between the paper and the code? (lora_alpha and global batch size) #37

Open t-hyun opened 1 year ago

t-hyun commented 1 year ago

Hello, thank you for sharing the source code. While trying to reproduce the SST-2 result with the RoBERTa-base model, I ran into some questions about the hyper-parameters lora_alpha and the global batch size, because the paper's hyper-parameter settings and the reproduction script that does both training and evaluation (examples/NLU/roberta_base_sst2.sh) conflict in a few places.

First of all, is this the actual script you used to produce the numbers in the paper?

[screenshot of the paper's hyper-parameter table (Appendix D)]
  1. lora_alpha (8 or 16?) I'd like to know the exact lora_alpha you used in training. In Appendix D, lora_alpha is 8, but in examples/NLU/roberta_base_sst2.sh it is set to 16:

    https://github.com/microsoft/LoRA/blob/70ca1efd17b6ca4a45bbdba98554d5b312a8d48c/examples/NLU/roberta_base_sst2.sh#L24

In my evaluation runs, lora_alpha = 16 gave the better result.

Maybe you used lora_alpha = 8 in training but 16 in evaluation, or something else entirely; it is a little confusing.
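For reference, my understanding of the scaling in loralib (a minimal sketch, not the actual repo code) is that the low-rank update is multiplied by lora_alpha / r, so with r = 8 the two alpha values differ by a factor of two in the forward pass:

```python
import torch
import loralib as lora

# the low-rank update B @ A appears to be scaled by lora_alpha / r before being
# added to the frozen layer's output, so alpha controls the magnitude of delta W
layer_a8  = lora.Linear(768, 768, r=8, lora_alpha=8)   # scaling = 8 / 8  = 1.0
layer_a16 = lora.Linear(768, 768, r=8, lora_alpha=16)  # scaling = 16 / 8 = 2.0
print(layer_a8.scaling, layer_a16.scaling)  # 1.0 2.0

# forward pass is roughly W0 @ x + (lora_alpha / r) * (B @ A) @ x (dropout omitted)
x = torch.randn(4, 768)
y = layer_a16(x)  # B is zero at init, so for now this equals the frozen layer's output
```

If that is right, alpha = 16 with r = 8 doubles the contribution of the LoRA branch relative to alpha = 8, which could explain the evaluation difference.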

  2. Global batch size during training (16, 64, 128, or something else?) Appendix D says the batch size is 16, so I assumed that was the global batch size during training. However, examples/NLU/roberta_base_sst2.sh sets per_device_train_batch_size to 16 and uses 8 GPUs, which would make the global batch size 128. Moreover, the explanation at https://github.com/microsoft/LoRA/tree/main/examples/NLU#adapting-to-the-glue-benchmark says 4 GPUs were used, which would make it 64.

When the global batch size was 128, my reproduction result was lower than the paper's (94.5% accuracy). Thanks.
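For reference, this is how I computed the effective global batch size (assuming the usual HF Trainer semantics and that the script does not use gradient accumulation; both are assumptions on my side):

```python
# effective global batch size under the usual HF Trainer convention (my assumption)
per_device_train_batch_size = 16
num_gpus = 8                     # as in the script; the README mentions 4
gradient_accumulation_steps = 1  # assumed; I did not see this flag in the script

global_batch_size = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
print(global_batch_size)  # 128 with 8 GPUs, 64 with 4 GPUs, 16 only on a single GPU
```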

  3. Weight decay of the AdamW optimizer. A weight-decay value is set in examples/NLU/roberta_base_sst2.sh but is not listed in the paper for the GLUE tasks. Did you use weight decay? (See also the optimizer sketch at the end of this comment.)

I wrote down my understanding of your hyper-parameter settings as below, and I'd appreciate it if you could confirm the exact values.

[screenshot: summary table of the hyper-parameter settings]
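Regarding question 3, this is roughly how I set up the optimizer at the moment (a sketch using the loralib helper; the learning rate and weight decay below are placeholders, not values I am claiming you used):

```python
import torch
import torch.nn as nn
import loralib as lora

# toy stand-in; in practice this would be RoBERTa-base with its attention
# projections replaced by lora.Linear layers
model = nn.Sequential(
    lora.Linear(768, 768, r=8, lora_alpha=8),
    nn.ReLU(),
    nn.Linear(768, 2),  # classifier head
)

# freeze everything except the LoRA matrices (loralib helper);
# note that this also freezes the toy head above
lora.mark_only_lora_as_trainable(model)

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=5e-4,           # placeholder; the per-task LR is listed in Appendix D and the script
    weight_decay=0.1,  # placeholder; the script sets a weight-decay flag, the paper does not list one
)
```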
Bannng commented 1 year ago

I have exactly the same questions that @t-hyun mentioned. It would be really appreciated if you could respond, especially regarding lora_alpha!

I cannot clearly understand the effect of the lora_alpha / rank r ratio when merging parameters during training. Isn't setting (lora_alpha / rank r) to 2 effectively just doubling the learning rate?

Some other posts set this ratio to 1.0 as the default during training (exactly the same as fine-tuning) and then use a lower or higher ratio at inference time to interpolate the effect of fusing the updated parameters into the original pre-trained weights.

So, could you explain the effect of the alpha / rank r ratio more clearly? Thanks!
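To make my question concrete, here is a toy sketch (hypothetical values) of what I think the scaling does: the factor alpha / r multiplies the B @ A contribution both in the forward pass during training and when the update is merged into the frozen weight, so it does not look like a pure learning-rate rescale to me.

```python
import torch

d, r = 16, 4
alpha = 8
s = alpha / r                  # loralib-style scaling factor, here 2.0

W0 = torch.randn(d, d)         # frozen pre-trained weight
A  = 0.01 * torch.randn(r, d)  # LoRA A (small random init)
B  = torch.randn(d, r)         # LoRA B (zero at init in loralib; random here for illustration)

# during training the layer computes W0 @ x + s * (B @ A) @ x,
# and merging folds the same scaled update into the weight:
W_merged = W0 + s * (B @ A)
```

Or is the point just that, with Adam, changing this ratio ends up behaving roughly like changing the learning rate, as Section 4.1 of the paper seems to suggest?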

roshan-gopalakrishnan commented 11 months ago

Is there any reply to these questions?

thusinh1969 commented 4 months ago

I used rank 32 and alpha 16, lr 1e-4, and a global batch size of 128. Works well. Steve
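In loralib terms, roughly this (illustrative dimensions; my real runs go through a standard training script):

```python
import torch
import loralib as lora

layer = lora.Linear(4096, 4096, r=32, lora_alpha=16)  # alpha / r = 0.5
lora.mark_only_lora_as_trainable(layer)               # train only the LoRA matrices

optimizer = torch.optim.AdamW(
    [p for p in layer.parameters() if p.requires_grad],
    lr=1e-4,
)
# per-device batch size x num devices x gradient accumulation steps = 128 globally
```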