Open t-hyun opened 1 year ago
I have exactly the same questions that @t-hyun raised. I would really appreciate a response to his questions, especially regarding lora_alpha!
I cannot clearly understand the effect of the lora_alpha / rank r ratio when merging parameters at training time. Isn't setting (lora_alpha / r) to 2 just equivalent to doubling the learning rate?
Some other posts set this ratio to 1.0 as the default during training (exactly matching full fine-tuning) and then use a lower/higher ratio at inference time to interpolate the effect of fusing the updated parameters into the originally pre-trained ones.
So, could you explain the effect of the alpha / rank r ratio more clearly? Thanks!
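For reference, here is a minimal NumPy sketch (not code from this repo; all names and values are made up) of how the alpha / r ratio enters the LoRA forward pass: the low-rank update B @ A is multiplied by the scalar alpha / r, so doubling the ratio doubles the contribution of the update. At initialization this behaves like scaling the effective step size of the LoRA parameters, but it is not strictly identical to doubling the optimizer's learning rate once training is underway, because the scaling also changes the gradients flowing into A and B.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 4  # toy model dimension and LoRA rank

W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # LoRA "down" matrix (trained)
B = np.zeros((d, r))                 # LoRA "up" matrix, zero-initialized
B[0, 0] = 0.5                        # pretend training moved one entry

def effective_weight(alpha, r):
    # The scaling factor alpha / r multiplies the low-rank update B @ A.
    return W + (alpha / r) * (B @ A)

W_ratio_1 = effective_weight(alpha=r, r=r)      # alpha / r = 1.0
W_ratio_2 = effective_weight(alpha=2 * r, r=r)  # alpha / r = 2.0

delta_1 = W_ratio_1 - W
delta_2 = W_ratio_2 - W

# The fused update is exactly twice as large when the ratio doubles.
print(np.allclose(delta_2, 2 * delta_1))  # True
```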
Is there any reply to these questions?
I used rank 32, alpha 16, lr 1e-4, and a global batch size of 128. Works well. Steve
Hello, thank you for sharing the source code. While trying to reproduce the SST-2 result with the RoBERTa-base model, I encountered some questions regarding the hyper-parameters lora_alpha and the global batch size, since the paper's hyper-parameter settings and the reproducing script that does both training and evaluation (examples/NLU/roberta_base_sst2.sh) conflict in places. First of all, is the reproducing script the actual script that you used to produce the numbers in the paper?
In examples/NLU/roberta_base_sst2.sh, lora_alpha is set to 16: https://github.com/microsoft/LoRA/blob/70ca1efd17b6ca4a45bbdba98554d5b312a8d48c/examples/NLU/roberta_base_sst2.sh#L24
When I ran evaluation, lora_alpha 16 gave the better result.
Maybe you used lora_alpha 8 in training but lora_alpha 16 in evaluation, or something else; it's a little confusing.
In examples/NLU/roberta_base_sst2.sh, per_device_train_batch_size is 16 and the number of GPUs is 8, so the global batch size should be 128. However, the explanation at https://github.com/microsoft/LoRA/tree/main/examples/NLU#adapting-to-the-glue-benchmark says that 4 GPUs were used, which would make the global batch size 64. When I used a global batch size of 128, my reproduction result was lower than in the paper (94.5 accuracy). Thanks.
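For what it's worth, under data-parallel training the global batch size is just per_device_train_batch_size × number of GPUs × gradient-accumulation steps (assuming accumulation of 1, which the script does not appear to override). A quick sanity check of the two readings above:

```python
def global_batch_size(per_device, num_gpus, grad_accum=1):
    """Effective batch size under data-parallel training."""
    return per_device * num_gpus * grad_accum

# Script: per_device_train_batch_size = 16 on 8 GPUs
print(global_batch_size(16, 8))  # 128
# README: same per-device size on the 4 GPUs it mentions
print(global_batch_size(16, 4))  # 64
```

So the script and the README really do imply different effective batch sizes, which could plausibly explain part of the reproduction gap.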
Weight decay is set in examples/NLU/roberta_base_sst2.sh but is not mentioned in the paper (for the GLUE tasks). Did you use the weight decay parameter? I wrote down your hyper-parameter setting like this, and I'd appreciate the full specification.