microsoft / LoRA

Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
https://arxiv.org/abs/2106.09685
MIT License

Can't reproduce the results for GLUE CoLA #35

Open fxmarty opened 2 years ago

fxmarty commented 2 years ago

My steps:

git clone https://github.com/microsoft/LoRA.git
cd LoRA
pip install -e .
cd examples/NLU
pip install -e .

Change export num_gpus=8 to export num_gpus=1 in roberta_large_cola.sh
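For example, a quick way to make that edit (assuming GNU sed; editing the file by hand works just as well):

sed -i 's/export num_gpus=8/export num_gpus=1/' roberta_large_cola.sh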

Then CUDA_VISIBLE_DEVICES=0 bash roberta_large_cola.sh

Running on a single A100

Using:

During training, eval_matthews_correlation is stuck at 0 for all epochs. I actually had the same issue with the current transformers version; decreasing the learning rate and removing warmup helped me get back to okay-ish numbers during training, but nothing as good as 0.68.
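For context: a Matthews correlation of exactly 0 at every epoch usually means the classifier has collapsed to predicting a single class for every example. A quick sanity check of the metric itself, assuming scikit-learn is installed:

python -c "from sklearn.metrics import matthews_corrcoef; print(matthews_corrcoef([0, 1, 1, 0, 1], [1, 1, 1, 1, 1]))"  # prints 0.0 for constant predictions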

Do you have an idea of what I could be doing wrong?

Update: using the following,

export num_gpus=1
export CUBLAS_WORKSPACE_CONFIG=":16:8" # https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
export PYTHONHASHSEED=0
export output_dir="./roberta_cola_custom_sh"
# Changes from the original roberta_large_cola.sh:
#   per_device_train_batch_size: 4 -> 8
#   learning_rate: 3e-4 -> 2e-5
#   warmup_ratio: 0.06 -> 0.0
#   weight_decay: 0.1 -> 0.0
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
examples/text-classification/run_glue.py \
--model_name_or_path roberta-large \
--task_name cola \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 20 \
--output_dir $output_dir/model \
--logging_steps 10 \
--logging_dir $output_dir/log \
--evaluation_strategy epoch \
--save_strategy epoch \
--warmup_ratio 0.0 \
--apply_lora \
--lora_r 8 \
--lora_alpha 16 \
--seed 0 \
--weight_decay 0.0

training works fine and I no longer get eval_matthews_correlation = 0 during training.
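For reference, the run above has an effective batch size of 8 (1 GPU x 8 per device), whereas the original script uses 8 GPUs x 4 = 32. If you want to keep the original effective batch size and hyperparameters on a single GPU, gradient accumulation (a standard HF Trainer argument) should give a roughly equivalent setup. An untested sketch (the output_dir name is arbitrary, and I don't know whether this avoids the eval_matthews_correlation = 0 problem):

export num_gpus=1
export output_dir="./roberta_cola_grad_accum"
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
examples/text-classification/run_glue.py \
--model_name_or_path roberta-large \
--task_name cola \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8 \
--learning_rate 3e-4 \
--num_train_epochs 20 \
--output_dir $output_dir/model \
--evaluation_strategy epoch \
--save_strategy epoch \
--warmup_ratio 0.06 \
--apply_lora \
--lora_r 8 \
--lora_alpha 16 \
--seed 0 \
--weight_decay 0.1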

gpucce commented 1 year ago

@fxmarty Asking because I can't really get GLUE results as good as in the paper: if you have also run other GLUE tasks, did you have to apply similar changes to the remaining tasks as well?

Also, do I understand correctly that, taking the number of GPUs into account, you had to reduce the effective batch size from 4 * 8 = 32 to 8?

xijiu9 commented 1 year ago

I am facing a similar problem: when I set num_gpus=2 and add gradient_accumulation_steps=4 (which keeps the effective batch size at 32), the average over 5 random seeds on CoLA for roberta-large with LoRA is 67.0. For these numbers, the result for each run is taken from the best epoch.
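In case it is useful: that best-epoch-per-run selection can also be done by the HF Trainer itself via the standard TrainingArguments flags below, added to the run_glue.py invocation (they require matching evaluation and save strategies, which are both "epoch" here):

--load_best_model_at_end --metric_for_best_model matthews_correlation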

nbasyl commented 12 months ago

Does anyone know the solution? I am assuming that per_device_train_batch_size = 4 on a single GPU is equivalent to a total batch size of 4, which is the paper setting, but I am still getting matthews_correlation = 0 during evaluation.

MaeChd commented 5 days ago

@nbasyl Hello, I also encountered the same problem. Did you finally solve it?