huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

[PEFT on Gaudi2C] speed of Full-parameter Finetuning is almost equal to that of LoRA #952

Closed: intelyoungway closed this issue 2 months ago

intelyoungway commented 4 months ago

### Feature request

  1. [Model] chinese-alpaca-2-7b
  2. [Hardware] Gaudi2C
  3. [Method] LoRA and full-parameter fine-tuning
  4. [Related codes] examples/language_modeling
  5. [Test Cmdlines]:

[tmp_finetune.zip](https://github.com/huggingface/optimum-habana/files/15219872/tmp_finetune.zip)

### Motivation

The customer found that full-parameter fine-tuning runs at 14 train samples per second, which is close to LoRA at 16 train samples per second.
Please see the details in the feature request above and check whether there is any way to optimize LoRA for better performance.
LeoZhao-Intel commented 4 months ago

Can you attach the training logs to ease the analysis?

intelyoungway commented 4 months ago

Sure, I am asking the customer for them.

yafshar commented 3 months ago

@intelyoungway, the attached script is also doing LoRA fine-tuning. Would you clarify what the exact issue/request is?

intelyoungway commented 3 months ago

The customer said they modified the original LoRA script to do full-parameter fine-tuning (see the attached files). The issue is that the speeds of LoRA and of their modified full fine-tuning script are too close, which is strange because LoRA should be significantly faster in theory. So the request is simple: (1) if the attached file is a correct full fine-tuning setup, please provide an optimized LoRA script so that it is significantly faster than full fine-tuning; (2) if it is not, please confirm that the customer's modified script is not a correct implementation of full fine-tuning, and I will relay that to the customer and close the ticket.

yafshar commented 2 months ago

@intelyoungway, thanks for the comment. From what you said, the goal is to compare full-parameter fine-tuning with LoRA fine-tuning.

As per the original LoRA paper from Microsoft, https://arxiv.org/abs/2106.09685, full-parameter and LoRA fine-tuning are not expected to behave the same, especially when low ranks are used in LoRA. The disparity in the number of trainable parameters is the key factor here.
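As a rough illustration of that disparity (a back-of-the-envelope sketch; the hidden size and rank below are typical LLaMA-7B-style assumptions, not numbers measured from chinese-alpaca-2-7b):

```python
# Rough trainable-parameter comparison for one attention projection matrix.
# Assumptions: a LLaMA-7B-style hidden size of 4096 and LoRA rank r = 8.
hidden_size = 4096
r = 8

full_params = hidden_size * hidden_size  # full fine-tuning updates W (d x d)
lora_params = 2 * hidden_size * r        # LoRA updates only B (d x r) and A (r x d)

print(f"full fine-tuning: {full_params:,} trainable params per projection")
print(f"LoRA (r={r}):     {lora_params:,} trainable params per projection")
print(f"ratio: ~{full_params / lora_params:.0f}x fewer trainable params with LoRA")
```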

From the attached script, I see you are using the same run_lora_clm.py script, with some minor modifications, for both full-parameter and LoRA fine-tuning. If the performance is the same, the script might have an issue. It would help if you used run_clm.py for full-parameter fine-tuning and run_lora_clm.py for LoRA fine-tuning.
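As a quick sanity check (a sketch, not the exact code path of run_lora_clm.py; the model ID and LoRA hyperparameters below are illustrative assumptions), you can confirm that the LoRA run actually freezes the base weights and only trains the adapters:

```python
# Sketch: verify that the LoRA setup freezes the base model and trains only adapters.
# The model ID and LoRA hyperparameters below are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("hfl/chinese-alpaca-2-7b")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical LLaMA-style attention projections
)
peft_model = get_peft_model(model, lora_config)

# A correct LoRA setup should report only a small fraction of trainable parameters.
peft_model.print_trainable_parameters()
```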

For me or anyone else to be able to help, I need more details, especially log files, number of parameters, etc.
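If it helps, here is a minimal way to report those numbers (a generic PyTorch sketch; `model` stands for whichever model object each script ends up training):

```python
# Generic PyTorch sketch: report trainable vs. total parameters for the model
# each script actually trains ("model" is whatever is passed to the Trainer).
def report_parameter_counts(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / total params: {total:,} "
          f"({100 * trainable / total:.4f}% trainable)")
```

Full-parameter fine-tuning should report roughly 100% trainable, while a correct LoRA run on a 7B model should report well under 1%; if both scripts report similar numbers, that would explain the similar throughput.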

intelyoungway commented 2 months ago

Thanks for the explanation. I think this fulfills the need. This issue can be closed now.