hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
31.55k stars 3.88k forks

Liger kernel breaks fine-tuning #5542

Open arit2 opened 2 days ago

arit2 commented 2 days ago

Reminder

System Info

- LLaMA-Factory version: 0.9.1.dev0
- liger_kernel: 0.3.0
- transformers: 4.45.0.dev0

Reproduction

llamafactory-cli train ./examples/train_lora/qwen2vl_loraplus_dpo_2b_20_09.yaml

```yaml
### model
model_name_or_path: Qwen/Qwen2-VL-2B-Instruct

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_target: all
pref_beta: 0.3
pref_loss: sigmoid

### dataset
dataset: obrazy_rlhf_v__proba
buffer_size: 1
preprocessing_batch_size: 1
streaming: true
val_size: 260
accelerator_config:
  dispatch_batches: false
template: qwen2_vl
cutoff_len: 2748
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1

### output
output_dir: saves/qwen2_vl-2b_loraplus/25v1_beta0_5_orig
logging_steps: 500
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_checkpointing: true
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 3.0
flash_attn: auto
lr_scheduler_type: cosine
max_grad_norm: 1.0
loraplus_lr_ratio: 16.0
enable_liger_kernel: true
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
max_steps: 2200

### eval
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 200
```
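
As a side note, here is a minimal standalone check (a sketch only, not part of the original run) for whether the Liger-patched Qwen2-VL forward works outside of LLaMA-Factory. It assumes the installed liger_kernel build exposes `apply_liger_kernel_to_qwen2_vl`:

```python
# Hedged sketch: exercise the Liger-patched Qwen2-VL forward outside LLaMA-Factory.
# Assumes liger_kernel exposes apply_liger_kernel_to_qwen2_vl (not verified here).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl

# Patch the Qwen2-VL modeling code before the model is instantiated.
apply_liger_kernel_to_qwen2_vl()

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# A text-only forward pass is enough to exercise the patched layers.
inputs = processor(text=["Describe the weather today."], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs)
print(out.logits.shape)
```

If this standalone pass succeeds while the DPO run still crashes, the problem presumably sits in how the patched model interacts with the DPO trainer's concatenated forward rather than in the kernel patch itself.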

Expected behavior

Unfortunately, running the training with the Liger kernel enabled causes the error below.

My versions: liger_kernel 0.3.0, llamafactory 0.9.1.dev0, transformers 4.45.0.dev0

```
09/25/2024 12:07:58 - INFO - llamafactory.model.model_utils.liger_kernel - Liger kernel has been applied to the model.
09/25/2024 12:07:58 - INFO - llamafactory.model.model_utils.liger_kernel - Liger kernel has been applied to the model.
[INFO|modeling_utils.py:3702] 2024-09-25 12:07:58,644 >> loading weights file model.safetensors from cache at /home/python/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct/snapshots/aca78372505e6cb469c4fa6a35c60265b00ff5a4/model.safetensors.index.json
[INFO|modeling_utils.py:1621] 2024-09-25 12:07:58,653 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1097] 2024-09-25 12:07:58,654 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}
[WARNING|logging.py:328] 2024-09-25 12:07:58,688 >> Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.88s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.88s/it]
[INFO|modeling_utils.py:4544] 2024-09-25 12:08:10,541 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.
[INFO|modeling_utils.py:4552] 2024-09-25 12:08:10,541 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at Qwen/Qwen2-VL-2B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1052] 2024-09-25 12:08:10,685 >> loading configuration file generation_config.json from cache at /home/python/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct/snapshots/aca78372505e6cb469c4fa6a35c60265b00ff5a4/generation_config.json
[INFO|configuration_utils.py:1097] 2024-09-25 12:08:10,685 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "temperature": 0.01,
  "top_k": 1,
  "top_p": 0.001
}
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,down_proj,q_proj,k_proj,gate_proj,up_proj,v_proj
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.misc - Found linear modules: q_proj,v_proj,o_proj,gate_proj,down_proj,k_proj,up_proj
09/25/2024 12:08:11 - INFO - llamafactory.model.loader - trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162
09/25/2024 12:08:11 - INFO - llamafactory.model.loader - trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162
max_steps is given, it will override any value given in num_train_epochs
[WARNING|trainer.py:617] 2024-09-25 12:08:11,039 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:667] 2024-09-25 12:08:11,039 >> Using auto half precision backend
09/25/2024 12:08:11 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.
09/25/2024 12:08:11 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.
[INFO|trainer.py:2212] 2024-09-25 12:08:13,575 >> ***** Running training *****
[INFO|trainer.py:2213] 2024-09-25 12:08:13,575 >>   Num examples = 4,400
[INFO|trainer.py:2214] 2024-09-25 12:08:13,575 >>   Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:2215] 2024-09-25 12:08:13,575 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2218] 2024-09-25 12:08:13,575 >>   Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:2219] 2024-09-25 12:08:13,575 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:2220] 2024-09-25 12:08:13,575 >>   Total optimization steps = 2,200
[INFO|trainer.py:2221] 2024-09-25 12:08:13,578 >>   Number of trainable parameters = 9,232,384
  0%|          | 0/2200 [00:00<?, ?it/s]
rank0: Traceback (most recent call last):
rank0:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
rank0:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
rank0:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp
rank0:     run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
rank0:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 81, in run_dpo
rank0:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
rank0:   File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2021, in train
rank0:     return inner_training_loop(
rank0:   File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2357, in _inner_training_loop
rank0:     tr_loss_step = self.training_step(model, inputs)
rank0:   File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 3454, in training_step
rank0:     loss = self.compute_loss(model, inputs)
rank0:   File "/home/python/factory/env/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss
rank0:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
rank0:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 232, in get_batch_loss_metrics
rank0:     ) = self.concatenated_forward(model, batch)
rank0:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 182, in concatenated_forward
rank1: Traceback (most recent call last):
rank1:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
rank1:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
rank1:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp
rank1:     run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
rank1:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 81, in run_dpo
rank1:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
rank1:   File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2021, in train
rank1:     return inner_training_loop(
rank1:   File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2357, in _inner_training_loop
rank1:     tr_loss_step = self.training_step(model, inputs)
rank1:   File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 3454, in training_step
rank1:     loss = self.compute_loss(model, inputs)
rank1:   File "/home/python/factory/env/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss
rank1:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
rank1:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 232, in get_batch_loss_metrics
rank1:     ) = self.concatenated_forward(model, batch)
rank1:   File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 182, in concatenated_forward
  0%|          | 0/2200 [00:13<?, ?it/s]
E0925 12:08:30.915000 140353497219136 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3061541) of binary: /home/python/factory/env/bin/python3
Traceback (most recent call last):
  File "/home/python/factory/env/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
```
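
Both ranks fail inside `concatenated_forward` in LLaMA-Factory's DPO trainer. As rough orientation, the sketch below (a conceptual illustration with hypothetical names, not the project's actual code) shows what a DPO concatenated forward typically does at that point: chosen and rejected sequences are stacked into one batch, a single forward pass produces logits, and per-sequence log-probabilities are split back out, so any change the Liger patch makes to the model's outputs surfaces exactly in this step:

```python
# Conceptual sketch of a DPO-style concatenated forward (an illustration,
# not LLaMA-Factory's implementation): chosen and rejected sequences are run
# through the model in one batch and their sequence log-probs are split out.
import torch

def concatenated_forward_sketch(model, chosen_ids, rejected_ids, chosen_labels, rejected_labels):
    input_ids = torch.cat([chosen_ids, rejected_ids], dim=0)     # (2B, T)
    labels = torch.cat([chosen_labels, rejected_labels], dim=0)  # (2B, T), prompt tokens = -100
    logits = model(input_ids=input_ids).logits                   # (2B, T, V)

    # Shift so that position t predicts label t+1, then gather target log-probs.
    logps = torch.log_softmax(logits[:, :-1].float(), dim=-1)    # (2B, T-1, V)
    targets = labels[:, 1:].clamp(min=0)                         # (2B, T-1)
    mask = labels[:, 1:] != -100                                  # ignore prompt/pad positions
    per_token = torch.gather(logps, 2, targets.unsqueeze(-1)).squeeze(-1)
    seq_logps = (per_token * mask).sum(dim=-1)                   # (2B,)

    batch_size = chosen_ids.size(0)
    return seq_logps[:batch_size], seq_logps[batch_size:]        # chosen vs. rejected
```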

Others

No response

d223302 commented 22 hours ago

I encountered the same issue when using DPO to fine-tune Qwen2-VL. Here is my environment:

- `llamafactory` version: 0.9.1.dev0
- Platform: Linux-6.6.13-1-lts-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0+cu121
- Transformers version: 4.45.0.dev0
- Datasets version: 2.21.0
- Accelerate version: 0.34.2
- PEFT version: 0.12.0
- TRL version: 0.9.6