I encountered the same issue when using DPO to fine-tune Qwen2-VL. Here is my environment:
- `llamafactory` version: 0.9.1.dev0
- Platform: Linux-6.6.13-1-lts-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0+cu121
- Transformers version: 4.45.0.dev0
- Datasets version: 2.21.0
- Accelerate version: 0.34.2
- PEFT version: 0.12.0
- TRL version: 0.9.6
I also encountered the same issue. It seems to be caused by `enable_liger_kernel: true`; I want to use this option to reduce the memory footprint.
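If it helps to isolate the problem, here is a minimal sketch (not LLaMA-Factory code) of how one could test that hypothesis outside the trainer. It assumes `liger_kernel.transformers.apply_liger_kernel_to_qwen2_vl` is available and is the same patch that `enable_liger_kernel: true` ends up applying for this model type, and that Liger's fused linear cross-entropy path can skip materializing logits; both points are my reading, not confirmed:

```python
# Hedged sketch: check whether the Liger patch removes `logits` from the forward output,
# which is what a DPO trainer needs. The apply_* call and its effect are assumptions.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl

apply_liger_kernel_to_qwen2_vl()  # assumed equivalent of enable_liger_kernel: true

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

inputs = processor(text=["hello"], return_tensors="pt").to(model.device)
labels = inputs["input_ids"].clone()

with torch.no_grad():
    out = model(**inputs, labels=labels, return_dict=True, use_cache=False)

# If the fused linear cross-entropy kernel is active, `out.logits` may be None,
# and any later `.logits.to(torch.float32)` call would then fail.
print(type(out.logits))
```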
Fixed.
How did you fix it?
Reminder
System Info
- `llamafactory` version: 0.9.1.dev0
- `liger_kernel` version: 0.3.0
- `transformers` version: 4.45.0.dev0
Reproduction
llamafactory-cli train ./examples/train_lora/qwen2vl_loraplus_dpo_2b_20_09.yaml
```yaml
### model
model_name_or_path: Qwen/Qwen2-VL-2B-Instruct

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_target: all
pref_beta: 0.3
pref_loss: sigmoid

### dataset
dataset: obrazy_rlhf_v__proba
buffer_size: 1
preprocessing_batch_size: 1
streaming: true
val_size: 260
accelerator_config:
  dispatch_batches: false
template: qwen2_vl
cutoff_len: 2748
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1

### output
output_dir: saves/qwen2_vl-2b_loraplus/25v1_beta0_5_orig
logging_steps: 500
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_checkpointing: true
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 3.0
flash_attn: auto
lr_scheduler_type: cosine
max_grad_norm: 1.0
loraplus_lr_ratio: 16.0
enable_liger_kernel: true
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
max_steps: 2200

### eval
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 200
```
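As an aside, `pref_loss: sigmoid` with `pref_beta: 0.3` in the config above selects the standard sigmoid DPO objective. A minimal, self-contained sketch of that loss (independent of LLaMA-Factory's or TRL's own implementation), just to make explicit what those two keys control:

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.3,  # corresponds to pref_beta in the config above
) -> torch.Tensor:
    """Sigmoid DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```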
Expected behavior
Unfortunately, running the training with the Liger kernel causes the error below.
My versions: liger_kernel 0.3.0, llamafactory 0.9.1.dev0, transformers 4.45.0.dev0.
09/25/2024 12:07:58 - INFO - llamafactory.model.model_utils.liger_kernel - Liger kernel has been applied to the model.
[INFO|modeling_utils.py:3702] 2024-09-25 12:07:58,644 >> loading weights file model.safetensors from cache at /home/python/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct/snapshots/aca78372505e6cb469c4fa6a35c60265b00ff5a4/model.safetensors.index.json
[INFO|modeling_utils.py:1621] 2024-09-25 12:07:58,653 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1097] 2024-09-25 12:07:58,654 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645 }
[WARNING|logging.py:328] 2024-09-25 12:07:58,688 >> Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|████████████████| 2/2 [00:11<00:00, 5.88s/it]
[INFO|modeling_utils.py:4544] 2024-09-25 12:08:10,541 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.
[INFO|modeling_utils.py:4552] 2024-09-25 12:08:10,541 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at Qwen/Qwen2-VL-2B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1052] 2024-09-25 12:08:10,685 >> loading configuration file generation_config.json from cache at /home/python/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct/snapshots/aca78372505e6cb469c4fa6a35c60265b00ff5a4/generation_config.json
[INFO|configuration_utils.py:1097] 2024-09-25 12:08:10,685 >> Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "temperature": 0.01, "top_k": 1, "top_p": 0.001 }
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,down_proj,q_proj,k_proj,gate_proj,up_proj,v_proj
09/25/2024 12:08:11 - INFO - llamafactory.model.loader - trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162
[WARNING|trainer.py:617] 2024-09-25 12:08:11,039 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:667] 2024-09-25 12:08:11,039 >> Using auto half precision backend
09/25/2024 12:08:11 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.
[INFO|trainer.py:2212] 2024-09-25 12:08:13,575 >> Running training
[INFO|trainer.py:2213] 2024-09-25 12:08:13,575 >> Num examples = 4,400
[INFO|trainer.py:2214] 2024-09-25 12:08:13,575 >> Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:2215] 2024-09-25 12:08:13,575 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2218] 2024-09-25 12:08:13,575 >> Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:2219] 2024-09-25 12:08:13,575 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2220] 2024-09-25 12:08:13,575 >> Total optimization steps = 2,200
[INFO|trainer.py:2221] 2024-09-25 12:08:13,578 >> Number of trainable parameters = 9,232,384
0%| | 0/2200 [00:00<?, ?it/s]
rank0: Traceback (most recent call last):
rank0: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
rank0: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
rank0: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp rank0: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks) rank0: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 81, in run_dpo rank0: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
rank0: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2021, in train rank0: return inner_training_loop(
rank0: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2357, in _inner_training_loop rank0: tr_loss_step = self.training_step(model, inputs)
rank0: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 3454, in training_step rank0: loss = self.compute_loss(model, inputs)
rank0: File "/home/python/factory/env/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss rank0: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
rank0: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 232, in get_batch_loss_metrics rank0: ) = self.concatenated_forward(model, batch)
rank0: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 182, in concatenated_forward
rank1: (identical traceback as rank0 above)
0%| | 0/2200 [00:13<?, ?it/s]
E0925 12:08:30.915000 140353497219136 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3061541) of binary: /home/python/factory/env/bin/python3
Traceback (most recent call last):
  File "/home/python/factory/env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
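The actual exception message is cut off in the paste, but the last frame on both ranks is `concatenated_forward` in dpo/trainer.py line 182, which, as far as I can tell, reads the model's logits. If Liger's fused linear cross-entropy kernel skips materializing logits, that read would fail. For what it's worth, a hedged sketch of one possible mitigation, assuming you launch training from your own Python entry point rather than llamafactory-cli, and assuming the `fused_linear_cross_entropy` flag exists on the Qwen2-VL helper as it does on Liger's other `apply_liger_kernel_to_*` helpers (not a confirmed fix):

```python
# Hedged workaround sketch, not a confirmed fix: apply the Liger kernels manually
# but keep the standard lm_head + logits path so a DPO forward can still read logits.
# The flag name is assumed from Liger's other apply_* helpers.
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl

apply_liger_kernel_to_qwen2_vl(
    fused_linear_cross_entropy=False,  # keep logits materialized (default is assumed True)
)
# The remaining kernels (RMSNorm, SwiGLU, etc.) keep their defaults, so most of the
# memory savings remain while `outputs.logits` stays available.
```

Otherwise, setting `enable_liger_kernel: false` in the YAML avoids the crash at the cost of the memory savings.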
Others
No response