xuyue1112 opened 1 month ago
Same issue here, with many images per sample at different resolutions (1280...). The final fine-tuning resolution is 1024.
@hiyouga, can you please take a look? This issue is annoying, and any debugging pointers would help, as your code is tough to follow.
Thanks, Steve
same same!!!
Well, it seems that llama-factory simply doesn't support it... https://github.com/hiyouga/LLaMA-Factory/issues/5657
Reminder
System Info
llamafactory version: 0.9.1.dev0, with transformers at commit 21fac7abba2a37fae86106f87fcf9974fd1e3830
Reproduction
Launch command: FORCE_TORCHRUN=1 NPROC_PER_NODE=$ARNOLD_WORKER_GPU NNODES=$ARNOLD_WORKER_NUM NODE_RANK=$ARNOLD_ID RANK=$ARNOLD_ID MASTER_ADDR=$METIS_WORKER_0_HOST MASTER_PORT=$port llamafactory-cli train xxx.yaml
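The error below itself suggests setting TORCH_DISTRIBUTED_DEBUG to localize the problem. A sketch of the same launch with that variable prepended (reusing the placeholder xxx.yaml from above):

```shell
# Re-run the same job with PyTorch's distributed debug output enabled,
# so each rank prints which parameters received no gradient.
# DETAIL is more verbose than INFO.
TORCH_DISTRIBUTED_DEBUG=DETAIL \
FORCE_TORCHRUN=1 NPROC_PER_NODE=$ARNOLD_WORKER_GPU NNODES=$ARNOLD_WORKER_NUM \
NODE_RANK=$ARNOLD_ID RANK=$ARNOLD_ID MASTER_ADDR=$METIS_WORKER_0_HOST \
MASTER_PORT=$port llamafactory-cli train xxx.yaml
```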
YAML config:

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
freeze_vision_tower: false

### dataset
dataset: xxx
eval_dataset: xxx
template: qwen2_vl
cutoff_len: 8192
overwrite_cache: true
preprocessing_num_workers: 120

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
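Not part of the original config, but a commonly suggested mitigation for this class of DDP error: Hugging Face `TrainingArguments` exposes `ddp_find_unused_parameters`, and LLaMA-Factory yaml keys are generally forwarded to it. A hedged sketch of the extra key (assuming pass-through works in this version):

```yaml
### train (additional key; assumes it reaches HF TrainingArguments)
ddp_find_unused_parameters: true   # let the DDP reducer tolerate params with no grad
```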
The run then fails with:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 20: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error.

Expected behavior
I want to include the vision encoder in SFT to improve model quality. After 1000+ steps, the error above occurs. With the same data and all other settings unchanged, SFT runs normally if freeze_vision_tower=false is not set.
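A minimal single-process sketch (not from the report; gloo backend on CPU, world size 1, hypothetical module names) of the mechanism behind the error: a submodule that never runs in forward() leaves its parameters without gradients, and the DDP reducer must be told to expect that via find_unused_parameters=True:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so DDP can be constructed without torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class TwoBranch(torch.nn.Module):
    """Toy stand-in: one branch participates in the loss, one never runs."""
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.skipped = torch.nn.Linear(4, 4)  # like a tower that is sometimes bypassed
    def forward(self, x):
        return self.used(x)  # self.skipped takes no part in producing the loss

# With find_unused_parameters=True the reducer marks self.skipped as ready
# instead of waiting forever for its gradients.
model = DDP(TwoBranch(), find_unused_parameters=True)
loss = model(torch.randn(2, 4)).sum()
loss.backward()

print(model.module.used.weight.grad is not None)  # gradient arrived
print(model.module.skipped.weight.grad is None)   # no gradient, but no crash

dist.destroy_process_group()
```

With find_unused_parameters=False (the DDP default), the same forward/backward pattern is what produces the "did not receive grad" failure reported above.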
Others
None