xuyue1112 opened 1 month ago
Same issue here, with many images per sample at different resolutions (1280...). The final fine-tuning resolution is 1024.
@hiyouga, can you please take a look? This issue is annoying, and any debugging pointers would help, as your code is tough to follow.
Thanks, Steve
same same!!!
Well, it seems that llama-factory simply doesn't support it... https://github.com/hiyouga/LLaMA-Factory/issues/5657
Reminder
System Info
llamafactory version: 0.9.1.dev0, with transformers at commit 21fac7abba2a37fae86106f87fcf9974fd1e3830
Reproduction
Launch command: FORCE_TORCHRUN=1 NPROC_PER_NODE=$ARNOLD_WORKER_GPU NNODES=$ARNOLD_WORKER_NUM NODE_RANK=$ARNOLD_ID RANK=$ARNOLD_ID MASTER_ADDR=$METIS_WORKER_0_HOST MASTER_PORT=$port llamafactory-cli train xxx.yaml
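The error below itself suggests setting TORCH_DISTRIBUTED_DEBUG to localize the problem. A sketch of the same launch with that variable prepended (reusing the placeholder xxx.yaml from above):

```shell
# Re-run the same job with PyTorch's distributed debug output enabled,
# so each rank prints which parameters received no gradient.
# DETAIL is more verbose than INFO.
TORCH_DISTRIBUTED_DEBUG=DETAIL \
FORCE_TORCHRUN=1 NPROC_PER_NODE=$ARNOLD_WORKER_GPU NNODES=$ARNOLD_WORKER_NUM \
NODE_RANK=$ARNOLD_ID RANK=$ARNOLD_ID MASTER_ADDR=$METIS_WORKER_0_HOST \
MASTER_PORT=$port llamafactory-cli train xxx.yaml
```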
YAML config:

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
freeze_vision_tower: false

### dataset
dataset: xxx
eval_dataset: xxx
template: qwen2_vl
cutoff_len: 8192
overwrite_cache: true
preprocessing_num_workers: 120

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
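Not part of the original config, but a commonly suggested mitigation for this class of DDP error: Hugging Face `TrainingArguments` exposes `ddp_find_unused_parameters`, and LLaMA-Factory yaml keys are generally forwarded to it. A hedged sketch of the extra key (assuming pass-through works in this version):

```yaml
### train (additional key; assumes it reaches HF TrainingArguments)
ddp_find_unused_parameters: true   # let the DDP reducer tolerate params with no grad
```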
The run then fails with:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 20: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error.

Expected behavior
I want to include the vision encoder in SFT to improve model quality. After 1000+ steps, the error above occurs. With the same data and all other settings unchanged, SFT runs normally if freeze_vision_tower=false is not set.
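A minimal single-process sketch (not from the report; gloo backend on CPU, world size 1, hypothetical module names) of the mechanism behind the error: a submodule that never runs in forward() leaves its parameters without gradients, and the DDP reducer must be told to expect that via find_unused_parameters=True:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so DDP can be constructed without torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class TwoBranch(torch.nn.Module):
    """Toy stand-in: one branch participates in the loss, one never runs."""
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.skipped = torch.nn.Linear(4, 4)  # like a tower that is sometimes bypassed
    def forward(self, x):
        return self.used(x)  # self.skipped takes no part in producing the loss

# With find_unused_parameters=True the reducer marks self.skipped as ready
# instead of waiting forever for its gradients.
model = DDP(TwoBranch(), find_unused_parameters=True)
loss = model(torch.randn(2, 4)).sum()
loss.backward()

print(model.module.used.weight.grad is not None)  # gradient arrived
print(model.module.skipped.weight.grad is None)   # no gradient, but no crash

dist.destroy_process_group()
```

With find_unused_parameters=False (the DDP default), the same forward/backward pattern is what produces the "did not receive grad" failure reported above.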
Others
None