hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Error during training when full-parameter fine-tuning only the visual.merger part of Qwen2-VL-7B-Instruct with all other model parameters frozen #5472

Open wjx-sudo opened 2 months ago

wjx-sudo commented 2 months ago

Reminder

System Info

### model
model_name_or_path: /Qwen2-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
train_mm_proj_only: true  # train the multimodal projector only
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: mllm_demo,identity
template: qwen2_vl
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2_vl-7b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
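
For reference, the error reported below is consistent with no parameter on the loss path requiring gradients. The snippet here is a minimal sketch, outside LLaMA-Factory's actual code path: it emulates the intent of `train_mm_proj_only: true` by keeping only parameters whose names contain `visual.merger` trainable and checks that anything remains trainable. The `AutoModelForVision2Seq` loader and the name filter are assumptions, not what LLaMA-Factory does internally.

```python
# Sketch only: emulate "train only the merger" and verify that at least one
# parameter would still require gradients. Loads the full checkpoint.
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("/Qwen2-VL-7B-Instruct")  # path from the config above
for name, param in model.named_parameters():
    # Assumed filter: Qwen2-VL's multimodal projector lives under "visual.merger".
    param.requires_grad = "visual.merger" in name

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"trainable tensors: {len(trainable)}")
assert trainable, "nothing requires grad -> backward() raises the RuntimeError below"
```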

Reproduction

File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/accelerate/accelerator.py", line 2143, in backward self.deepspeed_engine_wrapped.backward(loss, kwargs) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 166, in backward self.engine.backward(loss, kwargs) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1976, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward scaled_loss.backward(retain_graph=retain_graph) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/torch/_tensor.py", line 522, in backward torch.autograd.backward( File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/torch/autograd/init.py", line 266, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Expected behavior

No response

Others

No response

nemonameless commented 1 month ago

I noticed that even finetuning_type: full does not train 100% of the parameters.

GoGoZeppeli-towa commented 1 month ago

> I noticed that even finetuning_type: full does not train 100% of the parameters.

Perhaps that is because freeze_vision_tower defaults to true?
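
One way to check this guess empirically, without digging through the trainer, is to summarize trainable parameters per top-level submodule after the model has been built and frozen; if `visual` (the vision tower) reports zero trainable parameters, it was indeed frozen. A sketch, with the helper name `summarize_trainable` made up for illustration:

```python
# Sketch: report trainable vs. total parameters per top-level submodule,
# e.g. to see whether "visual" ended up frozen despite finetuning_type: full.
from collections import defaultdict

import torch

def summarize_trainable(model: torch.nn.Module):
    stats = defaultdict(lambda: [0, 0])  # prefix -> [trainable, total]
    for name, param in model.named_parameters():
        prefix = name.split(".")[0]
        stats[prefix][1] += param.numel()
        if param.requires_grad:
            stats[prefix][0] += param.numel()
    for prefix, (trainable, total) in stats.items():
        print(f"{prefix}: {trainable:,} / {total:,} trainable")

# summarize_trainable(model)  # call after the trainer has applied its freezing logic
```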

wjx-sudo commented 1 month ago

> I noticed that even finetuning_type: full does not train 100% of the parameters.

Indeed. I tried adding the vision_tower parameters to the trainable set as well, but training hangs, so I can only fine-tune the LLM part.

nemonameless commented 1 month ago

With freeze_vision_tower set to true, I found training on my own dataset gives worse results than with it set to false.

will-wiki commented 1 month ago

@wjx-sudo Same problem here: when training llm-lora + merger in non-streaming mode, it hangs after only one training step. Did you manage to solve it?
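
For what it's worth, a hang shortly after the first step in multi-GPU runs is often caused by trainable parameters that never receive gradients, so the ranks disagree during gradient reduction. A sketch of a quick check (the helper name is made up; run it after the first `backward()` in a single-GPU run without DeepSpeed, since ZeRO manages `param.grad` itself):

```python
# Sketch: list trainable parameters that received no gradient in the last backward().
from typing import List

import torch

def params_without_grad(model: torch.nn.Module) -> List[str]:
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]

# Toy usage: the "unused" branch never contributes to the loss, so it is reported.
model = torch.nn.ModuleDict({"used": torch.nn.Linear(4, 1), "unused": torch.nn.Linear(4, 1)})
loss = model["used"](torch.randn(2, 4)).sum()
loss.backward()
print(params_without_grad(model))  # ['unused.weight', 'unused.bias']
```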

piDack commented 1 month ago

Still waiting for some kind soul to post a solution.

Michael4933 commented 1 week ago

LLaMA-Factory's support for training the ViT and the connector really doesn't seem well implemented; it looks like it is simply not supported.