wjx-sudo opened this issue 2 months ago
I've found that `finetuning_type: full` doesn't actually train 100% of the parameters either.
Perhaps it's because `freeze_vision_tower` defaults to true?
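If that is the cause, the flag can be overridden in the training config. A minimal sketch, assuming LLaMA-Factory's `freeze_vision_tower` argument is honored under `finetuning_type: full` (verify against your version):

```yaml
### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: false  # assumption: also train the ViT instead of freezing it
```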
Indeed. I tried adding the vision_tower parameters to the trainable set as well, but training hangs partway through; only the LLM part can be fine-tuned.
Setting `freeze_vision_tower` to true gives worse results on my own dataset than setting it to false.
@wjx-sudo Same problem here: with non-streaming training of llm-lora + merger, it gets stuck after just one step. Did you manage to solve it?
Waiting for a kind soul to post a solution.
LLaMA-Factory's support for training the ViT and the connector indeed doesn't seem well developed; it appears to be simply unsupported.
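One way to check what a given config actually trains is to enumerate the parameters with `requires_grad` set after the model is built. A minimal sketch (the standalone load is illustrative; in a real run, inspect the model object the trainer constructs instead):

```python
from transformers import Qwen2VLForConditionalGeneration

# Illustrative load; in a LLaMA-Factory run, inspect the trainer's model instead.
model = Qwen2VLForConditionalGeneration.from_pretrained("/Qwen2-VL-7B-Instruct")

total = trainable = 0
for name, param in model.named_parameters():
    total += param.numel()
    if param.requires_grad:
        trainable += param.numel()
    # print(name, param.requires_grad)  # uncomment to see which modules are frozen

print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

If the percentage is well below 100% under `finetuning_type: full`, some modules (e.g. the vision tower) are being frozen despite the setting.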
Reminder
System Info
```yaml
### model
model_name_or_path: /Qwen2-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
train_mm_proj_only: true  # train only the multimodal projector
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: mllm_demo,identity
template: qwen2_vl
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2_vl-7b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
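Note that `train_mm_proj_only: true` is expected to freeze everything except the multimodal projector, overriding the intent of `finetuning_type: full`. A hypothetical sketch of the freezing this implies (Qwen2-VL's projector is the `visual.merger` module; the actual LLaMA-Factory implementation may differ):

```python
# Hypothetical illustration of train_mm_proj_only, not LLaMA-Factory's actual code.
for name, param in model.named_parameters():
    # Keep only Qwen2-VL's multimodal projector (visual.merger) trainable.
    param.requires_grad = "merger" in name
```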
Reproduction
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/accelerate/accelerator.py", line 2143, in backward self.deepspeed_engine_wrapped.backward(loss, kwargs) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 166, in backward self.engine.backward(loss, kwargs) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1976, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward scaled_loss.backward(retain_graph=retain_graph) File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/torch/_tensor.py", line 522, in backward torch.autograd.backward( File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/torch/autograd/init.py", line 266, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Expected behavior
No response
Others
No response