Open · 66RomanReigns opened this issue 1 day ago

System Info
[INFO|trainer.py:4117] 2024-11-29 00:18:59,105 >> Running Evaluation
[INFO|trainer.py:4119] 2024-11-29 00:18:59,105 >> Num examples = 100
[INFO|trainer.py:4122] 2024-11-29 00:18:59,105 >> Batch size = 4
{'eval_loss': 0.22188051044940948, 'eval_runtime': 16.9525, 'eval_samples_per_second': 5.899, 'eval_steps_per_second': 0.767, 'epoch': 0.16}
  5%|██ | 36/675 [05:01<1:58:38, 11.14s/it]
Invalidate trace cache @ step 2: expected module 1, but got module 332

llamafactory version: 0.9.1.dev0

Reproduction

model
model_name_or_path: Qwen2-VL-7B-Instruct

method
stage: sft
do_train: true
finetuning_type: full
use_badam: true
badam_mode: layer
badam_switch_mode: ascending
badam_switch_interval: 50
badam_verbose: 2

dataset
dataset: mire_train
# video: mllm_video_demo
template: qwen2_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 8
val_size: 0.1

output
output_dir: saves/qwen2_vl-7b/fulll/sft
logging_steps: 10
save_steps: 0.2
plot_loss: true
overwrite_output_dir: true

train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

eval
val_size: 0.1
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 5

flash_attn: fa2
deepspeed: examples/deepspeed/ds_z3_offload_config.json
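
The DeepSpeed file referenced above is one of LLaMA-Factory's example configs for ZeRO stage 3 with CPU offload. Its exact contents may differ from what is shown here; the following is only a minimal sketch of what such a ZeRO-3 offload config typically contains, with all values illustrative rather than copied from the repository:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

For context, the "Invalidate trace cache" line in the log appears to come from DeepSpeed's ZeRO-3 parameter prefetching, which records the order in which submodules execute and invalidates that trace when the order changes between steps.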

Expected behavior
How can this be fixed? After this message appears, training hangs and makes no further progress, so I have to stop the run. Thanks for any help!

Others
No response
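
For completeness, a full-parameter SFT run with the config in the Reproduction section above is normally launched through the LLaMA-Factory CLI. A typical invocation might look like the line below; the file name qwen2vl_full_sft.yaml is only an assumed name for the YAML shown above, and FORCE_TORCHRUN=1 is the switch used in LLaMA-Factory's DeepSpeed examples to force a multi-GPU torchrun launch:

FORCE_TORCHRUN=1 llamafactory-cli train qwen2vl_full_sft.yaml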

Is this reproducible? Or does it only happen occasionally?

It seems to happen when I use DeepSpeed. Have you encountered it before? Thanks!