hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

When training with BAdam on multiple GPUs, "Invalidate trace cache @ step 2: expected module 1, but got module 332" appears mid-run and training stops #6184

Open 66RomanReigns opened 1 day ago

66RomanReigns commented 1 day ago

Reminder

System Info

[INFO|trainer.py:4117] 2024-11-29 00:18:59,105 >> Running Evaluation
[INFO|trainer.py:4119] 2024-11-29 00:18:59,105 >> Num examples = 100
[INFO|trainer.py:4122] 2024-11-29 00:18:59,105 >> Batch size = 4
{'eval_loss': 0.22188051044940948, 'eval_runtime': 16.9525, 'eval_samples_per_second': 5.899, 'eval_steps_per_second': 0.767, 'epoch': 0.16}
  5%|██        | 36/675 [05:01<1:58:38, 11.14s/it]
Invalidate trace cache @ step 2: expected module 1, but got module 332

Reproduction

model

model_name_or_path: Qwen2-VL-7B-Instruct

method

stage: sft
do_train: true
finetuning_type: full
use_badam: true
badam_mode: layer
badam_switch_mode: ascending
badam_switch_interval: 50
badam_verbose: 2

dataset

dataset: mire_train  # video: mllm_video_demo
template: qwen2_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 8
val_size: 0.1

output

output_dir: saves/qwen2_vl-7b/fulll/sft
logging_steps: 10
save_steps: 0.2
plot_loss: true
overwrite_output_dir: true

train

per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

eval

val_size: 0.1
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 5

flash_attn: fa2
deepspeed: examples/deepspeed/ds_z3_offload_config.json
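For context, the referenced DeepSpeed file is the repo's ZeRO-3 offload example. Below is a minimal sketch of the kind of settings such a config typically contains (ZeRO stage 3 with optimizer and parameter offload to CPU); the exact contents of examples/deepspeed/ds_z3_offload_config.json in the repository may differ.

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}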

Expected behavior

How can this be resolved? After this message appears, training hangs and then stops. Thanks for your help!

Others

No response

Ledzy commented 2 hours ago

Is this reproducible, or does it only happen occasionally?

66RomanReigns commented 13 minutes ago

Is this reproducible, or does it only happen occasionally?

It seems to happen when I use DeepSpeed. Have you encountered it before? Thanks!