hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

When training with BAdam on multiple GPUs, "Invalidate trace cache @ step 2: expected module 1, but got module 332" appears mid-run and training stops #6184

Open 66RomanReigns opened 1 day ago

66RomanReigns commented 1 day ago

Reminder

System Info

[INFO|trainer.py:4117] 2024-11-29 00:18:59,105 >> Running Evaluation
[INFO|trainer.py:4119] 2024-11-29 00:18:59,105 >> Num examples = 100
[INFO|trainer.py:4122] 2024-11-29 00:18:59,105 >> Batch size = 4
{'eval_loss': 0.22188051044940948, 'eval_runtime': 16.9525, 'eval_samples_per_second': 5.899, 'eval_steps_per_second': 0.767, 'epoch': 0.16}
  5%|██        | 36/675 [05:01<1:58:38, 11.14s/it]
Invalidate trace cache @ step 2: expected module 1, but got module 332

Reproduction

model

model_name_or_path: Qwen2-VL-7B-Instruct

method

stage: sft
do_train: true
finetuning_type: full
use_badam: true
badam_mode: layer
badam_switch_mode: ascending
badam_switch_interval: 50
badam_verbose: 2

dataset

dataset: mire_train  # video: mllm_video_demo
template: qwen2_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 8
val_size: 0.1

output

output_dir: saves/qwen2_vl-7b/fulll/sft
logging_steps: 10
save_steps: 0.2
plot_loss: true
overwrite_output_dir: true

train

per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

eval

val_size: 0.1
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 5

flash_attn: fa2
deepspeed: examples/deepspeed/ds_z3_offload_config.json
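For context, the referenced DeepSpeed file is the repo's ZeRO-3 offload example. Below is a minimal sketch of the kind of settings such a config typically contains (ZeRO stage 3 with optimizer and parameter offload to CPU); the exact contents of examples/deepspeed/ds_z3_offload_config.json in the repository may differ.

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}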

Expected behavior

How can this be resolved? After this message appears, training hangs and then stops. Thanks for your help!

Others

No response

Ledzy commented 2 hours ago

Is this reproducible, or does it only happen occasionally?

66RomanReigns commented 13 minutes ago

Is this reproducible, or does it only happen occasionally?

It seems to happen when I use DeepSpeed. Have you encountered it before? Thanks!