hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
30.26k stars 3.73k forks

Questions about loading checkpoints #3857

Closed sunfan1997 closed 3 months ago

sunfan1997 commented 3 months ago

Reminder

Reproduction

1. A checkpoint was trained on 2×4090 with per_device_train_batch_size=2 and gradient_accumulation_steps=8. After moving to a 4×A100 40G machine with per_device_train_batch_size=8 and gradient_accumulation_steps=8, the steps per epoch dropped from 13000+ to only 6000+. Why? That isn't even a linear relationship; it should have shrunk to 1/8.
2. When the adapter trained on the A100 machine is loaded back onto the 4090 for training, it raises: TypeError: TrainerState.__init__() got an unexpected keyword argument 'stateful_callbacks'.
3. I'm doing LoRA SFT fine-tuning with int4 quantization. Is there any other way to speed it up? On 4×A100 40G with per_device_train_batch_size=8, each card uses 20+ GB; raising it to 10 can OOM.
Thanks in advance for your reply.

Training script:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    /home/tom/fssd/LLaMA-Factory/src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /home/tom/fssd/Baichuan2-Chat-v2 \
    --dataset juhe_self_config,CrimeKgAssitant_92k,legal_advice,legal_counsel_v2,CrimeKgAssitant_52k,zixun_gpt4,lawzhidao_filter,CAIL2018_EC_sentence_pred,DISC-Law-SFT-Pair-judgmentPred,CAIL2022_eventDet,prac_prob,DISC-Law-SFT-Triplet-released,DISC-Law-SFT-Pair,alpaca_gpt4_zh \
    --dataset_dir /home/tom/fssd/LLaMA-Factory/data \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir /home/tom/fssd/LLaMA-Factory/saves/Baichuan2-13B-Chat/lora/train_2024-05-17 \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 2000 \
    --eval_steps 2000 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 1e-4 \
    --num_train_epochs 6 \
    --max_samples 100000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --repetition_penalty 1.2 \
    --plot_loss \
    --quantization_bit 4 \
    --fp16
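Regarding question 1: assuming the Hugging Face Trainer under DDP (which LLaMA-Factory uses), each optimizer step consumes per_device_train_batch_size × num_gpus × gradient_accumulation_steps samples, so steps per epoch should scale with the inverse of that product. A minimal sketch of the arithmetic (the function name and sample counts are illustrative, not from the source):

```python
import math

def steps_per_epoch(num_samples, per_device_bs, num_gpus, grad_accum):
    """Approximate HF Trainer optimizer steps per epoch under DDP.

    Each GPU gets a dataloader shard of ~num_samples / num_gpus samples;
    gradient accumulation then folds grad_accum micro-batches into one step.
    """
    batches_per_epoch = math.ceil(num_samples / (per_device_bs * num_gpus))
    return math.ceil(batches_per_epoch / grad_accum)

# Old setup: 2x4090, bs=2, accum=8 -> effective batch 2*2*8 = 32
# New setup: 4xA100, bs=8, accum=8 -> effective batch 8*4*8 = 256 (8x larger)
# So steps per epoch should indeed drop to ~1/8 for the same dataset; a
# smaller drop usually means the dataset, cutoff_len, or accumulation
# setting also changed between runs (note the script above uses
# gradient_accumulation_steps=1, not 8).
```

If the observed 13000+ → 6000+ change doesn't match this formula, comparing the actual effective batch sizes of the two runs (as logged by the Trainer at startup) is the first thing to check.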

Expected behavior

No response

System Info

No response

Others

No response

hiyouga commented 3 months ago

Acceleration is currently not supported for Baichuan2 models; please use Qwen1.5 instead.