shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing continued pre-training (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

When running Baichuan2-13B on multiple 3090 GPUs, the model does not seem to be distributed across the cards; GPU memory fills up immediately and training OOMs. How can I fix this? #297

Closed tuqingwen closed 6 months ago

tuqingwen commented 6 months ago

Describe the bug

Please provide a clear and concise description of what the bug is. If applicable, add screenshots to help explain your problem, especially for visualization related problems.

I have looked at similar issues reported by others in the tracker, but found no solution there.

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node 3 pretraining.py \
    --model_type baichuan \
    --model_name_or_path ./Baichuan2-13B-Chat \
    --train_file_dir ./data/pretrain_coal \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --use_peft True \
    --load_in_8bit True \
    --seed 42 \
    --num_train_epochs 4 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --output_dir outputs-pt-Baichuan2-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

(Screenshot: 2024-01-04 113347)

shibing624 commented 6 months ago
CUDA_VISIBLE_DEVICES=0,1,2 python pretraining.py \
    --model_type baichuan \
    ...
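For context on why the suggested change helps (my reading, not stated explicitly in the thread): torchrun --nproc_per_node 3 launches three DDP processes, and each one tries to hold a full copy of the 13B model, which does not fit in a single 24 GB 3090. Launching with plain python keeps a single process, so device_map auto (via Hugging Face accelerate) can shard the model's layers across all visible GPUs instead of replicating it. Below is a minimal, hedged sketch of that loading behavior using the standard transformers API; it is not the repository's pretraining.py, and the model path is just the local directory from the issue.

# Minimal sketch (assumption: standard transformers + accelerate behavior,
# not the repo's pretraining.py). With device_map="auto" the 13B model is
# split layer-by-layer across GPUs 0,1,2 instead of copied onto each card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./Baichuan2-13B-Chat"  # local path used in the issue

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,      # Baichuan ships custom modeling code
    torch_dtype=torch.bfloat16,  # matches --torch_dtype bfloat16
    load_in_8bit=True,           # matches --load_in_8bit True
    device_map="auto",           # accelerate shards layers over visible GPUs
)

# Shows which GPU each module landed on; under torchrun/DDP every process
# would instead try to place the whole model on its own single GPU.
print(model.hf_device_map)

Run such a single-process script with CUDA_VISIBLE_DEVICES=0,1,2 python ... and the printed device map should list entries spread over cuda:0, cuda:1, and cuda:2, confirming the model is actually sharded.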