shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing continued pre-training (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

When running Baichuan2-13B on multiple 3090 GPUs, the model does not seem to be distributed across the cards; GPU memory fills up immediately and training OOMs. How can I fix this? #297

Closed tuqingwen closed 6 months ago

tuqingwen commented 6 months ago

Describe the bug

Please provide a clear and concise description of what the bug is. If applicable, add screenshots to help explain your problem, especially for visualization related problems.

I have looked at similar issues reported by others in the tracker, but found no solution there.

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node 3 pretraining.py \
    --model_type baichuan \
    --model_name_or_path ./Baichuan2-13B-Chat \
    --train_file_dir ./data/pretrain_coal \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --use_peft True \
    --load_in_8bit True \
    --seed 42 \
    --num_train_epochs 4 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --output_dir outputs-pt-Baichuan2-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

(Screenshot: 2024-01-04 113347)

shibing624 commented 6 months ago
CUDA_VISIBLE_DEVICES=0,1,2 python pretraining.py \
    --model_type baichuan \
    ...
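For context on why the suggested change helps (my reading, not stated explicitly in the thread): torchrun --nproc_per_node 3 launches three DDP processes, and each one tries to hold a full copy of the 13B model, which does not fit in a single 24 GB 3090. Launching with plain python keeps a single process, so device_map auto (via Hugging Face accelerate) can shard the model's layers across all visible GPUs instead of replicating it. Below is a minimal, hedged sketch of that loading behavior using the standard transformers API; it is not the repository's pretraining.py, and the model path is just the local directory from the issue.

# Minimal sketch (assumption: standard transformers + accelerate behavior,
# not the repo's pretraining.py). With device_map="auto" the 13B model is
# split layer-by-layer across GPUs 0,1,2 instead of copied onto each card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./Baichuan2-13B-Chat"  # local path used in the issue

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,      # Baichuan ships custom modeling code
    torch_dtype=torch.bfloat16,  # matches --torch_dtype bfloat16
    load_in_8bit=True,           # matches --load_in_8bit True
    device_map="auto",           # accelerate shards layers over visible GPUs
)

# Shows which GPU each module landed on; under torchrun/DDP every process
# would instead try to place the whole model on its own single GPU.
print(model.hf_device_map)

Run such a single-process script with CUDA_VISIBLE_DEVICES=0,1,2 python ... and the printed device map should list entries spread over cuda:0, cuda:1, and cuda:2, confirming the model is actually sharded.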