Full-parameter training in float32 needs 8x A100, i.e. 320 GB of GPU memory in total.
Change the command you run to: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python pretraining.py
Cause of the error: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node 8 pretraining.py is data parallelism; every GPU loads a full copy of the parameters, so each card runs out of memory.
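For intuition, here is a rough back-of-the-envelope estimate (illustrative figures, not measured from this repo) of why data parallelism cannot fit: under DDP every rank keeps its own full copy of the weights, gradients, and Adam optimizer state.

```python
# Rough per-GPU memory needed by DDP for full-parameter fp32 training
# of a 7B model with Adam (weights + gradients + two optimizer moments):
params = 7e9
fp32 = 4  # bytes per fp32 value

weights  = params * fp32      # ~26 GiB
grads    = params * fp32      # ~26 GiB
adam_m_v = 2 * params * fp32  # ~52 GiB for the m and v moments

per_gpu_gib = (weights + grads + adam_m_v) / 2**30
print(f"~{per_gpu_gib:.0f} GiB per GPU before activations")  # ~104 GiB
```

A 3090 has about 24 GiB, so even the model.float() call alone (~26 GiB of weights per rank) overflows, which matches the traceback below.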
Thank you very much for the reply! LLaMA is also a 7B model, yet it only needs 120 GB of GPU memory. Is the gap really that large?
Describe the bug
I am doing full-parameter pretraining of baichuan-7b on a single machine with 8x 3090, but it keeps running out of GPU memory. The error is as follows:

Traceback (most recent call last):
  File "pretraining.py", line 780, in <module>
    main()
  File "pretraining.py", line 706, in main
    model = model.float()
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2576, in float
    return super().float(*args)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 979, in float
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 979, in <lambda>
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 4; 23.70 GiB total capacity; 22.16 GiB already allocated; 166.56 MiB free; 22.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
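For reference, the max_split_size_mb workaround the message mentions is configured through the PYTORCH_CUDA_ALLOC_CONF environment variable; a minimal sketch follows (the 128 MB split size is an arbitrary example value, not a recommendation from this repo):

```python
import os

# Must be set before the first CUDA allocation; the caching allocator
# reads this variable lazily on first use.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
x = torch.zeros(1, device="cuda")  # allocator is now configured
```

Note this only reduces fragmentation; it cannot make a training state of roughly 100 GiB per rank fit on a 24 GiB card.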
What is causing this? And are the parameters I set correct? They are as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node 8 pretraining.py \
--model_type baichuan \
--model_name_or_path Baichuan/Baichuan-7B \
--train_file_dir ./data/pretrain \
--validation_file_dir ./data/pretrain \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--use_peft False \
--seed 42 \
--max_train_samples 20000 \
--max_eval_samples 10 \
--num_train_epochs 1 \
--learning_rate 2e-4 \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 13 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 10 \
--block_size 512 \
--group_by_length True \
--output_dir outputs-pt-bloom-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype bfloat16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True \
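For completeness, here is a minimal sketch (assuming transformers with accelerate installed; this is not the repo's pretraining.py) of the loading behavior that the suggested single-process python launch relies on: with --device_map auto the model is sharded layer by layer across all visible GPUs, whereas torchrun spawns eight processes that each try to hold a full replica.

```python
from transformers import AutoModelForCausalLM

# With device_map="auto", accelerate places successive layers on successive
# GPUs (naive model parallelism), so the eight cards hold one sharded copy
# of the model instead of eight full replicas.
model = AutoModelForCausalLM.from_pretrained(
    "Baichuan/Baichuan-7B",   # model_name_or_path from the command above
    trust_remote_code=True,   # Baichuan-7B ships custom modeling code
    device_map="auto",
)
print(model.hf_device_map)    # shows which GPU each module landed on
```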