Full-parameter training in float32 needs 8x A100, i.e. 320 GB of GPU memory in total.
Change the command you run to: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python pretraining.py
Cause of the error: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node 8 pretraining.py is data parallelism; every GPU loads a full copy of the parameters, so each card runs out of memory.
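For intuition, here is a rough back-of-the-envelope estimate (illustrative figures, not measured from this repo) of why data parallelism cannot fit: under DDP every rank keeps its own full copy of the weights, gradients, and Adam optimizer state.

```python
# Rough per-GPU memory needed by DDP for full-parameter fp32 training
# of a 7B model with Adam (weights + gradients + two optimizer moments):
params = 7e9
fp32 = 4  # bytes per fp32 value

weights  = params * fp32      # ~26 GiB
grads    = params * fp32      # ~26 GiB
adam_m_v = 2 * params * fp32  # ~52 GiB for the m and v moments

per_gpu_gib = (weights + grads + adam_m_v) / 2**30
print(f"~{per_gpu_gib:.0f} GiB per GPU before activations")  # ~104 GiB
```

A 3090 has about 24 GiB, so even the model.float() call alone (~26 GiB of weights per rank) overflows, which matches the traceback below.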
Thank you very much for the reply! LLaMA is also a 7B model, yet it only needs 120 GB of GPU memory. Is the gap really that large?
Describe the bug
I am doing full-parameter pretraining of baichuan-7b on a single machine with 8x 3090, but it keeps running out of GPU memory. The error is as follows:

Traceback (most recent call last):
  File "pretraining.py", line 780, in <module>
    main()
  File "pretraining.py", line 706, in main
    model = model.float()
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2576, in float
    return super().float(*args)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 979, in float
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 979, in <lambda>
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 4; 23.70 GiB total capacity; 22.16 GiB already allocated; 166.56 MiB free; 22.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
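For reference, the max_split_size_mb workaround the message mentions is configured through the PYTORCH_CUDA_ALLOC_CONF environment variable; a minimal sketch follows (the 128 MB split size is an arbitrary example value, not a recommendation from this repo):

```python
import os

# Must be set before the first CUDA allocation; the caching allocator
# reads this variable lazily on first use.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
x = torch.zeros(1, device="cuda")  # allocator is now configured
```

Note this only reduces fragmentation; it cannot make a training state of roughly 100 GiB per rank fit on a 24 GiB card.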
What is causing this? And are the parameters I set correct? They are as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node 8 pretraining.py \
--model_type baichuan \
--model_name_or_path Baichuan/Baichuan-7B \
--train_file_dir ./data/pretrain \
--validation_file_dir ./data/pretrain \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--use_peft False \
--seed 42 \
--max_train_samples 20000 \
--max_eval_samples 10 \
--num_train_epochs 1 \
--learning_rate 2e-4 \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 13 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 10 \
--block_size 512 \
--group_by_length True \
--output_dir outputs-pt-bloom-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype bfloat16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True \
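For completeness, here is a minimal sketch (assuming transformers with accelerate installed; this is not the repo's pretraining.py) of the loading behavior that the suggested single-process python launch relies on: with --device_map auto the model is sharded layer by layer across all visible GPUs, whereas torchrun spawns eight processes that each try to hold a full replica.

```python
from transformers import AutoModelForCausalLM

# With device_map="auto", accelerate places successive layers on successive
# GPUs (naive model parallelism), so the eight cards hold one sharded copy
# of the model instead of eight full replicas.
model = AutoModelForCausalLM.from_pretrained(
    "Baichuan/Baichuan-7B",   # model_name_or_path from the command above
    trust_remote_code=True,   # Baichuan-7B ships custom modeling code
    device_map="auto",
)
print(model.hf_device_map)    # shows which GPU each module landed on
```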