Open baibaiw5 opened 1 year ago
Title: [FEATURE]: Does Gemini Strategy cpu placement support llama 7b(2048) reward training on single A100?
I have changed the following code to use CPU placement, and the machine has 1 TB of CPU memory, but a GPU OOM still occurs: `strategy = GeminiStrategy(placement_policy='cpu')`
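For context, this is a minimal sketch of the configuration in question; the import path is assumed from the ColossalAI-Chat (coati) examples around colossalai 0.3.1 and may differ in other versions:

```python
# Assumed import path from ColossalAI-Chat's training scripts (illustrative).
from coati.trainer.strategies import GeminiStrategy

# placement_policy='cpu' offloads parameters and optimizer states to CPU
# memory, but forward-pass activations still live on the GPU, so activation
# memory is unaffected by this setting.
strategy = GeminiStrategy(placement_policy='cpu')
```

Note the design implication: CPU placement only helps with model-state memory, which is why a large CPU pool (1 TB here) does not prevent activation-driven OOMs.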
Describe the feature
Hi, I use the following packages:
- colossalai 0.3.1
- torch 2.0.1
- transformers 4.28.1
And the following command to run llama-7b on an A100 (80 GB):
```shell
torchrun --standalone --nproc_per_node=1 train_reward_model.py \
    --strategy colossalai_gemini_cpu \
    --model llama \
    --pretrain /data/checkpoints/share_gpt_7b/checkpoint-1300-fp16 \
    --dataset /data/projects/DeepSpeedExamples/applications/DeepSpeed-Chat \
    --save_path /data/checkpoints/colossal_llama7b_rm_ckpt \
    --max_epochs 10 \
    --batch_size 1 \
    --max_len 2048 \
    --lora_rank 0 \
    --loss_fn 'log_sig'
```
An OOM error occurs:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 79.20 GiB total capacity; 76.06 GiB already allocated; 201.31 MiB free; 77.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
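The error message itself suggests trying `max_split_size_mb` when reserved memory far exceeds allocated memory. A minimal way to apply that hint (the value 128 is an illustrative choice, not a recommendation from the traceback):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before torch initializes CUDA,
# so do this at the very top of the training script (or export it in
# the shell before running torchrun).
# max_split_size_mb caps the size of blocks the caching allocator will
# split, which can reduce fragmentation when reserved >> allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This mitigates fragmentation only; it will not help if the workload genuinely needs more memory than the GPU has.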
If I lower `max_len` to 1000, it runs fine.
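That gap between 1000 and 2048 is consistent with attention activations growing quadratically in sequence length. A back-of-envelope estimate, assuming LLaMA-7B's usual shape (32 layers, 32 heads) and fp16 activations without FlashAttention or gradient checkpointing:

```python
# Rough estimate of the attention score tensors alone, which scale with
# seq_len**2. Assumed LLaMA-7B shape: 32 layers, 32 heads; fp16 = 2 bytes.
LAYERS, HEADS, BYTES_PER_ELEM = 32, 32, 2

def attn_score_bytes(seq_len: int, batch: int = 1) -> int:
    # One [batch, heads, seq_len, seq_len] score matrix per layer.
    return batch * LAYERS * HEADS * seq_len * seq_len * BYTES_PER_ELEM

GIB = 1024 ** 3
print(f"seq 1000: {attn_score_bytes(1000) / GIB:.1f} GiB")  # ~1.9 GiB
print(f"seq 2048: {attn_score_bytes(2048) / GIB:.1f} GiB")  # ~8.0 GiB
```

Going from 1000 to 2048 roughly quadruples this component (about 4.2x), and these tensors stay on the GPU regardless of `placement_policy='cpu'`, which would explain why CPU placement does not avoid the OOM at `max_len=2048`.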