hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: On eight A100 cards, testing 'examples/language/llama2' with the 'gemini_auto' plugin results in an 'out of memory' error #5030

Open chensimian opened 10 months ago

chensimian commented 10 months ago

🐛 Describe the bug

Here is my script. It runs with the hybrid_parallel plugin, but every other plugin fails with the same "out of memory" error:

    torchrun --standalone --nproc_per_node 8 finetune.py \
        --plugin "gemini_auto" \
        --dataset "self_instruct" \
        --model_path "Llama2-Chinese-7b-Chat" \
        --task_name "finetuning" \
        --batch_size 2 \
        --save_dir "output_test"

Environment

    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 4; 79.21 GiB total capacity; 75.40 GiB already allocated; 1.74 GiB free; 76.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
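The traceback itself suggests one lever: reducing allocator fragmentation via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of that mitigation follows; the 128 MiB cap is an illustrative value, not something recommended in this issue.

```python
# Sketch: cap how large a cached block the CUDA caching allocator may split,
# which can help when reserved memory is much larger than allocated memory.
# The variable must be set before CUDA is initialized, i.e. before the first
# CUDA call in the process; 128 is an arbitrary example value.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the allocator config is in place
```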

flybird11111 commented 9 months ago

Hi, how about trying to set offload_optim_frac and offload_param_frac to 1.0?
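For reference, a minimal sketch of where those knobs live, assuming a ColossalAI release whose `GeminiPlugin` exposes `offload_optim_frac` and `offload_param_frac` (they take effect with the "static" placement policy, whereas "gemini_auto" corresponds to `placement_policy="auto"`, which manages placement dynamically). The surrounding launch/boost calls are illustrative only:

```python
# Sketch only: full CPU offload of parameters and optimizer states via Gemini.
# Assumes a ColossalAI version where GeminiPlugin accepts these fractions.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

colossalai.launch_from_torch()  # older releases required config={}

plugin = GeminiPlugin(
    placement_policy="static",  # the fractions below apply to the static policy
    offload_optim_frac=1.0,     # keep all optimizer states in CPU memory
    offload_param_frac=1.0,     # keep all parameters in CPU memory
)
booster = Booster(plugin=plugin)
# model, optimizer, _, dataloader, _ = booster.boost(
#     model, optimizer, dataloader=dataloader
# )
```

Trading GPU memory for host memory and PCIe traffic this way slows each step, but it is a common way to get a 7B-parameter fine-tune to fit when the GPU-resident configuration runs out of memory.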