stanford-crfm / BioMedLM


torch.distributed.launch on eight 40G A100, CUDA out of memory. #26

Open zhengbiqing opened 10 months ago

zhengbiqing commented 10 months ago

I run:

```shell
export CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7'
task=gene
datadir=data/$task
outdir=runs/$task/GPT2
name=gene0913
checkpoint=/root/siton-glusterfs-eaxtsxdfs/xts/data/BioMedLM

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --use_env run_seqcls_gpt.py \
    --tokenizer_name $checkpoint --model_name_or_path $checkpoint \
    --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json \
    --do_train --do_eval --do_predict \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
    --learning_rate 2e-6 --warmup_ratio 0.5 --num_train_epochs 5 --max_seq_length 32 \
    --logging_steps 1 --save_strategy no --evaluation_strategy no \
    --output_dir $outdir --overwrite_output_dir --bf16 --seed 1000 --run_name $name
```

(Note: the original command had `--run_name %name`, which looks like a typo for `$name`.)

but I still get CUDA out of memory. Does anyone know how many GPUs are needed to fine-tune seqcls?
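For reference, a back-of-the-envelope estimate suggests why this OOMs even at batch size 1. Under plain DDP (which `torch.distributed.launch` gives you), every GPU holds a full replica of the model plus optimizer state. Assuming standard mixed-precision Adam (bf16 weights and gradients, fp32 master weights and two fp32 moment estimates, roughly 16 bytes per parameter) and BioMedLM's 2.7B parameters:

```python
# Rough per-GPU memory estimate for naive DDP fine-tuning of BioMedLM (2.7B params)
# with mixed-precision Adam. Activations are ignored, so this is a lower bound.
# The 16 bytes/param breakdown is an assumption about the training setup, not
# something reported in this issue.
params = 2.7e9
bytes_per_param = (
    2    # bf16 weights
    + 2  # bf16 gradients
    + 4  # fp32 master weights
    + 8  # fp32 Adam moments (m and v)
)
total_gb = params * bytes_per_param / 1024**3
print(f"~{total_gb:.0f} GB per GPU before activations")  # ~40 GB
```

That already saturates a 40G A100 before any activation memory, regardless of how many GPUs you add, since DDP replicates rather than shards the state. This points toward sharding the optimizer state and/or parameters across the 8 GPUs (e.g. DeepSpeed ZeRO or PyTorch FSDP) instead of plain DDP.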