microsoft / ProphetNet

A research project for natural language generation, containing the official implementations by the MSRA NLC team.
MIT License

CUDA OOM issue in finetuning #27

Open allhelllooz opened 3 years ago

allhelllooz commented 3 years ago

Hi, I am trying to fine-tune the 160GB-pretrained model on a custom dataset with the following command, but even with the smallest settings it goes out of memory after some update steps. I also removed --fp16, but I don't see any memory improvement.

I tried --max-source-positions of 512, 768, and 1024; --update-freq of 1, 2, 4, and 8; --batch-size of 1, 2, 4, and 8; and --fp16 both enabled and disabled. When I remove --tensorboard-logdir $TENSORBOARD_LOGDIR it works, but I can't go beyond a batch size of 2, so training is slow overall.

Are there any other settings needed for a multi-GPU run? I am wondering how this ran on 8 × NVIDIA V100 (16 GB) GPUs with the settings given in the README. Let me know.
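
For what it's worth, my understanding of how the batch size scales across GPUs (please correct me if this is wrong): the effective batch per update in fairseq is roughly the per-GPU batch (--max-tokens or --batch-size) times the number of GPUs times --update-freq, so fewer GPUs can in principle be compensated with a higher --update-freq at no extra GPU memory cost. A rough sketch with placeholder numbers (not the README's actual values):

```bash
# My rough mental model of fairseq batching (placeholder numbers, not the official recipe):
#   effective batch per update ≈ per-GPU batch × num_gpus × --update-freq
#
#   8 x V100: --max-tokens 1400 --update-freq 1  ->  8 * 1400 * 1 = 11200 tokens/update
#   4 x T4:   --max-tokens 1400 --update-freq 2  ->  4 * 1400 * 2 = 11200 tokens/update
#
# --update-freq accumulates gradients across forward/backward passes, so it
# should raise the effective batch size without raising per-GPU memory.
```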

OS: Ubuntu 16.04
CUDA: 10
Machine: 4 × T4 GPUs (16 GB each), AWS g4dn.12xlarge instance
Libraries: pytorch-transformers==1.2.0, torch==1.4.0, fairseq==0.9.0

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train --user-dir $USER_DIR --task translation_prophetnet \
    --arch $ARCH --optimizer adam --adam-betas '(0.9, 0.999)' --clip-norm 0.1 --lr 0.00001 --min-lr 1e-09 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 2048 --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --criterion $CRITERION --label-smoothing 0.1 --update-freq 1 --max-tokens 1024 \
    --num-workers 40 --load-from-pretrained-model $PRETRAINED_MODEL --ddp-backend=no_c10d --max-epoch 10 \
    --max-source-positions 768 --max-target-positions 256 --skip-invalid-size-inputs-valid-test --save-dir $SAVE_DIR \
    --keep-last-epochs 10 --tensorboard-logdir $TENSORBOARD_LOGDIR $DATA_DIR --skip-invalid-size-inputs-valid-test \
    --save-interval-updates 1000 --batch-size 1
```
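
For reference, here is the reduced-memory variant I am planning to try next. This is only a sketch using options I believe exist in fairseq 0.9.0 (--memory-efficient-fp16 in place of --fp16, a smaller --max-tokens, gradient accumulation via --update-freq, and --tensorboard-logdir left out since removing it already seemed to help); the values are illustrative, not tuned:

```bash
# Sketch only: same fine-tuning command with common fairseq memory-saving knobs.
# --memory-efficient-fp16 is fairseq's lower-memory FP16 mode (slightly slower than --fp16),
# --max-tokens is reduced, and --update-freq 4 accumulates gradients so the
# effective batch size stays comparable. Values are illustrative, not tuned.
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train $DATA_DIR \
    --user-dir $USER_DIR --task translation_prophetnet --arch $ARCH \
    --optimizer adam --adam-betas '(0.9, 0.999)' --clip-norm 0.1 \
    --lr 0.00001 --min-lr 1e-09 --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 --warmup-updates 2048 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --criterion $CRITERION --label-smoothing 0.1 \
    --memory-efficient-fp16 \
    --max-tokens 512 --update-freq 4 --batch-size 1 \
    --max-source-positions 512 --max-target-positions 256 \
    --load-from-pretrained-model $PRETRAINED_MODEL \
    --ddp-backend=no_c10d --max-epoch 10 \
    --skip-invalid-size-inputs-valid-test \
    --save-dir $SAVE_DIR --keep-last-epochs 10 \
    --save-interval-updates 1000
```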