wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

OOM: Ran out of memory with exception || Every time I restart pre-training from last checkpoint #41

Closed CosmoLuminous closed 2 years ago

CosmoLuminous commented 2 years ago

Hi,

I am training on a system that has a job time limit of 10 hours. Every time I restart pre-training from the last checkpoint, I get an OOM error, even though the previous run was working fine with the same configuration.

2022-07-23 15:15:37 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 2; 31.75 GiB total capacity; 28.47 GiB already allocated; 1.09 GiB free; 29.63 GiB reserved in total by PyTorch)

As a hack, I reduce MAX_TOKENS by 512 on every restart, and then it works. But I have now reached a point where I cannot reduce MAX_TOKENS any further, and I still need to train the model for more updates.
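For reference, my understanding of fairseq's batching (an assumption on my part, not something documented in this repo) is that the effective batch per update is roughly MAX_TOKENS × number of GPUs × UPDATE_FREQ, so reducing MAX_TOKENS shrinks the effective batch unless UPDATE_FREQ is raised to compensate. A minimal sketch against the values in my pretrain.sh below:

# Assumption: effective tokens per update ≈ MAX_TOKENS * num_GPUs * UPDATE_FREQ.
# Halving MAX_TOKENS while doubling UPDATE_FREQ (relative to MAX_TOKENS=2048,
# UPDATE_FREQ=60 below) keeps the effective batch the same but lowers peak
# per-GPU memory.
MAX_TOKENS=1024
UPDATE_FREQ=120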

I have also noticed that only one GPU goes OOM. From what I have read online, the cause may be Distributed Data Parallel, which loads all the data onto one GPU and then distributes the load to the rest of the GPUs. But I am not sure how to deal with it.
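To at least confirm which GPU spikes during the checkpoint restore, I watch per-GPU memory while the job starts up (plain nvidia-smi usage, nothing specific to PLBART or fairseq):

# Poll per-GPU memory once per second while fairseq restores the checkpoint.
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv'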

Resources:

Total GPUs = 8; Tesla V100-SXM2-32GB; 31.749 GB memory each.

My pretrain.sh is as follows:

MAX_UPDATE=100000
WARMUP_UPDATES=2000
MAX_SENTENCES=64
MAX_TOKENS=2048
TOKENS_PER_SAMPLE=512
UPDATE_FREQ=60

export CUDA_VISIBLE_DEVICES=$1
fairseq-train $DATA_DIR \
    --add-lang-token \
    --langs $langs \
    --dataset-impl 'mmap' \
    --bpe 'sentencepiece' \
    --sentencepiece-model $SPM_MODEL \
    --arch mbart_base \
    --tokens-per-sample $TOKENS_PER_SAMPLE \
    --max-tokens $MAX_TOKENS \
    --max-sentences $MAX_SENTENCES \
    --update-freq $UPDATE_FREQ \
    --layernorm-embedding \
    --multilang-sampling-alpha 0.3 \
    --train-subset train \
    --valid-subset valid \
    --required-batch-size-multiple 8 \
    --insert 0 \
    --permute-sentences 0 \
    --poisson-lambda 3.5 \
    --mask 0.3 \
    --mask-length 'span-poisson' \
    --replace-length 1 \
    --rotate 0 \
    --mask-random 0.1 \
    --task multilingual_denoising \
    --criterion cross_entropy \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --relu-dropout 0.0 \
    --weight-decay 0.01 \
    --optimizer adam \
    --adam-eps 1e-06 \
    --clip-norm 0.1 \
    --lr 3e-4 \
    --lr-scheduler polynomial_decay \
    --warmup-updates $WARMUP_UPDATES \
    --total-num-update $MAX_UPDATE \
    --max-update $MAX_UPDATE \
    --fp16 \
    --ddp-backend no_c10d \
    --no-epoch-checkpoints \
    --save-interval-updates 1000 \
    --keep-interval-updates 10 \
    --save-dir $SAVE_DIR \
    --skip-invalid-size-inputs-valid-test \
    --log-format json \
    --log-interval 10 \
    --num-workers 40 \
    --seed 1234 \
    --keep-last-epochs 20 \
    --patience 24 \
    --restore-file $SAVE_DIR/checkpoint_last.pt \
    --tensorboard-logdir $TENSORBOARD_LOGDIR \
    2>&1 | tee $SAVE_DIR/output.log

Please suggest how to deal with it.

wasiahmad commented 2 years ago

We have rarely run into this issue, so we have not investigated it in depth. We believe this is a Fairseq issue.