wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

OOM: Ran out of memory with exception || Every time I restart pre-training from last checkpoint #41

Closed CosmoLuminous closed 2 years ago

CosmoLuminous commented 2 years ago

Hi,

I am training on a system that has a job time limit of 10 hours. Every time I restart pre-training from the last checkpoint, I get an OOM error, even though the previous run was working fine with the same configuration.

2022-07-23 15:15:37 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 2; 31.75 GiB total capacity; 28.47 GiB already allocated; 1.09 GiB free; 29.63 GiB reserved in total by PyTorch)

As a hack, I reduce MAX_TOKENS by 512 on every restart, and then it works. But I have now reached a point where I cannot reduce MAX_TOKENS any further, and I still need to train the model for more updates.
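For reference, my understanding of fairseq's batching (an assumption on my part, not something documented in this repo) is that the effective batch per update is roughly MAX_TOKENS × number of GPUs × UPDATE_FREQ, so reducing MAX_TOKENS shrinks the effective batch unless UPDATE_FREQ is raised to compensate. A minimal sketch against the values in my pretrain.sh below:

# Assumption: effective tokens per update ≈ MAX_TOKENS * num_GPUs * UPDATE_FREQ.
# Halving MAX_TOKENS while doubling UPDATE_FREQ (relative to MAX_TOKENS=2048,
# UPDATE_FREQ=60 below) keeps the effective batch the same but lowers peak
# per-GPU memory.
MAX_TOKENS=1024
UPDATE_FREQ=120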

I have also noticed that only one GPU goes OOM. From what I have read online, the cause may be Distributed Data Parallel, which loads all the data onto one GPU and then distributes the load to the rest of the GPUs. But I am not sure how to deal with it.
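To at least confirm which GPU spikes during the checkpoint restore, I watch per-GPU memory while the job starts up (plain nvidia-smi usage, nothing specific to PLBART or fairseq):

# Poll per-GPU memory once per second while fairseq restores the checkpoint.
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv'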

Resources:

Total GPUs = 8; Tesla V100-SXM2-32GB; 31.749 GB memory each.

My pretrain.sh is as follows:

MAX_UPDATE=100000
WARMUP_UPDATES=2000
MAX_SENTENCES=64
MAX_TOKENS=2048
TOKENS_PER_SAMPLE=512
UPDATE_FREQ=60

export CUDA_VISIBLE_DEVICES=$1
fairseq-train $DATA_DIR \
    --add-lang-token \
    --langs $langs \
    --dataset-impl 'mmap' \
    --bpe 'sentencepiece' \
    --sentencepiece-model $SPM_MODEL \
    --arch mbart_base \
    --tokens-per-sample $TOKENS_PER_SAMPLE \
    --max-tokens $MAX_TOKENS \
    --max-sentences $MAX_SENTENCES \
    --update-freq $UPDATE_FREQ \
    --layernorm-embedding \
    --multilang-sampling-alpha 0.3 \
    --train-subset train \
    --valid-subset valid \
    --required-batch-size-multiple 8 \
    --insert 0 \
    --permute-sentences 0 \
    --poisson-lambda 3.5 \
    --mask 0.3 \
    --mask-length 'span-poisson' \
    --replace-length 1 \
    --rotate 0 \
    --mask-random 0.1 \
    --task multilingual_denoising \
    --criterion cross_entropy \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --relu-dropout 0.0 \
    --weight-decay 0.01 \
    --optimizer adam \
    --adam-eps 1e-06 \
    --clip-norm 0.1 \
    --lr 3e-4 \
    --lr-scheduler polynomial_decay \
    --warmup-updates $WARMUP_UPDATES \
    --total-num-update $MAX_UPDATE \
    --max-update $MAX_UPDATE \
    --fp16 \
    --ddp-backend no_c10d \
    --no-epoch-checkpoints \
    --save-interval-updates 1000 \
    --keep-interval-updates 10 \
    --save-dir $SAVE_DIR \
    --skip-invalid-size-inputs-valid-test \
    --log-format json \
    --log-interval 10 \
    --num-workers 40 \
    --seed 1234 \
    --keep-last-epochs 20 \
    --patience 24 \
    --restore-file $SAVE_DIR/checkpoint_last.pt \
    --tensorboard-logdir $TENSORBOARD_LOGDIR \
    2>&1 | tee $SAVE_DIR/output.log

Please suggest how to deal with it.

wasiahmad commented 2 years ago

We have rarely run into this issue, so we have not investigated it in depth. We believe this is a Fairseq issue.