microsoft / unilm


RuntimeError: CUDA error: out of memory #814

Open ZouRuia opened 2 years ago

ZouRuia commented 2 years ago

When I run the following command, I get an error:

bash examples/train_iwslt14.sh /u01/zourui/unilm/deltalm/tmp/iwslt14/iwslt14.bin /u01/zourui/unilm/deltalm/tmp/iwslt14/checkpoints /u01/zourui/unilm/deltalm/checkpoint/deltalm-base.pt

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/u01/zourui/anaconda3/envs/translation/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/u01/zourui/unilm/deltalm/fairseq/fairseq/distributed/utils.py", line 324, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/u01/zourui/unilm/deltalm/fairseq/fairseq/distributed/utils.py", line 276, in distributed_init
    dist.all_reduce(torch.zeros(1).cuda())
RuntimeError: CUDA error: out of memory

ZouRuia commented 2 years ago

set -ex

data_bin=$1
save_dir=$2
PRETRAINED_MODEL=$3

python train.py $data_bin \
    --save-dir $save_dir \
    --arch deltalm_base \
    --pretrained-deltalm-checkpoint $PRETRAINED_MODEL \
    --share-all-embeddings \
    --max-source-positions 128 --max-target-positions 128 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --lr 1e-4 \
    --warmup-init-lr 1e-07 \
    --stop-min-lr 1e-09 \
    --warmup-updates 4000 \
    --max-update 4000 \
    --max-epoch 10 \
    --batch-size 1 \
    --update-freq 1 \
    --seed 1 \
    --log-format simple \
    --skip-invalid-size-inputs-valid-test \
    --fp16 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe=sentencepiece \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric

ZouRuia commented 2 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:33:00.0 Off |                  N/A |
| 51%   40C    P2   129W / 350W |  16627MiB / 24576MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:34:00.0 Off |                  N/A |
| 50%   38C    P2   124W / 350W |  23000MiB / 24576MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:35:00.0 Off |                  N/A |
| 51%   31C    P2   110W / 350W |  24261MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:36:00.0 Off |                  N/A |
| 54%   41C    P2   128W / 350W |  15450MiB / 24576MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:37:00.0 Off |                  N/A |
| 48%   41C    P2   147W / 350W |  21509MiB / 24576MiB |     47%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 52%   45C    P2   150W / 350W |  16623MiB / 24576MiB |     50%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:B4:00.0 Off |                  N/A |
| 50%   40C    P2   150W / 350W |  11045MiB / 24576MiB |     42%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:B5:00.0 Off |                  N/A |
| 50%   41C    P2   157W / 350W |  11045MiB / 24576MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   8  NVIDIA GeForce ...  Off  | 00000000:B6:00.0 Off |                  N/A |
| 51%   41C    P2   159W / 350W |  11045MiB / 24576MiB |     37%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   9  NVIDIA GeForce ...  Off  | 00000000:B7:00.0 Off |                  N/A |
| 50%   17C    P8    18W / 350W |  12189MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |

I am using GPU 9, but I still get this error.
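
One possible reading of the output above, offered as an assumption rather than a confirmed diagnosis: the traceback names Process 2, and GPU 2 reports 24261MiB / 24576MiB in use, so the run may have started a worker on every visible GPU instead of only on GPU 9. A common way to restrict a process to specific devices is the CUDA_VISIBLE_DEVICES environment variable. The sketch below picks GPUs by free memory from the figures posted in this thread; the helper names and the 12000 MiB threshold are hypothetical and only serve the illustration.

```python
import os

# (used MiB, total MiB) per GPU index, copied from the nvidia-smi output in this thread.
GPU_MEMORY_MIB = {
    0: (16627, 24576), 1: (23000, 24576), 2: (24261, 24576),
    3: (15450, 24576), 4: (21509, 24576), 5: (16623, 24576),
    6: (11045, 24576), 7: (11045, 24576), 8: (11045, 24576),
    9: (12189, 24576),
}


def pick_gpus(memory, min_free_mib):
    """Return GPU indices with at least min_free_mib free, most free first.

    Hypothetical helper: the threshold is an assumption, not a value
    taken from the issue.
    """
    free = {gpu: total - used for gpu, (used, total) in memory.items()}
    candidates = [gpu for gpu, mib in free.items() if mib >= min_free_mib]
    return sorted(candidates, key=lambda gpu: free[gpu], reverse=True)


def visible_devices_env(gpus):
    """Copy the current environment and limit CUDA to the given GPU indices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in gpus)
    return env


print(pick_gpus(GPU_MEMORY_MIB, 12000))                  # [6, 7, 8, 9]
print(visible_devices_env([9])["CUDA_VISIBLE_DEVICES"])  # 9
```

The returned environment could be passed to subprocess.run(..., env=env) when launching the training script. The variable has to be set before the process initializes CUDA, which is why the sketch builds a fresh environment instead of changing anything after startup.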

RobertBoganKang commented 2 years ago

After a lot of trial and error, I found the answer, based on https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification. The feature extraction should be wrapped in a with torch.no_grad(): block.

For example:

with torch.no_grad():
    feature = <<extract_feature_model>>(audio)
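
A minimal runnable sketch of the same idea follows. The small nn.Sequential model and the random input are hypothetical stand-ins, since the thread does not name a concrete extractor. It only shows that outputs produced under torch.no_grad() carry no autograd graph, which is what lets intermediate activations be freed. This applies to feature extraction or inference where gradients are not needed, not to the parameters being trained.

```python
import torch
from torch import nn

# Hypothetical stand-in for the feature extractor (not a model from the thread).
extractor = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
extractor.eval()  # evaluation behavior for layers such as dropout or batch norm

# Dummy batch standing in for real audio input.
audio = torch.randn(4, 16)

# With autograd enabled, the output is attached to a graph that keeps
# intermediate activations alive for a possible backward pass.
feature_with_grad = extractor(audio)

# Under no_grad, no graph is recorded for the forward pass.
with torch.no_grad():
    feature = extractor(audio)

print(feature_with_grad.requires_grad)  # True
print(feature.requires_grad)            # False
```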