RuntimeError: CUDA error: out of memory

ZouRuia commented 2 years ago

When I run bash examples/train_iwslt14.sh /u01/zourui/unilm/deltalm/tmp/iwslt14/iwslt14.bin /u01/zourui/unilm/deltalm/tmp/iwslt14/checkpoints /u01/zourui/unilm/deltalm/checkpoint/deltalm-base.pt, I get the following error.

data_bin=/u01/zourui/unilm/deltalm/tmp/iwslt14/iwslt14.bin
save_dir=/u01/zourui/unilm/deltalm/tmp/iwslt14/checkpoints
PRETRAINED_MODEL=/u01/zourui/unilm/deltalm/checkpoint/deltalm-base.pt
python train.py /u01/zourui/unilm/deltalm/tmp/iwslt14/iwslt14.bin --save-dir /u01/zourui/unilm/deltalm/tmp/iwslt14/checkpoints --arch deltalm_base --pretrained-deltalm-checkpoint /u01/zourui/unilm/deltalm/checkpoint/deltalm-base.pt --share-all-embeddings --max-source-positions 128 --max-target-positions 128 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --optimizer adam --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 1e-4 --warmup-init-lr 1e-07 --stop-min-lr 1e-09 --warmup-updates 4000 --max-update 4000 --max-epoch 10 --batch-size 1 --update-freq 1 --seed 1 --log-format simple --skip-invalid-size-inputs-valid-test --fp16 --eval-bleu --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' --eval-bleu-detok moses --eval-bleu-remove-bpe=sentencepiece --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric
2022-08-03 09:44:46 | INFO | fairseq.distributed.utils | distributed init (rank 8): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 3): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 6): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 2): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 0): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 5): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 5
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 9): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 9
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 7): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 7
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 4): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 4
2022-08-03 09:44:47 | INFO | fairseq.distributed.utils | distributed init (rank 1): tcp://localhost:15738
2022-08-03 09:44:47 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
2022-08-03 09:44:47 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 8
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 3
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 6
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 0
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 4
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 8: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 1
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 5
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 8
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 6
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 9: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 9
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 7
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 10 nodes.
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 2
2022-08-03 09:44:48 | INFO | fairseq.distributed.utils | initialized host ubuntu-65 as rank 3
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    cli_main()
  File "/u01/zourui/unilm/deltalm/fairseq/fairseq_cli/train.py", line 509, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/u01/zourui/unilm/deltalm/fairseq/fairseq/distributed/utils.py", line 346, in call_main
    torch.multiprocessing.spawn(
  File "/u01/zourui/anaconda3/envs/translation/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/u01/zourui/anaconda3/envs/translation/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/u01/zourui/anaconda3/envs/translation/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/u01/zourui/anaconda3/envs/translation/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/u01/zourui/unilm/deltalm/fairseq/fairseq/distributed/utils.py", line 324, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/u01/zourui/unilm/deltalm/fairseq/fairseq/distributed/utils.py", line 276, in distributed_init
    dist.all_reduce(torch.zeros(1).cuda())
RuntimeError: CUDA error: out of memory

Python 3.8
torch 1.12.0+cu116
torchaudio 0.12.0+cu116
torchvision 0.13.0+cu116

How can I solve it?

ZouRuia commented 2 years ago

set -ex

data_bin=$1
save_dir=$2
PRETRAINED_MODEL=$3

python train.py $data_bin \
    --save-dir $save_dir \
    --arch deltalm_base \
    --pretrained-deltalm-checkpoint $PRETRAINED_MODEL \
    --share-all-embeddings \
    --max-source-positions 128 --max-target-positions 128 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --lr 1e-4 \
    --warmup-init-lr 1e-07 \
    --stop-min-lr 1e-09 \
    --warmup-updates 4000 \
    --max-update 4000 \
    --max-epoch 10 \
    --batch-size 1 \
    --update-freq 1 \
    --seed 1 \
    --log-format simple \
    --skip-invalid-size-inputs-valid-test \
    --fp16 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe=sentencepiece \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric

ZouRuia commented 2 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:33:00.0 Off |                  N/A |
| 51%   40C    P2   129W / 350W |  16627MiB / 24576MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:34:00.0 Off |                  N/A |
| 50%   38C    P2   124W / 350W |  23000MiB / 24576MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:35:00.0 Off |                  N/A |
| 51%   31C    P2   110W / 350W |  24261MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:36:00.0 Off |                  N/A |
| 54%   41C    P2   128W / 350W |  15450MiB / 24576MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:37:00.0 Off |                  N/A |
| 48%   41C    P2   147W / 350W |  21509MiB / 24576MiB |     47%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 52%   45C    P2   150W / 350W |  16623MiB / 24576MiB |     50%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:B4:00.0 Off |                  N/A |
| 50%   40C    P2   150W / 350W |  11045MiB / 24576MiB |     42%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:B5:00.0 Off |                  N/A |
| 50%   41C    P2   157W / 350W |  11045MiB / 24576MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   8  NVIDIA GeForce ...  Off  | 00000000:B6:00.0 Off |                  N/A |
| 51%   41C    P2   159W / 350W |  11045MiB / 24576MiB |     37%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   9  NVIDIA GeForce ...  Off  | 00000000:B7:00.0 Off |                  N/A |
| 50%   17C    P8    18W / 350W |  12189MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |

I am using GPU 9, but I still get this problem.
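One possible reading of the output above: the log reports a barrier "with 10 nodes" and ranks 0 through 9, so the job appears to start one process per visible GPU rather than running only on GPU 9. The failing process is number 2, and GPU 2 shows 24261MiB of 24576MiB already in use. Assuming rank 2 maps to GPU 2 (not confirmed by the log), restricting the visible devices with the CUDA_VISIBLE_DEVICES environment variable may avoid the error. The sketch below is only an illustration: the helper name gpus_with_free_memory and the 12000 MiB threshold are hypothetical, and the memory figures are copied by hand from the nvidia-smi table.

```python
import os

# (used MiB, total MiB) per GPU index, copied from the nvidia-smi output above.
GPU_MEMORY_MIB = {
    0: (16627, 24576),
    1: (23000, 24576),
    2: (24261, 24576),
    3: (15450, 24576),
    4: (21509, 24576),
    5: (16623, 24576),
    6: (11045, 24576),
    7: (11045, 24576),
    8: (11045, 24576),
    9: (12189, 24576),
}


def gpus_with_free_memory(usage, min_free_mib):
    """Return GPU indices with at least min_free_mib free, most free first."""
    free = {idx: total - used for idx, (used, total) in usage.items()}
    candidates = [idx for idx, mib in free.items() if mib >= min_free_mib]
    # Sort by free memory (descending), breaking ties by GPU index.
    return sorted(candidates, key=lambda idx: (-free[idx], idx))


# Hypothetical threshold; the real requirement depends on model and batch size.
MIN_FREE_MIB = 12000
selected = gpus_with_free_memory(GPU_MEMORY_MIB, MIN_FREE_MIB)

# This only affects the current process and its children, and it must be set
# before anything in the process initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in selected)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

For the run in this thread, the equivalent from the shell would be to prefix the command, for example CUDA_VISIBLE_DEVICES=9 bash examples/train_iwslt14.sh followed by the same three arguments, so that only GPU 9 is visible to the training script.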

RobertBoganKang commented 2 years ago

After a lot of trial and error, I found the answer. For reference, see https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification. We should wrap the feature extraction in with torch.no_grad():.

For example:

with torch.no_grad():
    feature = <<extract_feature_model>>(audio)
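A self-contained version of the snippet above might look like the following sketch. The nn.Linear layer is only a hypothetical stand-in for the real feature extractor, and the tensor shapes are made up for illustration. Under torch.no_grad() PyTorch does not record the autograd graph, so intermediate activations are not kept for a backward pass, which can lower memory use during feature extraction. Whether this applies to the traceback in the first post, which fails inside distributed_init before any forward pass, is not established in this thread.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the feature extractor; any nn.Module would work.
extract_feature_model = nn.Linear(16000, 256)
extract_feature_model.eval()  # inference mode for layers such as dropout

# Dummy batch: 4 clips of 16000 samples each (made-up shape).
audio = torch.randn(4, 16000)

# Inside this block no autograd graph is built, so activations are not
# retained for backpropagation.
with torch.no_grad():
    feature = extract_feature_model(audio)

print(feature.shape, feature.requires_grad)
```

The resulting tensor has requires_grad set to False, so it can be stored or passed on without holding a reference to the extractor's computation graph.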

microsoft / unilm

RuntimeError: CUDA error: out of memory #814