marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Marian gets stuck before allocating memory on NVIDIA GeForce RTX 3090 #865

Closed (fiqas closed this issue 3 years ago)

fiqas commented 3 years ago

Bug description

Marian gets stuck before allocating memory on NVIDIA GeForce RTX 3090 GPUs. I tested it with workspace sizes of 10 GB and 20 GB; it makes no difference, both runs hang. Disabling mini-batch-fit doesn't help either.

How to reproduce

I'm using my branch (fiqas/train_prune) with the code pulled from the compute86 branch.

./marian-pruned/build_86/marian -c student.tiny11tied.yml --model model.npz \
  --train-sets ../train02.en.gz ../train02.de.gz -T tmp --shuffle-in-ram \
  --pruning-type magnitude --pruning-start 0 --pruning-step 10000 --pruning-stop 400000 \
  --pruning-skip-embeddings --pruning-sparsity 0.5 \
  --vocabs vocab.spm vocab.spm --dim-vocabs 32000 32000 --max-length 200 \
  --mini-batch-fit -w 10000 --mini-batch 1000 --maxi-batch 1000 \
  --devices 0 1 2 3 --sync-sgd --cost-type ce-mean-words \
  --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 \
  --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 \
  --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --disp-first 10 \
  --valid-metrics bleu-detok ce-mean-words --valid-sets devset.en devset.de \
  --valid-translation-output devset.out --quiet-translation --valid-mini-batch 16 \
  --beam-size 1 --normalize 1 --early-stopping 20 --keep-best --exponential-smoothing \
  --log train.log --valid-log valid.log

Context

v1.10.19; 5cbcbfd 2021-05-04 10:10:55 +0000

cmake .. -DCOMPILE_TESTS=ON -DUSE_SENTENCEPIECE=ON -DCMAKE_BUILD_TYPE=Release

fatal: Not a git repository: (REDACTED_PATH).git/modules/src/3rd_party/fbgemm/modules/third_party/asmjit
Unable to find current revision in submodule path 'src/3rd_party/fbgemm/third_party/asmjit'
Failed to recurse into submodule path 'src/3rd_party/fbgemm'
-- Building with -march=native and intrinsics will be chosen automatically by the compiler to match the current machine.
-- Checking support for CPU intrinsics
-- Could not find hardware support for AVX512 on this machine.
CMake Warning at CMakeLists.txt:301 (message):
  On some Unix systems CUDA 10.0+ requires CMake 3.12.2+; you use CMake 3.5.1

-- Compiling code for Pascal GPUs
-- Compiling code for Volta GPUs
-- Compiling code for Turing GPUs
-- Compiling code for Ampere GPUs
-- Compiling code for Ampere RTX GPUs
-- Found CUDA libraries: /usr/local/cuda/lib64/libcurand.so;/usr/local/cuda/lib64/libcusparse.so;/usr/local/cuda/lib64/libcublas.so;/usr/local/cuda/lib64/libcublasLt.so
-- Found Tcmalloc: /usr/lib/libtcmalloc_minimal.so
CMake Warning at src/3rd_party/intgemm/CMakeLists.txt:33 (message):
  Not building AVX512VNNI-based multiplication because your compiler is
  too old.

  For details rerun cmake with --debug-trycompile then try to build in
  compile_tests/CMakeFiles/CMakeTmp.

-- VERSION: 0.1.94
-- Found TCMalloc: /usr/lib/libtcmalloc_minimal.so
-- Configuring done
-- Generating done
-- Build files have been written to: (REDACTED_PATH)/build_86
[2021-05-04 12:57:02] Using synchronous SGD
[2021-05-04 12:57:02] [comm] Compiled without MPI support. Running as a single process on alvis
[2021-05-04 12:57:02] Synced seed 1620133022
[2021-05-04 12:57:02] [data] Loading SentencePiece vocabulary from file vocab.spm
[2021-05-04 12:57:02] [data] Setting vocabulary size for input 0 to 32,000
[2021-05-04 12:57:02] [data] Loading SentencePiece vocabulary from file vocab.spm
[2021-05-04 12:57:02] [data] Setting vocabulary size for input 1 to 32,000

It hangs here. If you need the full logs, I can provide them as well.

kpu commented 3 years ago

Sorry @fiqas, if you want to file a bug report you really need to run master, even though I suspect it won't make a difference.

emjotde commented 3 years ago

Yes re: master. We seem to have a couple of issues on Ampere GPUs. I haven't had a chance to use any yet. That is coming soon (in a few weeks), but for as long as I don't have access to that hardware, you are on your own.

fiqas commented 3 years ago

I can confirm it's happening on master too.

kpu commented 3 years ago

@emjotde Your valhalla account is still active.

emjotde commented 3 years ago

Let's not go there :)

XapaJIaMnu commented 3 years ago

Can I get access to your experiments? I just managed to start training on hrist.

fiqas commented 3 years ago

Yeah, it works on hrist, but it didn't on alvis. I will test again later, when the GPUs on that machine are free.

XapaJIaMnu commented 3 years ago

tl;dr: if you have a CUDA version < 11.2, you might get very unpredictable/random crashes.
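
For anyone who wants to check a machine quickly, here is a minimal sketch that reads the toolkit release from `nvcc --version` and compares it against the 11.2 threshold mentioned above. The helper names (`parse_cuda_release`, `is_cuda_too_old`) are made up for illustration and are not part of Marian. It also assumes that the `nvcc` on your PATH is the toolkit Marian was actually built against, which may not be the case if several CUDA installations are present.

```python
import re
import shutil
import subprocess

# Threshold taken from the comment above; treat it as a rule of thumb.
MIN_CUDA = (11, 2)


def parse_cuda_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_cuda_too_old(version, minimum=MIN_CUDA):
    """True if the (major, minor) version is below the minimum."""
    return version < minimum


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found on PATH; cannot determine the CUDA toolkit version")
    else:
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        version = parse_cuda_release(result.stdout)
        if version is None:
            print("could not parse the nvcc output")
        elif is_cuda_too_old(version):
            print("CUDA %d.%d is older than 11.2" % version)
        else:
            print("CUDA %d.%d is 11.2 or newer" % version)
```

This only inspects the toolkit; it does not look at the driver or at what the Marian binary was linked against, so treat a passing result as a hint rather than a guarantee.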