marian-nmt / marian-examples

Examples, tutorials and use cases for Marian, including our WMT-2017/18 baselines.

Error: Floating-point exception #16

Closed: Supachan closed this issue 5 years ago

Supachan commented 5 years ago

Hi, I just installed marian-nmt/marian and ran ./run-me.sh with the default settings in training-basics, and I got this error:

[2019-05-16 06:07:44] Error: CUDA error 2 'out of memory' - /tmp/marian/src/tensors/gpu/device.cu:38: cudaMalloc(&data_, size)
[2019-05-16 06:07:44] Error: Aborted from virtual void marian::gpu::Device::reserve(size_t) in /tmp/marian/src/tensors/gpu/device.cu:38

I suspected that --mini-batch-fit -w 3000 might blow up my GPU memory, so I decreased the workspace size and added a maximum sentence length (--mini-batch-fit -w 64 --max-length 100) in run-me.sh, but then this error appeared:

[2019-05-16 07:12:23] Error: Floating-point exception
[2019-05-16 07:12:23] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /tmp/marian/src/common/logging.cpp:138

How can I fix the problem? Please let me know. Thank you in advance. Supachan

PS: my laptop is a Dell Inspiron 15 7000 Gaming:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   49C    P3    N/A /  N/A |    763MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
snukky commented 5 years ago

A few comments here. -w sets the size (in MB) of the workspace used to store and process mini-batches. 64 MB is too small, and 3000 MB seems to be too much for this model type and GPU. However, the attached nvidia-smi output suggests a corrupted context on the GPU, as some process occupies 763 MiB but does not appear in the process list (see this post for more details). Free up that memory or use -w 2500 or -w 2000.
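If it helps, a couple of generic Linux/NVIDIA commands (nothing Marian-specific, just a suggestion) can show which processes are actually holding GPU memory that the nvidia-smi process table does not account for:

    # list compute processes and the GPU memory they hold
    nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
    # list every process that has the NVIDIA device files open (needs root)
    sudo fuser -v /dev/nvidia*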

You may also need to decrease the values of --mini-batch and --maxi-batch, as together they determine the number of sentences that are pre-loaded for batch preparation. Which example are you using?

Adding --disp-first 10 to the training command will help you determine whether training has started correctly.
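For a ~4 GB card, the relevant part of the command could look roughly like this; the numbers are only a starting point and will need tuning for your data and model:

    # options to adjust in the $MARIAN_TRAIN call from run-me.sh
    --mini-batch-fit -w 2000 \
    --mini-batch 32 --maxi-batch 100 \
    --disp-first 10 \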

Supachan commented 5 years ago

Hi snukky, thank you for your response. I ran the marian-examples training-basics example (here) and set -w 2000, but it still failed:

[2019-05-17 01:48:18] Using single-device training
[2019-05-17 01:48:18] [data] Loading vocabulary from JSON/Yaml file model/vocab.ro.yml
[2019-05-17 01:48:18] [data] Setting vocabulary size for input 0 to 66000
[2019-05-17 01:48:18] [data] Loading vocabulary from JSON/Yaml file model/vocab.en.yml
[2019-05-17 01:48:19] [data] Setting vocabulary size for input 1 to 50000
[2019-05-17 01:48:19] [batching] Collecting statistics for batch fitting with step size 10
[2019-05-17 01:48:19] [memory] Extending reserved space to 2048 MB (device gpu0)
[2019-05-17 01:48:19] [memory] Reserving 453 MB, device gpu0
[2019-05-17 01:48:19] [memory] Reserving 453 MB, device gpu0
[2019-05-17 01:48:32] [batching] Done
[2019-05-17 01:48:32] [memory] Extending reserved space to 2048 MB (device gpu0)
[2019-05-17 01:48:33] Training started
[2019-05-17 01:48:33] [data] Shuffling files
[2019-05-17 01:48:33] [data] Done reading 2390233 sentences
[2019-05-17 01:48:40] [data] Done shuffling 2390233 sentences to temp files
[2019-05-17 01:48:41] [memory] Reserving 453 MB, device gpu0
[2019-05-17 01:48:41] [memory] Reserving 453 MB, device gpu0
[2019-05-17 01:48:42] [memory] Reserving 906 MB, device gpu0
[2019-05-17 01:48:42] Error: CUDA error 2 'out of memory' - /tmp/marian/src/tensors/gpu/device.cu:38: cudaMalloc(&data_, size)
[2019-05-17 01:48:42] Error: Aborted from virtual void marian::gpu::Device::reserve(size_t) in /tmp/marian/src/tensors/gpu/device.cu:38

[CALL STACK]
[0x1a7c7b1]         marian::gpu::Device::  reserve  (unsigned long)    + 0x1401
[0x753ed3]          marian::TensorAllocator::  reserveExact  (unsigned long) + 0x1c3
[0x7f03a7]          marian::Adam::  updateImpl  (std::shared_ptr<marian::TensorBase>,  std::shared_ptr<marian::TensorBase>) + 0x3a7
[0x90a52d]          marian::SingletonGraph::  execute  (std::shared_ptr<marian::data::Batch>) + 0x25d
[0x90dc33]          marian::SingletonGraph::  update  (std::shared_ptr<marian::data::Batch>) + 0x293
[0x6679e8]          marian::Train<marian::SingletonGraph>::  run  ()   + 0xa48
[0x59cc33]          mainTrainer  (int,  char**)                        + 0x553
[0x57afba]          main                                               + 0x8a
[0x7fc3057f2830]    __libc_start_main                                  + 0xf0
[0x59a219]          _start                                             + 0x29

Here is the relevant code from run-me.sh (essentially the default):

    $MARIAN_TRAIN \
        --devices       $GPUS \
        --type          amun \
        --model         model/model.npz \
        --train-sets    data/corpus.bpe.ro data/corpus.bpe.en \
        --vocabs        model/vocab.ro.yml model/vocab.en.yml \
        --dim-vocabs    66000 50000 \
        --mini-batch-fit -w 2000 \
        #--max-length   100 \
        --layer-normalization --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 \
        --early-stopping 5 \
        --valid-freq    10000 --save-freq 10000 --disp-freq 10 \
        --valid-metrics cross-entropy translation \
        --valid-sets    data/newsdev2016.bpe.ro data/newsdev2016.bpe.en \
        --valid-script-path "bash ./scripts/validate.sh" \
        --log model/train.log --valid-log model/valid.log \
        --overwrite --keep-best \
        --seed 1111 --exponential-smoothing \
        --normalize=1 --beam-size=12 --quiet-translation
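One thing I am not sure about (it may just be an artifact of pasting the script here): in bash, a # comment in the middle of a backslash-continued command ends the continuation, so with the commented #--max-length line in that position, every option on the following lines would be silently dropped from the marian call. If --max-length is not wanted, removing the whole line keeps the chain intact, e.g.:

        --dim-vocabs    66000 50000 \
        --mini-batch-fit -w 2000 \
        --layer-normalization --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 \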

I have no idea how to fix it. PS: I have since changed -w 2000 to -w 512 and it is still running... I will post an update if anything changes.
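(In case it is useful: while it runs, GPU usage and training progress can be followed with standard tools, e.g.:)

    # refresh GPU memory/utilisation every few seconds
    watch -n 5 nvidia-smi
    # follow the training log written via --log
    tail -f model/train.log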

Fri May 17 09:00:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   67C    P0    N/A /  N/A |   3011MiB /  4040MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3832      G   /usr/lib/xorg/Xorg                           316MiB |
|    0      4166      G   /usr/bin/gnome-shell                         150MiB |
|    0      4818      G   ...-token=965E4CBF51B50D2B248BFB2AE55C36FF    77MiB |
|    0      5205      G   ...uest-channel-token=17371207802130046407    42MiB |
|    0      7635      C   ../../build/marian                          2421MiB |
+-----------------------------------------------------------------------------+
Supachan commented 5 years ago

It works, but training takes a very long time! Let's close this issue. Thanks.