[Closed] Supachan closed this issue 5 years ago.
A few comments here. `-w` sets the size of the workspace, in MB, that is used to store and process mini-batches. 64 MB is too small, and 3000 MB seems to be too much for this model type and GPU. However, the attached `nvidia-smi` output suggests a corrupted context on the GPU, as some process occupies 763 MB but is not listed in the process list (see this post for more details). Free up this memory or use `-w 2500` or `-w 2000`.
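If it helps, standard NVIDIA tooling can show what is holding that memory (these are general-purpose `nvidia-smi` and `fuser` invocations, not commands taken from this thread):

```shell
# List the compute processes the driver knows about, with PID and memory:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# A leftover context may not show up above; fuser lists every process
# that still holds an NVIDIA device node open, so it can find the culprit:
sudo fuser -v /dev/nvidia*
```

Killing whichever process `fuser` reveals (or simply rebooting) should release the stray 763 MB.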
It might also be necessary to decrease the values of `--mini-batch` and `--maxi-batch`, as together they determine the number of sentences that are pre-loaded for batch preparation. Which example do you use?
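As a sketch of what that could look like (the values below are illustrative guesses, not settings from this thread):

```shell
# Hypothetical values: --mini-batch sentences form one batch, and up to
# --maxi-batch such batches are pre-loaded and sorted by length, so the
# two flags together bound how many sentences are buffered at once.
$MARIAN_TRAIN \
    --mini-batch 32 \
    --maxi-batch 100 \
    ...   # remaining options unchanged
```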
Adding `--disp-first 10` to the training command will help determine whether training has started correctly.
Hi Snukky, thank you for your response. I ran the marian-example (here) and set `-w 2000`, but it did not work, as shown below:
[2019-05-17 01:48:18] Using single-device training
[2019-05-17 01:48:18] [data] Loading vocabulary from JSON/Yaml file model/vocab.ro.yml
[2019-05-17 01:48:18] [data] Setting vocabulary size for input 0 to 66000
[2019-05-17 01:48:18] [data] Loading vocabulary from JSON/Yaml file model/vocab.en.yml
[2019-05-17 01:48:19] [data] Setting vocabulary size for input 1 to 50000
[2019-05-17 01:48:19] [batching] Collecting statistics for batch fitting with step size 10
[2019-05-17 01:48:19] [memory] Extending reserved space to 2048 MB (device gpu0)
[2019-05-17 01:48:19] [memory] Reserving 453 MB, device gpu0
[2019-05-17 01:48:19] [memory] Reserving 453 MB, device gpu0
[2019-05-17 01:48:32] [batching] Done
[2019-05-17 01:48:32] [memory] Extending reserved space to 2048 MB (device gpu0)
[2019-05-17 01:48:33] Training started
[2019-05-17 01:48:33] [data] Shuffling files
[2019-05-17 01:48:33] [data] Done reading 2390233 sentences
[2019-05-17 01:48:40] [data] Done shuffling 2390233 sentences to temp files
[2019-05-17 01:48:41] [memory] Reserving 453 MB, device gpu0
[2019-05-17 01:48:41] [memory] Reserving 453 MB, device gpu0
[2019-05-17 01:48:42] [memory] Reserving 906 MB, device gpu0
[2019-05-17 01:48:42] Error: CUDA error 2 'out of memory' - /tmp/marian/src/tensors/gpu/device.cu:38: cudaMalloc(&data_, size)
[2019-05-17 01:48:42] Error: Aborted from virtual void marian::gpu::Device::reserve(size_t) in /tmp/marian/src/tensors/gpu/device.cu:38
[CALL STACK]
[0x1a7c7b1] marian::gpu::Device:: reserve (unsigned long) + 0x1401
[0x753ed3] marian::TensorAllocator:: reserveExact (unsigned long) + 0x1c3
[0x7f03a7] marian::Adam:: updateImpl (std::shared_ptr<marian::TensorBase>, std::shared_ptr<marian::TensorBase>) + 0x3a7
[0x90a52d] marian::SingletonGraph:: execute (std::shared_ptr<marian::data::Batch>) + 0x25d
[0x90dc33] marian::SingletonGraph:: update (std::shared_ptr<marian::data::Batch>) + 0x293
[0x6679e8] marian::Train<marian::SingletonGraph>:: run () + 0xa48
[0x59cc33] mainTrainer (int, char**) + 0x553
[0x57afba] main + 0x8a
[0x7fc3057f2830] __libc_start_main + 0xf0
[0x59a219] _start + 0x29
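Adding up the reservations visible in the log (my own rough arithmetic, not a figure from the thread) suggests why the final allocation fails on a 4 GB card that already has ~763 MiB held by a stray context:

```shell
# Reservations from the log, in MB: the 2048 MB workspace, two 453 MB
# blocks, and the 906 MB block whose cudaMalloc triggered the OOM.
requested=$((2048 + 453 + 453 + 906))
# Memory actually available: the 4040 MiB card minus the ~763 MiB
# held by the process that nvidia-smi no longer lists.
available=$((4040 - 763))
echo "requested: ${requested} MB, available: ${available} MB"
```

Since roughly 3860 MB is requested but only about 3277 MB is free, the last 906 MB reservation cannot succeed; either the stray 763 MiB must be released or the workspace reduced further.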
Below is the command I used (mostly default settings):
$MARIAN_TRAIN \
--devices $GPUS \
--type amun \
--model model/model.npz \
--train-sets data/corpus.bpe.ro data/corpus.bpe.en \
--vocabs model/vocab.ro.yml model/vocab.en.yml \
--dim-vocabs 66000 50000 \
--mini-batch-fit -w 2000 \
--layer-normalization --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 \
--early-stopping 5 \
--valid-freq 10000 --save-freq 10000 --disp-freq 10 \
--valid-metrics cross-entropy translation \
--valid-sets data/newsdev2016.bpe.ro data/newsdev2016.bpe.en \
--valid-script-path "bash ./scripts/validate.sh" \
--log model/train.log --valid-log model/valid.log \
--overwrite --keep-best \
--seed 1111 --exponential-smoothing \
--normalize=1 --beam-size=12 --quiet-translation
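For reference, a lower-memory variant of the memory-related flags for a 4 GB laptop GPU might look like this (the values are guesses on my part, not a tested recommendation):

```shell
# Hypothetical lower-memory settings; all other flags as in the
# original command above.
$MARIAN_TRAIN \
    --devices $GPUS \
    --mini-batch-fit -w 1024 \
    --max-length 100 \
    --disp-first 10 \
    ...
```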
I have no idea how to fix it. PS: I have recently changed `-w 2000` to `-w 512` and it is still running... I will let you know if anything changes.
Fri May 17 09:00:12 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:01:00.0 On | N/A |
| N/A 67C P0 N/A / N/A | 3011MiB / 4040MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3832 G /usr/lib/xorg/Xorg 316MiB |
| 0 4166 G /usr/bin/gnome-shell 150MiB |
| 0 4818 G ...-token=965E4CBF51B50D2B248BFB2AE55C36FF 77MiB |
| 0 5205 G ...uest-channel-token=17371207802130046407 42MiB |
| 0 7635 C ../../build/marian 2421MiB |
+-----------------------------------------------------------------------------+
It works, but it takes a very long time! Let's close this issue. Thanks.
Hi, I just installed marian-nmt/marian and ran ./run-me.sh with the default settings in training-basics, and got an error. I checked `--mini-batch-fit -w 3000`, which might blow up my GPU memory, so I decreased the workspace size and added a maximum sentence length, `--mini-batch-fit -w 64 --max-length 100`, in run-me.sh, but the error still appeared. How can I fix the problem? Please let me know. Thank you in advance. Supachan
PS: my laptop is a DELL Inspiron 15 7000 Gaming: