marian-nmt / marian-examples

Examples, tutorials and use cases for Marian, including our WMT-2017/18 baselines.

WMT-2017 transformer example: OOM error #3

Closed MaksymDel closed 6 years ago

MaksymDel commented 6 years ago

Part of my stdout output:

[2018-03-29 18:23:41] Starting epoch 9
[2018-03-29 18:23:41] Training finished
[2018-03-29 18:23:46] Saving model to model/ens1/model.npz.best-ce-mean-words.npz
[2018-03-29 18:23:50] [valid] 16 : ce-mean-words : 7.50067 : new best
[2018-03-29 18:23:55] Saving model to model/ens1/model.npz.best-perplexity.npz
[2018-03-29 18:23:59] [valid] 16 : perplexity : 1809.25 : new best
tcmalloc: large alloc 1073741824 bytes == 0x2ea1a000 @ 
tcmalloc: large alloc 1610612736 bytes == 0x6ea1a000 @ 
tcmalloc: large alloc 2147483648 bytes == 0xf794000 @ 
tcmalloc: large alloc 2684354560 bytes == 0xf794000 @ 
tcmalloc: large alloc 3221225472 bytes == 0xcea1a000 @ 
tcmalloc: large alloc 3758096384 bytes == 0xfebe000 @ 
tcmalloc: large alloc 4294967296 bytes == 0x108ae000 @ 
tcmalloc: large alloc 4831838208 bytes == 0x10776000 @ 
tcmalloc: large alloc 5368709120 bytes == 0x10e64000 @ 
[2018-03-29 18:25:38] Error: out of memory - /storage/software/marian/src/marian/src/tensors/gpu/device.cu:30
./run-me.sh: line 108:  6273 Aborted

After that, the script continues.

I use a 16 GB GPU to train the model. Any ideas on this?
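The tcmalloc lines in the log above show host-side allocation requests growing in fixed 512 MiB steps, from 1 GiB up to 5 GiB, just before the GPU OOM. A short Python sketch (not part of Marian, just decoding the sizes printed in the log):

```python
# Allocation sizes copied from the tcmalloc lines in the log above.
sizes = [1073741824, 1610612736, 2147483648, 2684354560,
         3221225472, 3758096384, 4294967296, 4831838208, 5368709120]

# Convert to GiB and compute the step between consecutive requests.
gib = [s / 2**30 for s in sizes]
steps = {b - a for a, b in zip(sizes, sizes[1:])}

print(gib)    # 1.0 up to 5.0 GiB in 0.5 GiB increments
print(steps)  # a single step size: 536870912 bytes (512 MiB)
```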

MaksymDel commented 6 years ago

Resolved by removing the entire model folder, regenerating the data, and re-running the script from scratch.

By default, Marian resumes training when it sees that the model folder is not empty, right?

emjotde commented 6 years ago

Yes, it does. With this example it is still a little bit wacky, as the smoothed models (--exponential-smoothing) should not be the ones used for resuming, but it does not seem to do any harm either. We are currently working on making this fully correct.

emjotde commented 6 years ago

BTW, these are the line counts for the files in the data folder:

   19122526 data/all.bpe.de
   19122526 data/all.bpe.en
    4561263 data/corpus.bpe.de
    4561263 data/corpus.bpe.en
    4590101 data/corpus.de
    4590101 data/corpus.en
    4561263 data/corpus.tc.de
    4561263 data/corpus.tc.en
     157788 data/corpus.tok.de
    4590101 data/corpus.tok.en
    4590101 data/corpus.tok.uncleaned.de
    4590101 data/corpus.tok.uncleaned.en
   10000000 data/news.2016.bpe.de
   10000000 data/news.2016.bpe.en
   10000000 data/news.2016.de
   10000000 data/news.2016.tc.de
   10000000 data/news.2016.tok.de
       2737 data/test2014.bpe.en
       2737 data/test2014.en
       2737 data/test2014.tc.en
       2737 data/test2014.tok.en
       2169 data/test2015.bpe.en
       2169 data/test2015.en
       2169 data/test2015.tc.en
       2169 data/test2015.tok.en
       2999 data/test2016.bpe.en
       2999 data/test2016.en
       2999 data/test2016.tc.en
       2999 data/test2016.tok.en
       3004 data/test2017.bpe.en
       3004 data/test2017.en
       3004 data/test2017.tc.en
       3004 data/test2017.tok.en
       2999 data/valid.bpe.de
       2999 data/valid.bpe.en
       2999 data/valid.de
       2999 data/valid.en
       2999 data/valid.tc.de
       2999 data/valid.tc.en
       2999 data/valid.tok.de
       2999 data/valid.tok.en
MaksymDel commented 6 years ago

Thanks!

Closing for now.