marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

CUDA error (illegal memory access) and loss being nan when training big transformer model #371

Open jorgtied opened 3 years ago

jorgtied commented 3 years ago

Bug description

Training breaks with

[2021-05-12 10:19:19] [training] skipping 250846-th update due to loss being nan
[2021-05-12 10:19:19] Error: CUDA error 700 'an illegal memory access was encountered' - /users/tiedeman/projappl/install/marian/src/tensors/gpu/cuda_helpers.h:67: cudaMemcpy(dest, start, (end - start) * sizeof(T), cudaMemcpyDefault)
[2021-05-12 10:19:19] Error: Aborted from void CudaCopy(const T*, const T*, T*) [with T = unsigned int] in /users/tiedeman/projappl/install/marian/src/tensors/gpu/cuda_helpers.h:67

when training big transformer models on NVIDIA V100 GPUs.
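
To check whether NaN losses were already occurring before the crash, the training log can be scanned for skipped updates and CUDA errors. The following is a minimal sketch, assuming the log line formats quoted above; the helper name `scan_marian_log` and the sample lines (with the file path shortened) are illustrative and not part of Marian.

```python
import re

# Patterns are modelled on the log lines quoted in this report. This is an
# assumption about the format; other Marian versions may word them differently.
SKIP_RE = re.compile(r"\[training\] skipping (\d+)-th update due to loss being nan")
CUDA_RE = re.compile(r"Error: CUDA error (\d+) '([^']*)'")


def scan_marian_log(lines):
    """Return (skipped_updates, cuda_errors) found in an iterable of log lines."""
    skipped = []      # update numbers skipped because the loss was NaN
    cuda_errors = []  # (error code, message) pairs
    for line in lines:
        m = SKIP_RE.search(line)
        if m:
            skipped.append(int(m.group(1)))
        m = CUDA_RE.search(line)
        if m:
            cuda_errors.append((int(m.group(1)), m.group(2)))
    return skipped, cuda_errors


# Sample input: the two kinds of lines from the report, path shortened.
sample = [
    "[2021-05-12 10:19:19] [training] skipping 250846-th update due to loss being nan",
    "[2021-05-12 10:19:19] Error: CUDA error 700 'an illegal memory access was encountered' - cuda_helpers.h:67",
]
skipped, cuda_errors = scan_marian_log(sample)
print(skipped)
print(cuda_errors)
```

On a real run, the function can be given an open file handle for the file passed to `--log`. Many skipped updates well before the crash would suggest the loss was already unstable, whereas a single skipped update immediately before the CUDA error would point more towards the memory fault itself.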

How to reproduce

My command line:

marian \
  --guided-alignment /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/train/opus.spm32k-spm32k.src-trg.alg.gz \
  --early-stopping 10 \
  --valid-freq 10000 \
  --valid-sets /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/val/Tatoeba-dev.src.spm32k /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/val/Tatoeba-dev.trg.spm32k \
  --valid-metrics perplexity \
  --valid-mini-batch 16 \
  --valid-log /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.transformer-big-align.valid1.log \
  --beam-size 12 \
  --normalize 1 \
  --allow-unk \
  --overwrite \
  --keep-best \
  --model /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.transformer-big-align.model1.npz \
  --train-sets /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/train/opus.src.clean.spm32k.gz /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/train/opus.trg.clean.spm32k.gz \
  --max-length 500 \
  --vocabs /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.vocab.yml /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.vocab.yml \
  --mini-batch-fit \
  -w 24000 \
  --maxi-batch 500 \
  --save-freq 10000 \
  --disp-freq 10000 \
  --log /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.transformer-big-align.train1.log \
  --type transformer \
  --enc-depth 12 \
  --dec-depth 6 \
  --dim-emb 1024 \
  --transformer-heads 16 \
  --transformer-postprocess-emb d \
  --transformer-postprocess dan \
  --transformer-dropout 0.1 \
  --label-smoothing 0.1 \
  --learn-rate 0.0003 \
  --lr-warmup 16000 \
  --lr-decay-inv-sqrt 16000 \
  --lr-report \
  --optimizer-params 0.9 0.98 1e-09 \
  --clip-norm 5 \
  --fp16 \
  --tied-embeddings-all \
  --devices 0 1 2 3 \
  --sync-sgd \
  --seed 1111 \
  --sqlite \
  --tempdir /run/nvme/job_5803302/data \
  --exponential-smoothing

Size of training data: ca. 45 million sentence pairs. Training works fine with smaller transformer models on the same data set.

Context

[CALL STACK]
[0x1b0e567] void marian::gpu:: fill (std::shared_ptr, float, float, float) + 0x627
[0x1389e3d] void marian::TensorBase:: set (float) + 0x35d
[0x157a7ca]
[0x157e463] marian::inits::LambdaInit:: apply (IntrusivePtr) + 0x33
[0x157443f] marian::ConstantNode:: init () + 0x3f
[0x1565b2d] marian::ExpressionGraph:: forward (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr>>>>&, bool) + 0x5d
[0x15672f5] marian::ExpressionGraph:: forwardNext () + 0x2c5
[0x1734548]
[0x17cb2b4] marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}:: operator() () const + 0x54
[0x17cbdb0] std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator,void ()>::_M_run()::{lambda()#1},void>>:: _M_invoke (std::_Any_data const&) + 0x20
[0x12e032b] std::__future_base::_State_baseV2:: _M_do_set (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>, bool) + 0x1b
[0x7f694073d20b] + 0x620b
[0x17c0e38] std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>:: _M_invoke (std::_Any_data const&) + 0x108
[0x12e1d27] std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>:: _M_run () + 0x157
[0x504ab20]
[0x7f694073eea5] + 0x7ea5
[0x7f69401658cd] clone + 0x6d

[CALL STACK]
[0x1afe617] void CudaCopy (unsigned int const, unsigned int const, unsigned int) + 0x3f7
[0x1aff02e] void marian::gpu:: copy (std::shared_ptr, unsigned int const, unsigned int const, unsigned int) + 0x45e
[0x1587e63] void marian::TensorBase:: set (unsigned int const, unsigned int const) + 0x6f3
[0x15880f0] std::_Function_handler<void (IntrusivePtr),marian::inits::fromVector(std::vector<unsigned int,std::allocator> const&)::{lambda(IntrusivePtr)#1}>:: _M_invoke (std::_Any_data const&, IntrusivePtr&&) + 0x20
[0x1581137] marian::inits::LambdaInitConvert:: apply (IntrusivePtr) + 0x67
[0x157443f] marian::ConstantNode:: init () + 0x3f
[0x1565b2d] marian::ExpressionGraph:: forward (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr>>>>&, bool) + 0x5d
[0x15672f5] marian::ExpressionGraph:: forwardNext () + 0x2c5
[0x1734548]
[0x17cb2b4] marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}:: operator() () const + 0x54
[0x17cbdb0] std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator,void ()>::_M_run()::{lambda()#1},void>>:: _M_invoke (std::_Any_data const&) + 0x20
[0x12e032b] std::__future_base::_State_baseV2:: _M_do_set (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>, bool) + 0x1b
[0x7f694073d20b] + 0x620b
[0x17c0e38] std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>:: _M_invoke (std::_Any_data const&) + 0x108
[0x12e1d27] std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>:: _M_run () + 0x157
[0x504ab20]
[0x7f694073eea5] + 0x7ea5
[0x7f69401658cd] clone + 0x6d