microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

CUDA illegal memory access and the same matrix was transferred between different devices for 20 times #3329

Closed kaituoxu closed 6 years ago

kaituoxu commented 6 years ago

Hi community,

I added a new ComputationNode named BiVfsmnNode to implement FSMN (Feedforward Sequential Memory Network) [1], but I can't get the results I expect from this new BiVfsmnNode. Maybe I missed some CNTK details when implementing it, so I'd like to ask for help reviewing my code, or for some guidance. Thanks in advance.

My CNTK branch (t-kax/bfsmn) and commit are here: https://github.com/Microsoft/CNTK/commit/8deb528b7298f988a614f8570ef1211ccdc446d6. The FSMN formulas are in Section 3 (DEEP-FSMN) of paper [1].

I want to try this new model because FSMN consistently outperforms BLSTM and LSTM with dramatic gains in speech recognition tasks [1]. I need to implement FSMN as a new ComputationNode because an implementation in plain BrainScript trains at least 5 times slower than LSTM, whereas the main advantage of FSMN is that it can be trained as fast as a DNN: the memory-block computation can be put in a single CUDA kernel.
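To illustrate what that kernel computes, here is a plain C++ reference sketch (variable names and the exact filter form are mine, loosely following the memory-block formulas in Section 3 of [1], not taken from the BiVfsmnNode code):

```cpp
// Bidirectional FSMN memory block: each output frame is the input frame plus
// elementwise-weighted sums over up to N1 past and N2 future frames. Every
// (t, d) element is independent, so a single CUDA kernel can compute all of
// them in parallel -- there is no sequential recurrence as in LSTM.
#include <vector>

std::vector<std::vector<float>> FsmnMemoryBlock(
    const std::vector<std::vector<float>>& h, // h[t][d]: T x D hidden activations
    const std::vector<std::vector<float>>& a, // a[i][d]: lookback filters, i = 0..N1
    const std::vector<std::vector<float>>& c) // c[j][d]: lookahead filters, j = 1..N2
{
    const size_t T = h.size(), D = T ? h[0].size() : 0;
    std::vector<std::vector<float>> out(T, std::vector<float>(D, 0.0f));
    for (size_t t = 0; t < T; ++t)
        for (size_t d = 0; d < D; ++d)
        {
            float m = h[t][d]; // skip connection
            for (size_t i = 0; i < a.size() && i <= t; ++i)
                m += a[i][d] * h[t - i][d];     // past context
            for (size_t j = 1; j <= c.size() && t + j < T; ++j)
                m += c[j - 1][d] * h[t + j][d]; // future context
            out[t][d] = m;
        }
    return out;
}
```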

I already implemented FSMN in another framework (the ASR toolkit Kaldi, nnet1), where FSMN outperforms LSTM by 10% [2]. I also implemented FSMN in plain BrainScript, and there it outperforms LSTM by 9%. But when I train FSMN using my BiVfsmnNode, it is worse than LSTM. So I guess I am missing some CNTK detail in the new ComputationNode.

[1] https://arxiv.org/abs/1803.05030
[2] https://github.com/kaituoxu/kaldi/tree/ktnet1

Thanks for your general help!

ke1337 commented 6 years ago

Please try to build a minimal test using Python. It would be quite trivial to extend, since you already added it in the V2 C++ API. Here is an example commit for adding a new node for both BrainScript and Python.

kaituoxu commented 6 years ago

Thanks. I'll try it.

kaituoxu commented 6 years ago

Problem fixed. https://github.com/Microsoft/CNTK/commit/9cabc8d8af0985f39f652818ed3ab07591f0d97e

It was because I had a wrong understanding of the minibatch matrix's layout. The logical layout of the minibatch matrix (here, two sequences of lengths 5 and 3, one per row, zero-padded) is:

-------
1111100
2220000
-------

I previously misunderstood the physical layout to be the rows simply concatenated:

---------------
11111002220000
---------------

The actual physical layout interleaves the sequences, time step by time step:

---------------
12121210100000
---------------
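To illustrate (a toy example of mine, not from the actual commit): within each time step, the frames of all sequences in the minibatch are stored adjacently, with the sequence index varying fastest:

```cpp
// Two sequences of lengths 5 and 3, padded to 7 time steps. Physically, CNTK
// stores one frame per (time step, sequence) pair: the time step is the outer
// loop and the sequence index is the inner loop.
#include <cstdio>

int main()
{
    const int logical[2][7] = { { 1, 1, 1, 1, 1, 0, 0 },   // sequence 1
                                { 2, 2, 2, 0, 0, 0, 0 } }; // sequence 2
    for (int t = 0; t < 7; ++t)     // outer loop: time steps
        for (int s = 0; s < 2; ++s) // inner loop: sequences
            std::printf("%d", logical[s][t]);
    std::printf("\n"); // prints 12121210100000
    return 0;
}
```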

ke1337 commented 6 years ago

Yes, the sequence (time) axis is the outer loop, so that frames at the same time step across all sequences in a minibatch are adjacent; this gives locality when an RNN steps through time across the minibatch. You may use sequence.unpack to get the batch axis as the outer loop instead; the sequence axis then becomes padded to a FreeDimension.
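For example, via the V2 C++ API (this assumes the CNTK::Sequence::Unpack overload taking (operand, paddingValue, suppressMaskOutput, name); check CNTKLibrary.h on your branch for the exact declaration):

```cpp
#include "CNTKLibrary.h"
using namespace CNTK;

int main()
{
    // x: shape [featDim] with the default dynamic axes (batch + sequence)
    auto x = InputVariable({ 40 }, DataType::Float, L"features");

    // Unpack pads the sequence axis to a FreeDimension and makes the batch
    // axis the outer axis; with the mask output suppressed, only the padded
    // value tensor is produced.
    auto unpacked = Sequence::Unpack(x, /*paddingValue=*/ 0.0, /*suppressMaskOutput=*/ true);
    return 0;
}
```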

kaituoxu commented 6 years ago

Hi @KeDengMS, when I train a model using my BiVfsmnNode, the log always shows:

08/06/2018 04:26:44: Starting minibatch loop.
WARNING: The same matrix with dim [1, 2205] has been transferred between different devices for 20 times.
WARNING: The same matrix with dim [1, 2205] has been transferred between different devices for 20 times.
WARNING: The same matrix with dim [1, 2205] has been transferred between different devices for 20 times.
WARNING: The same matrix with dim [1, 2205] has been transferred between different devices for 20 times.
WARNING: The same matrix with dim [1, 2205] has been transferred between different devices for 20 times.
08/06/2018 05:06:30:  Epoch[ 1 of 5]-Minibatch[   1- 250, 2.37%]: ce = 8.23480233 * 446088; err = 0.94972741 * 446088; time = 2385.5462s; samplesPerSecond = 187.0

I construct a Matrix<ElemType> m_flags in BiVfsmnNode's ForwardProp() and use it 3 times in BackpropTo(); its shape is 1 x minibatch-size. The minibatch size in this experiment is 2048, so I guess the transferred matrix may be m_flags.

Actually, I also get this warning when I implement FSMN in plain BrainScript without my BiVfsmnNode.

How can I avoid transferring this matrix multiple times?

Thanks!

kaituoxu commented 6 years ago

I also get an illegal memory access error under some model configs:

cudaStreamDestroy failed (PrefetchGPUDataTransferer dtor): an illegal memory access was encountered (cuda error 77)
cudaStreamDestroy failed (PrefetchGPUDataTransferer dtor): an illegal memory access was encountered (cuda error 77)
terminate called after throwing an instance of 'Microsoft::MSR::CNTK::ExceptionWithCallStack<std::runtime_error>'
  what():  Free in CUDAPageLockedMemAllocator failed: an illegal memory access was encountered (cuda error 77)

But some other configs work fine. If I use a smaller minibatch size, the error sometimes goes away.

I used gdb to debug the core dump and got the backtrace below:

$ gdb cntk core
(gdb) where
#0  0x00007f6a97152428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f6a9715402a in __GI_abort () at abort.c:89
#2  0x00007f6a97cb784d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f6a97cb56b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f6a97cb46a9 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f6a97cb5005 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f6a974f6f83 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007f6a974f7487 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x0000000001f778d0 in Microsoft::MSR::CNTK::ThrowFormattedVA<std::runtime_error>(const char *, typedef __va_list_tag __va_list_tag *) (format=0x7f6a9a5d428c "%s: %s (cuda error %d)", args=0x7ffeff496f50) at Source/Common/Include/Basics.h:59
#9  0x0000000001f779b2 in Microsoft::MSR::CNTK::ThrowFormatted<std::runtime_error> (format=0x7f6a9a5d428c "%s: %s (cuda error %d)") at Source/Common/Include/Basics.h:94
#10 0x00007f6a9912afb4 in Microsoft::MSR::CNTK::CheckCudaReturnCode (rc=cudaErrorIllegalAddress, msg=0x7f6a9a5d42f0 "Free in CUDAPageLockedMemAllocator failed") at Source/Math/CUDAPageLockedMemAllocator.cpp:15
#11 0x00007f6a9912b0b5 in Microsoft::MSR::CNTK::CUDAPageLockedMemAllocator::Free (p=0x204c80000, deviceId=3) at Source/Math/CUDAPageLockedMemAllocator.cpp:36
#12 0x00007f6a9912b105 in Microsoft::MSR::CNTK::CUDAPageLockedMemAllocator::Free (this=0xa0d0400, p=0x204c80000) at Source/Math/CUDAPageLockedMemAllocator.cpp:46
#13 0x00007f6a9a5bff4d in CNTK::CudaMemoryProvider::Free (this=0x755cc30, p=0x204c80000) at Source/Readers/ReaderLib/CudaMemoryProvider.h:40
#14 0x00007f6a9a5b7f65 in CNTK::PackerBase::StreamBuffer::<lambda(char*)>::operator()(char *) const (__closure=0x7f64f833cd50, p=0x204c80000 "\231\032\063A"...) at Source/Readers/ReaderLib/PackerBase.cpp:28
#15 0x00007f6a9a5b9f88 in std::_Sp_counted_deleter<char*, CNTK::PackerBase::StreamBuffer::Resize(size_t)::<lambda(char*)>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2u>::_M_dispose(void) (this=0x7f64f833cd40) at /usr/include/c++/5/bits/shared_ptr_base.h:466
#16 0x00000000020d908e in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f64f833cd40) at /usr/include/c++/5/bits/shared_ptr_base.h:150
#17 0x00000000020bde15 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x765ef50, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#18 0x00007f6a9a5a8db6 in std::__shared_ptr<char, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x765ef48, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:925
#19 0x00007f6a9a5a8dd2 in std::shared_ptr<char>::~shared_ptr (this=0x765ef48, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr.h:93
#20 0x00007f6a9a5ad120 in CNTK::PackerBase::StreamBuffer::~StreamBuffer (this=0x765ef30, __in_chrg=<optimized out>) at Source/Readers/ReaderLib/PackerBase.h:21
#21 0x00007f6a9a5ad14b in std::_Destroy<CNTK::PackerBase::StreamBuffer> (__pointer=0x765ef30) at /usr/include/c++/5/bits/stl_construct.h:93
#22 0x00007f6a9a5acfe7 in std::_Destroy_aux<false>::__destroy<CNTK::PackerBase::StreamBuffer*> (__first=0x765ef30, __last=0x765ef80) at /usr/include/c++/5/bits/stl_construct.h:103
#23 0x00007f6a9a5acd0c in std::_Destroy<CNTK::PackerBase::StreamBuffer*> (__first=0x765ef30, __last=0x765ef80) at /usr/include/c++/5/bits/stl_construct.h:126
#24 0x00007f6a9a5ac833 in std::_Destroy<CNTK::PackerBase::StreamBuffer*, CNTK::PackerBase::StreamBuffer> (__first=0x765ef30, __last=0x765ef80) at /usr/include/c++/5/bits/stl_construct.h:151
#25 0x00007f6a9a5ac175 in std::vector<CNTK::PackerBase::StreamBuffer, std::allocator<CNTK::PackerBase::StreamBuffer> >::~vector (this=0x77cf7e0, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_vector.h:424
#26 0x00007f6a9a5abc67 in std::_Destroy<std::vector<CNTK::PackerBase::StreamBuffer, std::allocator<CNTK::PackerBase::StreamBuffer> > > (__pointer=0x77cf7e0) at /usr/include/c++/5/bits/stl_construct.h:93
#27 0x00007f6a9a5ab797 in std::_Destroy_aux<false>::__destroy<std::vector<CNTK::PackerBase::StreamBuffer, std::allocator<CNTK::PackerBase::StreamBuffer> >*> (__first=0x77cf7e0, __last=0x77cf810) at /usr/include/c++/5/bits/stl_construct.h:103
#28 0x00007f6a9a5aadf8 in std::_Destroy<std::vector<CNTK::PackerBase::StreamBuffer, std::allocator<CNTK::PackerBase::StreamBuffer> >*> (__first=0x77cf7e0, __last=0x77cf810) at /usr/include/c++/5/bits/stl_construct.h:126
#29 0x00007f6a9a5aa19b in std::_Destroy<std::vector<CNTK::PackerBase::StreamBuffer, std::allocator<CNTK::PackerBase::StreamBuffer> >*, std::vector<CNTK::PackerBase::StreamBuffer, std::allocator<CNTK::PackerBase::StreamBuffer> > > (__first=0x77cf7e0, __last=0x77cf810) at /usr/include/c++/5/bits/stl_construct.h:151
#30 0x00007f6a9a5a937f in std::vector<std::vector<CNTK::PackerBase::StreamBuffer, std::allocator<CNTK::PackerBase::StreamBuffer> >, std::allocator<std::vector<CNTK::PackerBase::StreamBuffer, std::allocator<CNTK::PackerBase::StreamBuffer> > > >::~vector (this=0x754f5c0, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_vector.h:424
#31 0x00007f6a9a5a9034 in CNTK::PackerBase::~PackerBase (this=0x754f570, __in_chrg=<optimized out>) at Source/Readers/ReaderLib/PackerBase.h:17
#32 0x00007f6a9a5ad18a in CNTK::SequencePacker::~SequencePacker (this=0x754f570, __in_chrg=<optimized out>) at Source/Readers/ReaderLib/SequencePacker.h:14
#33 0x00007f6a333c3dbf in __gnu_cxx::new_allocator<CNTK::SequencePacker>::destroy<CNTK::SequencePacker> (this=0x754f570, __p=0x754f570) at /usr/include/c++/5/ext/new_allocator.h:124
#34 0x00007f6a333c3b4f in std::allocator_traits<std::allocator<CNTK::SequencePacker> >::destroy<CNTK::SequencePacker> (__a=..., __p=0x754f570) at /usr/include/c++/5/bits/alloc_traits.h:542
#35 0x00007f6a333c2cdb in std::_Sp_counted_ptr_inplace<CNTK::SequencePacker, std::allocator<CNTK::SequencePacker>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x754f560) at /usr/include/c++/5/bits/shared_ptr_base.h:531
#36 0x00000000020d908e in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x754f560) at /usr/include/c++/5/bits/shared_ptr_base.h:150
#37 0x00000000020bde15 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x72ec780, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#38 0x00007f6a9a5bffb6 in std::__shared_ptr<CNTK::Packer, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x72ec778, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:925
#39 0x00007f6a9a5bffd2 in std::shared_ptr<CNTK::Packer>::~shared_ptr (this=0x72ec778, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr.h:93
#40 0x00007f6a9a5bf6aa in CNTK::ReaderBase::~ReaderBase (this=0x72ec750, __in_chrg=<optimized out>) at Source/Readers/ReaderLib/ReaderBase.cpp:18
#41 0x00007f6a333c29ce in CNTK::CompositeDataReader::~CompositeDataReader (this=0x72ec750, __in_chrg=<optimized out>) at Source/Readers/CompositeDataReader/CompositeDataReader.h:52
#42 0x00007f6a333c7ea5 in __gnu_cxx::new_allocator<CNTK::CompositeDataReader>::destroy<CNTK::CompositeDataReader> (this=0x72ec750, __p=0x72ec750) at /usr/include/c++/5/ext/new_allocator.h:124
#43 0x00007f6a333c7dc1 in std::allocator_traits<std::allocator<CNTK::CompositeDataReader> >::destroy<CNTK::CompositeDataReader> (__a=..., __p=0x72ec750) at /usr/include/c++/5/bits/alloc_traits.h:542
#44 0x00007f6a333c79e7 in std::_Sp_counted_ptr_inplace<CNTK::CompositeDataReader, std::allocator<CNTK::CompositeDataReader>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x72ec740) at /usr/include/c++/5/bits/shared_ptr_base.h:531
#45 0x00000000020d908e in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x72ec740) at /usr/include/c++/5/bits/shared_ptr_base.h:150
#46 0x00000000020bde15 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7313df0, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#47 0x00007f6a9a585df0 in std::__shared_ptr<CNTK::Reader, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7313de8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:925
#48 0x00007f6a9a585e28 in std::shared_ptr<CNTK::Reader>::~shared_ptr (this=0x7313de8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr.h:93
#49 0x00007f6a9a580382 in CNTK::ReaderShim<float>::~ReaderShim (this=0x7313dc0, __in_chrg=<optimized out>) at Source/Readers/ReaderLib/ReaderShim.h:40
#50 0x00007f6a333c78fa in CNTK::CompositeReaderShim<float>::~CompositeReaderShim (this=0x7313dc0, __in_chrg=<optimized out>) at Source/Readers/CompositeDataReader/Exports.cpp:21
#51 0x00007f6a333c792a in CNTK::CompositeReaderShim<float>::~CompositeReaderShim (this=0x7313dc0, __in_chrg=<optimized out>) at Source/Readers/CompositeDataReader/Exports.cpp:21
#52 0x00007f6a9a5808b2 in CNTK::ReaderShim<float>::Destroy (this=0x7313dc0) at Source/Readers/ReaderLib/ReaderShim.h:60
#53 0x00007f6a9a509bca in Microsoft::MSR::CNTK::DataReader::~DataReader (this=0x652d230, __in_chrg=<optimized out>) at Source/Common/DataReader.cpp:158
#54 0x00000000022026cd in __gnu_cxx::new_allocator<Microsoft::MSR::CNTK::DataReader>::destroy<Microsoft::MSR::CNTK::DataReader> (this=0x652d230, __p=0x652d230) at /usr/include/c++/5/ext/new_allocator.h:124
#55 0x000000000220256b in std::allocator_traits<std::allocator<Microsoft::MSR::CNTK::DataReader> >::destroy<Microsoft::MSR::CNTK::DataReader> (__a=..., __p=0x652d230) at /usr/include/c++/5/bits/alloc_traits.h:542
#56 0x0000000002201f7d in std::_Sp_counted_ptr_inplace<Microsoft::MSR::CNTK::DataReader, std::allocator<Microsoft::MSR::CNTK::DataReader>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x652d220) at /usr/include/c++/5/bits/shared_ptr_base.h:531
#57 0x00000000020d908e in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x652d220) at /usr/include/c++/5/bits/shared_ptr_base.h:150
#58 0x00000000020bde15 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ffeff4977b8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#59 0x00000000021d2748 in std::__shared_ptr<Microsoft::MSR::CNTK::DataReader, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7ffeff4977b0, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:925
#60 0x00000000021d2764 in std::shared_ptr<Microsoft::MSR::CNTK::DataReader>::~shared_ptr (this=0x7ffeff4977b0, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr.h:93
#61 0x00000000021cd53f in DoTrain<Microsoft::MSR::CNTK::ConfigParameters, float> (config=...) at Source/ActionsLib/TrainActions.cpp:119
#62 0x00000000020d7066 in DoCommands<float> (config=..., mpi=std::shared_ptr (empty) 0x0) at Source/CNTK/CNTK.cpp:281
#63 0x000000000202efd6 in wmainOldCNTKConfig (argc=6, argv=0x5088c50) at Source/CNTK/CNTK.cpp:752
#64 0x000000000202f79b in wmain1 (argc=6, argv=0x5088c50) at Source/CNTK/CNTK.cpp:811
#65 0x000000000202fc9a in main (argc=6, argv=0x7ffeff498778) at Source/CNTK/CNTK.cpp:921

ke1337 commented 6 years ago

Not sure about the exact cause of the extra memory transfers; the best way to find out is to set a debug breakpoint at that warning and check the call stack. Maybe not related, but you should allocate m_flags from the MatrixPool. CNTK statically plans memory allocation in the MatrixPool, and most memory is used by it, so ad-hoc allocation (e.g. via m_flags.Resize) at runtime is not recommended: the remaining free memory may not be enough.
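A sketch of the pool-allocation pattern, following what existing nodes like DropoutNode do for m_maskOfDropout (this assumes m_flags becomes a shared_ptr<Matrix<ElemType>> member of BiVfsmnNode; check the exact virtual method names in ComputationNode.h on your branch):

```cpp
// Inside BiVfsmnNode<ElemType> (CNTK source tree, not a standalone program):
// request m_flags from the MatrixPool so it takes part in the static memory
// plan and stays on one device, instead of being constructed in every
// ForwardProp() call.
virtual void RequestMatricesBeforeForwardProp(MatrixPool& matrixPool) override
{
    Base::RequestMatricesBeforeForwardProp(matrixPool);
    RequestMatrixFromPool(m_flags, matrixPool); // m_flags: shared_ptr<Matrix<ElemType>>
}

// Release it only after backprop, since BackpropTo() reads m_flags three times.
virtual void ReleaseMatricesAfterBackprop(MatrixPool& matrixPool) override
{
    Base::ReleaseMatricesAfterBackprop(matrixPool);
    ReleaseMatrixToPool(m_flags, matrixPool);
}
```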

The reader issue seems memory related too, or it may have something to do with the randomization window if it's used. Please try tweaking the reader settings. Besides, you may try SetGPUMemoryAllocationTraceLevel to see if there are any clues of OOM.

kaituoxu commented 6 years ago

Hi @KeDengMS, thanks for your help.

The memory transfer comes from m_flags, because I construct a new m_flags in each ForwardProp().