Can not run training program with cuda 10.2

t13m commented 4 years ago

Hi, I tried to run Eesen in NVIDIA's Docker container, but it failed.

The container has CUDA 10.2 in it. Eesen compiles, but invoking "train-ctc-parallel" crashes with the following logs:

LOG (train-ctc-parallel:DisableCaching():cuda-device.cc:731) Disabling caching of GPU memory.
LOG (train-ctc-parallel:SetUpdateAlgorithm():net.cc:483) Selecting SGD with momentum as optimization algorithm.
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 0
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 1
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 2
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 3
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 4
add-deltas ark:- ark:-
copy-feats scp:exp/train_char_l5_c320/train_local.scp ark:-
LOG (train-ctc-parallel:main():train-ctc-parallel.cc:133) TRAINING STARTED
ERROR (train-ctc-parallel:AddVecToRows():cuda-matrix.cc:541) cudaError_t 209 : "no kernel image is available for execution on the device" returned from 'cudaGetLastError()'
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe gunzip -c exp/train_char_l5_c320/labels.tr.gz| had nonzero return status 13
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/train_char_l5_c320/train_local.scp ark:- | add-deltas ark:- ark:- | had nonzero return status 36096
ERROR (train-ctc-parallel:AddVecToRows():cuda-matrix.cc:541) cudaError_t 209 : "no kernel image is available for execution on the device" returned from 'cudaGetLastError()'

[stack trace: ]
eesen::KaldiGetStackTrace[abi:cxx11]()
eesen::KaldiErrorMessage::~KaldiErrorMessage()
eesen::CuMatrixBase::AddVecToRows(float, eesen::CuVectorBase const&, float)
eesen::BiLstmParallel::PropagateFncVanillaPassForward(eesen::CuMatrixBase const&, int, int)
eesen::BiLstmParallel::PropagateFnc(eesen::CuMatrixBase const&, eesen::CuMatrixBase)
eesen::Layer::Propagate(eesen::CuMatrixBase const&, eesen::CuMatrix)
eesen::Net::Propagate(eesen::CuMatrixBase const&, eesen::CuMatrix*)
train-ctc-parallel(main+0x148d) [0x5583f00fe692]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f385afb9b97]
train-ctc-parallel(_start+0x2a) [0x5583f00fb44a]

Is there any workaround for this? I don't know much about CUDA. I tried adding "-gencode arch=compute_{70,72,75},code={70,72,75}" to gpucompute/Makefile, but it still crashes.
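For anyone hitting the same error: "no kernel image is available for execution on the device" generally means the binary contains no GPU code that matches the card's compute capability. One possible reason the flag above had no effect is its form. nvcc does not expand braces itself, the shell that runs a Makefile recipe may not expand them either, and the code= value is normally written as sm_XX (or compute_XX for PTX) rather than a bare number. The sketch below is only an illustration under those assumptions. The helper name gencode_flags is hypothetical, the thread does not say which GPU was used, and it is not confirmed where gpucompute/Makefile actually picks up these flags. It simply spells out one -gencode pair per architecture.

```python
def gencode_flags(capabilities, ptx_for_newest=True):
    """Build explicit nvcc -gencode flags from compute capabilities such as "7.5".

    Hypothetical helper for illustration; it is not part of Eesen or CUDA.
    """
    # "7.5" -> "75"; drop duplicates and sort numerically.
    archs = sorted({cap.replace(".", "") for cap in capabilities}, key=int)

    # One native binary (SASS) target per architecture.
    flags = [f"-gencode arch=compute_{a},code=sm_{a}" for a in archs]

    # Optionally embed PTX for the newest architecture so that the driver
    # can JIT-compile it for GPUs newer than anything listed.
    if ptx_for_newest and archs:
        newest = archs[-1]
        flags.append(f"-gencode arch=compute_{newest},code=compute_{newest}")
    return flags


if __name__ == "__main__":
    # The capabilities from the original attempt: 7.0, 7.2 and 7.5.
    for flag in gencode_flags(["7.0", "7.2", "7.5"]):
        print(flag)
```

The printed flags would then be pasted into whichever variable the Makefile passes to nvcc, followed by a clean rebuild so that the old object files are not reused. Whether this resolves the crash in this particular setup is untested here.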

liyongze commented 3 years ago

Could you tell me how you fixed this problem? I ran into the same problem.

t13m commented 3 years ago

Hi, I didn't manage to make it work. My experiments were conducted on CPU.

liyongze commented 3 years ago

Thanks for your reply!


srvk / eesen

Can not run training program with cuda 10.2 #220