kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

batched-wav-nnet3-cuda2 is sometimes unable to allocate CUDA memory #4306

Closed kvishnivetsky closed 3 years ago

kvishnivetsky commented 4 years ago

Kaldi version: 5-5.636
CUDA support: yes
Driver Version: 440.95.01
CUDA Version: 10.2
OS: CentOS 7 x64
Virtualization: openVZ
NVIDIA Hardware: 2 x NVIDIA Tesla T4

It all started with unpredictable segmentation faults. After applying the patch https://github.com/kaldi-asr/kaldi/pull/4305 we found an error in the HostDeviceVector::Reallocate method at batched-threaded-nnet3-cuda-pipeline2.h:159

cudaMalloc() error message: an illegal memory access was encountered

dgxlsir commented 3 years ago

I want to use the GPU to decode my chain model, but I have not succeeded. Are there any example scripts that could help me, or any suggestions? Thank you very much!

jtrmal commented 3 years ago

I think you are not being specific enough. You don't say what "not success" means specifically.

I think the default nnet3 decode.sh can do the forward pass/inference on GPU, but the speedup is not great. For a better speedup you would have to use the contribution from the Nvidia people, but I cannot recall whether there are example scripts for it.

dgxlsir commented 3 years ago

Yes, you are right. I used the contribution from the Nvidia people, but it has no example scripts, so I tried to write my own scripts around its tools. That did not succeed, so I need some example scripts for the Nvidia contribution. Thank you for your reply.

jtrmal commented 3 years ago

Again, you are not saying what the error or behavior of "not succeeding" was. y.

dgxlsir commented 3 years ago

When I include "cudadecoder/xxx.h" I get errors saying that some names are not declared. For example, when I do:

#include "cudadecoder/cuda-decoder.h"

I get:

error: ‘CuDevice’ has not been declared
note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)

danpovey commented 3 years ago

If you want to compile against this stuff, it's better if you add a program inside the Kaldi source tree and compile as if it were a Kaldi program. That way you get all the compilation options and flags and #defines.
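
One plausible reason for the ‘CuDevice’ has not been declared error reported above, offered as an assumption rather than a confirmed diagnosis: declarations guarded by a preprocessor define such as HAVE_CUDA vanish when a file is compiled without the flags that the Kaldi build supplies. The sketch below does not use any real Kaldi header. CuDeviceStandIn and SelectedBackend are hypothetical names chosen only to show how such a guard behaves.

```cpp
#include <string>

// Assumption: the project's build system defines HAVE_CUDA. When a file is
// compiled outside that build system the macro is absent, so default it to 0.
#ifndef HAVE_CUDA
#define HAVE_CUDA 0
#endif

#if HAVE_CUDA == 1
// Hypothetical stand-in for a type that is only declared in CUDA-enabled
// builds. Without the define, this declaration does not exist at all.
struct CuDeviceStandIn {
  static std::string Name() { return "gpu"; }
};
#endif

// Reports which branch the preprocessor selected for this translation unit.
std::string SelectedBackend() {
#if HAVE_CUDA == 1
  return CuDeviceStandIn::Name();
#else
  return "cpu-only";
#endif
}
```

Compiled with no extra flags, SelectedBackend() takes the "cpu-only" branch and CuDeviceStandIn is never declared. Compiled with -DHAVE_CUDA=1, the guarded type appears. Building the program inside the source tree picks up such defines automatically, which is the point of the advice above.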

jtrmal commented 3 years ago

Plus, you are probably describing two issues: compiling one thing and running another. Compilation should not end in a "cudaMalloc" failure. y.

kvishnivetsky commented 3 years ago

Hi @jtrmal, the cudaMalloc error was MY issue. I do not know why @dgxlsir is writing here; he really has a different issue.

jtrmal commented 3 years ago

Ah, two different people. Sorry, my bad, I didn't check. Y.

kvishnivetsky commented 3 years ago

Hi, guys.

Is there any progress on my issue, batched-wav-nnet3-cuda2 sometimes being unable to allocate CUDA memory (the original topic of this issue)?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

kvishnivetsky commented 3 years ago

ping

danpovey commented 3 years ago

I merged your PR that you said fixed a segfault. It's normal for these kinds of programs to require some configuration adjustments to run on different hardware and with (e.g.) different graphs. At this point there's not enough detail to say that it's a bug or something that requires fixing.

kvishnivetsky commented 3 years ago

Thanks for merging, I remember that. But the PR only fixes the "uncontrolled" behaviour in memory operations and does NOT fix the "real cause" of that behaviour. What kind of information do you need to find the "real cause" of this "cudaMalloc() error message: an illegal memory access was encountered" issue?

danpovey commented 3 years ago

It says "an illegal memory access was encountered"? That is not normal. I would probably try running it under cuda-gdb or cuda-memcheck to see whether it finds where there is (e.g.) an out-of-bounds access.
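
Some background on why an allocation call can report a memory-access error: CUDA kernel launches are asynchronous, and a fault inside a kernel is generally reported by a later runtime call, after which the error persists for that context. So the failing cudaMalloc() is likely only where an earlier fault became visible, which is why a tool such as cuda-memcheck is needed to find the kernel that actually faulted. The block below is a toy, host-only model of that reporting pattern. It does not call the CUDA API, and ToyRuntime, Launch, Allocate and Status are hypothetical names used for illustration only.

```cpp
// Toy host-only model (NOT the CUDA API): a fault in previously launched work
// is reported by a later, unrelated call, and then sticks.
enum class Status { kSuccess, kIllegalAccess };

class ToyRuntime {
 public:
  // Models an asynchronous launch. It returns the status known so far; a
  // fault inside the newly launched work is not known yet at this point.
  Status Launch(bool kernel_is_buggy) {
    if (kernel_is_buggy) pending_fault_ = true;
    return sticky_;
  }

  // Models a later call such as an allocation. Any pending fault from earlier
  // work surfaces here, even though the allocation itself did nothing wrong.
  Status Allocate() {
    if (pending_fault_) {
      sticky_ = Status::kIllegalAccess;
      pending_fault_ = false;
    }
    return sticky_;
  }

 private:
  bool pending_fault_ = false;          // fault that has not been reported yet
  Status sticky_ = Status::kSuccess;    // once set to an error, it never clears
};
```

In this model, launching a buggy kernel appears to succeed, the next Allocate() returns kIllegalAccess, and every call after that returns the same error. The allocation site is therefore a poor guide to the location of the bug.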

stale[bot] commented 3 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

kkm000 commented 3 years ago

I am closing this issue because of inactivity. @kvishnivetsky, if you can repro and provide cuda-memcheck data, please ping me, and I'll reopen it. @-mention me for a faster response!