nvidia-riva / riva-asrlib-decoder

Standalone implementation of the CUDA-accelerated WFST Decoder available in Riva

cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaStreamSynchronize(compute_st_)' #41

Open Sshubam opened 2 months ago

Sshubam commented 2 months ago

[NeMo I 2024-07-17 11:37:08 features:289] PADDING: 0
[NeMo I 2024-07-17 11:37:09 save_restore_connector:249] Model EncDecCTCModelBPE was successfully restored from conformer-or-ctc.nemo.
[NeMo I 2024-07-17 11:37:10 collections:196] Dataset loaded with 309 files totalling 8583.33 hours
[NeMo I 2024-07-17 11:37:10 collections:197] 0 files were filtered totalling 0.00 hours
ERROR ([5.5]:CopyLaneCountersToHostSync():cudadecoder/cuda-decoder.cc:596) cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaStreamSynchronize(compute_st_)'
Aborted (core dumped)

I am getting this error while decoding.

CUDA info:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Has anyone faced this issue before and can help?

@galv

galv commented 2 months ago

I believe I encountered such an issue before, which I believe I fixed with https://github.com/nvidia-riva/riva-asrlib-decoder/commit/a3b5bbdf4a5246962fe463c6453b948e57cb2470. If you're not on the latest version on PyPI, please consider updating. However, generally speaking the code has been pretty stable for the industry customers using it, and I haven't gotten any bug reports in a long time.

It looks like you are using NeMo here. Do you feel comfortable sharing your dataset, the model you are using, and any code you used to reproduce this? Since these workloads are data-dependent, it can be tricky to diagnose the exact problem without the data, model, and config to reproduce the issue.

Sshubam commented 2 months ago

> I believe I encountered such an issue before, which I believe I fixed with a3b5bbd. If you're not on the latest version on PyPI, please consider updating. However, generally speaking the code has been pretty stable for the industry customers using it, and I haven't gotten any bug reports in a long time.
>
> It looks like you are using NeMo here. Do you feel comfortable sharing your dataset, the model you are using, and any code you used to reproduce this? Since these workloads are data-dependent, it can be tricky to diagnose the exact problem without the data, model, and config to reproduce the issue.

Hey @galv, thanks for the reply. I am not on the latest version, so that might be the fix. However, I am not familiar with CUDA. Which function in the latest library should I use in place of this call (code below is for v0.2.0)?

BatchedMappedDecoderCuda(config, TLG_file, words_file, 129).decode(
    logits.to(torch.float32).to("cuda"),
    logits_len.to(torch.int64).to("cpu"),
)

I see three functions in version 0.4.4: decode_mbr, decode_nbest, and decode_write_lattice.

Is it decode_nbest? That gives me the same error:

ERROR ([5.5]:CopyLaneCountersToHostSync():cudadecoder/cuda-decoder.cc:602) cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaStreamSynchronize(compute_st_)'

BatchedMappedDecoderCuda(config, TLG_file, words_file, 129).decode_nbest(
    logits.to(torch.float32).to("cuda"),
    logits_len.to(torch.int64).to("cpu"),
)

Please point me to a documentation reference for this if one exists.

Also, are you aware of any other cases where this error could occur?
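As a general CUDA debugging aid (not specific to this library), error 700 often surfaces at a later `cudaStreamSynchronize` rather than at the kernel that actually faulted. Forcing synchronous launches makes the error report point at the real call site. A minimal sketch, assuming the variable is set before any CUDA context is created:

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call in the
# process, so place this at the very top of the script, before importing
# torch or the decoder library.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# From here on, kernel launches run synchronously, so an illegal memory
# access is reported at the launch that caused it instead of at a later
# stream synchronization.
```

Running under `compute-sanitizer` is another standard way to localize illegal accesses.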

Sshubam commented 2 months ago

@galv Hey, I found the cause of this issue: it was the blank-token count parameter passed to the BatchedMappedDecoderCuda() constructor. Can you please just let me know which function from the 0.4.4 library I should use, because I cannot understand CUDA. Thanks a lot!
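For reference, a common pitfall with that token-count argument (the `129` in the snippets above) is an off-by-one against the model's vocabulary: CTC models append the blank symbol after the real tokens, so the logits have one more column than the vocabulary size. A hedged sketch (the helper name is mine, not part of the library):

```python
# Hypothetical helper: derive the decoder's token-count argument from the
# acoustic model's vocabulary size. For CTC models such as NeMo's
# EncDecCTCModelBPE, the blank token is appended as the last index, so the
# logits tensor has vocab_size + 1 columns.
def decoder_num_tokens(vocab_size: int) -> int:
    return vocab_size + 1
```

For example, a 128-token BPE vocabulary would yield 129, matching the constructor argument shown above.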

galv commented 1 month ago

If you were using decode() before, you want decode_mbr() in 0.4.4:

https://github.com/nvidia-riva/riva-asrlib-decoder/blob/c94c84efb3efb526ce87fa9728a3dd3e621bb484/src/riva/asrlib/decoder/python_decoder.cc#L321

instead of "decode" https://github.com/nvidia-riva/riva-asrlib-decoder/blob/39b6a2bd6c8f19c1f390e1e12da1ffeeb2585ba5/src/riva/asrlib/decoder/python_decoder.cc#L281

They should have the same interface.
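If code needs to run against both releases, one option is a small wrapper that prefers decode_mbr() when it exists and falls back to decode(). A sketch, assuming both methods share the interface described above:

```python
def run_decoder(decoder, logits, logits_len):
    """Call decode_mbr() on 0.4.x decoders, decode() on older releases.

    Relies only on duck typing: it works with any object exposing either
    method with the (logits, logits_len) signature shown in the snippets
    above.
    """
    decode_fn = getattr(decoder, "decode_mbr", None) or decoder.decode
    return decode_fn(logits, logits_len)
```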

Sshubam commented 1 month ago

> If you were using decode() before, you want decode_mbr() in 0.4.4:
>
> https://github.com/nvidia-riva/riva-asrlib-decoder/blob/c94c84efb3efb526ce87fa9728a3dd3e621bb484/src/riva/asrlib/decoder/python_decoder.cc#L321
>
> instead of "decode":
>
> https://github.com/nvidia-riva/riva-asrlib-decoder/blob/39b6a2bd6c8f19c1f390e1e12da1ffeeb2585ba5/src/riva/asrlib/decoder/python_decoder.cc#L281
>
> They should have the same interface.

@galv I'm using the new function decode_mbr, but it is still stuck in this never-ending loop: (screenshot attached, 2024-07-24 11:20 AM)

galv commented 1 month ago

Determinization is guaranteed to terminate on acyclic graphs, which our output graphs should always be. If you output a Kaldi archive of lattices instead, you could verify that your lattice is acyclic: https://github.com/nvidia-riva/riva-asrlib-decoder/blob/c94c84efb3efb526ce87fa9728a3dd3e621bb484/src/riva/asrlib/decoder/python_decoder.cc#L289
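To illustrate what that acyclicity check amounts to on a generic directed graph (this is a plain adjacency-list DFS sketch, not the Kaldi lattice format; you would first have to export the lattice's state graph into this form):

```python
def is_acyclic(adj):
    """Return True if the directed graph (node -> list of successors)
    contains no cycle, using an iterative three-color DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in adj}
    for start in adj:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(adj[start]))]
        while stack:
            node, successors = stack[-1]
            advanced = False
            for nxt in successors:
                if color.get(nxt, WHITE) == GRAY:
                    return False  # back edge to an ancestor: cycle found
                if color.get(nxt, WHITE) == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(adj.get(nxt, []))))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK  # all successors explored
                stack.pop()
    return True
```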

You can try setting determinize_lattice=False to work around it for now:

https://github.com/nvidia-riva/riva-asrlib-decoder/blob/c94c84efb3efb526ce87fa9728a3dd3e621bb484/src/riva/asrlib/decoder/test_graph_construction.py#L785C1-L786C1

https://github.com/nvidia-riva/riva-asrlib-decoder/blob/c94c84efb3efb526ce87fa9728a3dd3e621bb484/src/riva/asrlib/decoder/python_decoder.cc#L149

However, upon further reflection, I think I am familiar with this issue. I believe that CTC models are not suited to phone-based determinization, because phone determinization depends on phones carrying "word boundary information", which we don't have for CTC models, since they don't use triphones. Are you sure you have updated to a recent version of the library? I had a commit related to that here: https://github.com/nvidia-riva/riva-asrlib-decoder/commit/cdf9cdc4552e65d6d4ed72ba777ef7231e9512bc