k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

k2 1.15 potential GPU bug #993

Open yuekaizhang opened 2 years ago

yuekaizhang commented 2 years ago

Bug description: whether or not the k2 CTC topology H is created on the GPU changes the output of torch.nonzero() in later, unrelated code.

Version info: V100 32 GB, version.txt

input.zip

To reproduce: pip3 install k2==1.15.1.dev20220604+cuda11.0.torch1.7.1 -f https://k2-fsa.org/nightly/, then unzip the attached NumPy input file and run the code below.

import k2
import torch
import numpy as np
import math

device = torch.device('cuda:0')
len_dict = 4232
batch_size = 2

# This k2.ctc_topo call causes the assert below to fail; if we comment it out,
# there is no problem. k2 1.10 also works fine.

H = k2.ctc_topo(
          max_token=len_dict-1,
          modified=False,
          device=device,
      )

data = np.load('./input.npz')
ctc_log_probs = torch.from_numpy(data['ctc_log_probs']).to(device)
encoder_out_lens = torch.from_numpy(data['encoder_out_lens']).to(device)

condition = ctc_log_probs[0,:encoder_out_lens[0],0] < math.log(0.95)
assert len(torch.nonzero(condition, as_tuple=False)) > 0

In case you can't reproduce this bug: I am using this Docker image, Dockerfile.txt

danpovey commented 2 years ago

So you are saying that just calling that line with H will cause an error in the later code even though the later code has nothing to do with H?

That might be some kind of memory overwrite or memory out-of-bounds thing. Possibly cuda-memcheck might tell us something.
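
Not a replacement for cuda-memcheck, but a cheap first check along those lines would be to make kernel launches synchronous with CUDA_LAUNCH_BLOCKING, so an illegal access gets reported at the call that caused it rather than at the later torch.nonzero(); a minimal sketch, assuming the same script and input values as above:

import os

# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized,
# i.e. before importing torch/k2.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import k2
import torch

device = torch.device('cuda:0')
H = k2.ctc_topo(max_token=4231, modified=False, device=device)
torch.cuda.synchronize(device)  # any pending launch error should surface here
print("ctc_topo finished without an asynchronous CUDA error")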

yuekaizhang commented 2 years ago

Yes, even if we do nothing with H, it causes the wrong output from torch.nonzero().

danpovey commented 2 years ago

Can you try to trace back the error as early as possible? I assume the output of np.load is being affected, e.g. data['ctc_log_probs'] is being affected. You might be able to print out the first few elements and verify this. Try to figure out which elements are being affected.

Something else: see if just using that device for something else, like torch.zeros(), causes the error. It may not be specific to k2, it may be something about np.load, and device mismatch or something. Print out data['ctc_log_probs'].device, and see if it's also cuda:0.

Also it might be a good idea to run it under cuda-gdb or cuda-memcheck, if you can install them, e.g. cuda-memcheck python3 my_script.py, or cuda-gdb --args python3 my_script.py and then r at the (cuda-gdb) prompt, or something like that.
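
A sketch of that check (it assumes the same input.npz and device as in the report; 4231 is len_dict - 1 from the script above):

import k2
import torch
import numpy as np

device = torch.device('cuda:0')

# Control: does merely using the device, without k2, already misbehave?
print(torch.zeros(4, device=device))

H = k2.ctc_topo(max_token=4231, modified=False, device=device)

data = np.load('./input.npz')
ctc_log_probs = torch.from_numpy(data['ctc_log_probs'])
print(ctc_log_probs[0, 0, :10])   # values straight from np.load, still on the CPU
ctc_log_probs = ctc_log_probs.to(device)
print(ctc_log_probs.device)       # should be cuda:0
print(ctc_log_probs[0, 0, :10])   # the same values are expected after the copy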

yuekaizhang commented 2 years ago

Can you try to trace back the error as early as possible? I assume the output of np.load is being affected, e.g. data['ctc_log_probs'] is being affected. You might be able to print out the first few elements and verify this. Try to figure out which elements are being affected. Something else: see if just using that device for something else, like torch.zeros(), causes the error. It may not be specific to k2, it may be something about np.load, and device mismatch or something. Print out data['ctc_log_probs'].device, and see if it's also cuda:0. Also it might be a good idea to run it in cuda-gdb or cuda-memcheck, if you can install them.

Thanks Dan, I ran the code below with cuda-memcheck python3 issue.py > log.bad; the error log is attached: log.bad.txt

import k2
import torch
import numpy as np

device = torch.device('cuda:2')
H = k2.ctc_topo(
          max_token=4231,
          modified=False,
          device=device,
      )

data = np.load('./input.npz')
ctc_log_probs = torch.from_numpy(data['ctc_log_probs'])
print(ctc_log_probs[0,0,:10])
ctc_log_probs = ctc_log_probs.to(device)
print(f"ctc_log_probs device: {ctc_log_probs.device}")
# print(ctc_log_probs[0,0,:10])  # commented out: this print gets stuck

It looks like before ctc_log_probs = ctc_log_probs.to(device) the prints are normal. After sending the tensor to the device I can't even print it; the print just gets stuck. Replacing ctc_log_probs with a tensor initialized by torch.zeros() shows no problem at all.

The error details are in log.bad.
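
To narrow down which step actually causes the stall, one option is an explicit synchronization after each operation, so the hang is attributed to the operation that caused it rather than to the first print that needs a device-to-host copy (a sketch; same input.npz and the max_token from the original report):

import k2
import torch
import numpy as np

device = torch.device('cuda:2')

H = k2.ctc_topo(max_token=4231, modified=False, device=device)
torch.cuda.synchronize(device)
print("ctc_topo done")

data = np.load('./input.npz')
ctc_log_probs = torch.from_numpy(data['ctc_log_probs']).to(device)
torch.cuda.synchronize(device)
print("host-to-device copy done")

# This print needs a device-to-host copy; if one of the steps above left the
# device in a bad state, this is where the hang would otherwise show up.
print(ctc_log_probs[0, 0, :10])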

danpovey commented 2 years ago

It says

========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
========= Internal Memcheck Error: Initialization failed

... please see if there is such a thing as compute-sanitizer

yuekaizhang commented 2 years ago

It says

========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
========= Internal Memcheck Error: Initialization failed

... please see if there is such a thing as compute-sanitizer

Sorry, I didn't notice that the previous problems came from cuda-memcheck itself. Setting CUDA_MEMCHECK_PATCH_MODULE=1 fixes that, and cuda-memcheck now reports 0 errors. I then switched to compute-sanitizer, which also reports 0 errors. However, the issue still exists. I guess I need to check it with cuda-gdb; I will post here if there is an update.

danpovey commented 2 years ago

You should be able to print now, I hope. See if the printed value of any of those variables is affected, and see which one is affected first. If you are not able to print even without cuda-gdb, I suspect the issue might be a hang/loop inside whichever kernel is used inside ctc_topo(). If you did

export K2_SYNC_KERNELS=1
export CUDA_DEVICE_SYNCHRONIZE=1

you would probably see it hang in the call to ctc_topo(). @pkufool can you please see whether ctc_topo might have a problem in one of its kernels that could cause a loop/hang?
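
A sketch of the K2_SYNC_KERNELS experiment above (it assumes k2 picks these environment variables up at run time; exporting them in the shell before starting Python is the safer option):

import os

# The two debug flags mentioned above, set before importing k2 so they take
# effect. With kernels synchronized, a hang should show up inside ctc_topo().
os.environ['K2_SYNC_KERNELS'] = '1'
os.environ['CUDA_DEVICE_SYNCHRONIZE'] = '1'

import k2
import torch

device = torch.device('cuda:2')
H = k2.ctc_topo(max_token=4231, modified=False, device=device)
print("ctc_topo returned")  # not reached if one of its kernels hangs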

yuekaizhang commented 2 years ago

I put the ctc_topo call after the np.load-related code. This time the print no longer hangs, but k2 reports errors: swipe.txt. Also, I tested the code on k2 1.10 and it works there, so I guess the issue was introduced after 1.10.

import k2
import torch
import numpy as np

device = torch.device('cuda:2')

# Calling ctc_topo here would cause the print(ctc_log_probs[0,0,:10]) below to hang.
# H = k2.ctc_topo(
#           max_token=2,
#           modified=False,
#           device=device,
#       )

data = np.load('./input.npz')
ctc_log_probs = torch.from_numpy(data['ctc_log_probs'])
print(ctc_log_probs[0,0,:10])
ctc_log_probs = ctc_log_probs.to(device)
print(f"ctc_log_probs device: {ctc_log_probs.device}")
print(ctc_log_probs[0,0,:10])

H = k2.ctc_topo(
          max_token=2,
          modified=False,
          device=device,
      )

About the print(ctc_log_probs[0,0,:10]) hang: I checked further with Nsight Systems; it keeps executing the device-to-host copy, and only about 2 minutes later does the print succeed.

"export K2_SYNC_KERNELS=1, export CUDA_DEVICE_SYNCHRONIZE=1" didn't make the ctc_topo hang.

danpovey commented 2 years ago

@pkufool could you look at ctc_topo with an eye to any recent changes, and see if you can spot any potential problems?

danpovey commented 2 years ago

@yuekaizhang can you try with different values of max_token and see if it depends on that? You may be able to do some kind of bisection to find out the critical value at which there starts to be a problem.
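
A sketch of such a bisection (it assumes that once some max_token fails, all larger values fail too, and that the failure surfaces as an exception; if it surfaces as a hang instead, each probe has to be interrupted by hand):

import k2
import torch

device = torch.device('cuda:0')

def topo_ok(max_token: int) -> bool:
    # Build the CTC topology and force its kernels to finish.
    try:
        k2.ctc_topo(max_token=max_token, modified=False, device=device)
        torch.cuda.synchronize(device)
        return True
    except Exception:
        return False

lo, hi = 1, 4231          # 4231 is the value from the original report
while lo < hi:
    mid = (lo + hi) // 2
    if topo_ok(mid):
        lo = mid + 1      # everything up to mid works; look above it
    else:
        hi = mid          # mid fails; the first failure is at mid or below
first_bad = lo if not topo_ok(lo) else None
print("first failing max_token:", first_bad)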

danpovey commented 2 years ago

I suspect the issue may be in Eval2Device(). Look in eval.h. There are 3 branches in the code, and they may not all be as well tested. It might help to print out the sizes after GetBlockSizesForLambda2(), so we can know where to look for a potential error. You may even be able to set a breakpoint for GetBlockSizesForLambda2() in gdb, by going:

gdb --args [program] [args]
(gdb) b GetBlockSizesForLambda2
(gdb) r

and print out the sizes it returns, e.g. do "return" when you hit it, and then print out the values it outputs using "p [expression]".

pkufool commented 2 years ago

@pkufool can you please see whether ctc_topo might have a problem in one of its kernels that could cause a loop/hang?

Sorry, I missed this message.

@pkufool could you look at ctc_topo with an eye to any recent changes, and see if you can spot any potential problems?

Ok, I am looking at the code.

yuekaizhang commented 2 years ago

@yuekaizhang can you try with different values of max_token and see if it depends on that? You may be able to do some kind of bisection to find out the critical value at which there starts to be a problem.

I traversed the values from 1 to 1000; 512 may be the critical value. Full log: max_tokens.txt

import k2
import torch
import numpy as np

device = torch.device('cuda:2')

# H = k2.ctc_topo(
#           max_token=2,
#           modified=False,
#           device=device,
#       )

data = np.load('./input.npz')
ctc_log_probs = torch.from_numpy(data['ctc_log_probs'])
print(ctc_log_probs[0,0,:10])
ctc_log_probs = ctc_log_probs.to(device)
print(f"ctc_log_probs device: {ctc_log_probs.device}")
print(ctc_log_probs[0,0,:10])
for i in range(1,1000):
    try:
        H = k2.ctc_topo(
                max_token=i,
                modified=False,
                device=device,
            )
        print("max_token:",i, "works.")
    except Exception as e:
        pass

pkufool commented 2 years ago

@yuekaizhang I tried all your bad cases and could not reproduce your issues. My k2 version is:

Collecting environment information...

k2 version: 1.15.1
Build type: Release
Git SHA1: f8d2dba06c000ffee36aab5b66f24e7c9809f116
Git date: Thu Apr 21 12:20:34 2022
Cuda used to build k2: 10.2
cuDNN used to build k2: 8.3.2
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 18.04.5 LTS
CMake version: 3.10.2
GCC version: 7.5.0
CMAKE_CUDA_FLAGS:  --expt-extended-lambda -gencode arch=compute_70,code=sm_70 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 1.10.0+cu102
PyTorch is using Cuda: 10.2
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800

While yours is:

Collecting environment information...

k2 version: 1.15.1
Build type: Release
Git SHA1: c11c0b70e91d24935514b73d6bffddc8f5a07932
Git date: Sat Jun 4 14:06:20 2022
Cuda used to build k2: 11.0
cuDNN used to build k2: 8.0.5
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 18.04.6 LTS
CMake version: 3.23.2
GCC version: 7.5.0
CMAKE_CUDA_FLAGS:  --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 1.7.1+cu110
PyTorch is using Cuda: 11.0
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800
k2 abort: False

There are only a few commits between Apr 21 and Jun 4, and I did not see any problems in the code.

Wait a minute, I will set up your k2 version and see if I can reproduce your problem.

danpovey commented 2 years ago

It might be something to do with the device properties. What device?

yuekaizhang commented 2 years ago

It might be something to do with the device properties. What device?

I am using a V100 32 GB.

danpovey commented 2 years ago

One possibility is that it is a bug in RowIdsToRowSplits() that just happens to affect this exact size. RowIdsToRowSplits() invokes mgpu::sorted_search(...). The bug, if any, would likely be either in there or in how we invoked it (there is some subtlety about the indexes, I think). It could also have to do with what specific version of the kernel is being compiled and called, i.e. with the -gencode options and the architectures it uses; you might be using a more recent architecture there, like sm_75, that could be activating some subtle bug in moderngpu.
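
One thing that is easy to check on the failing machine is which compute capability the card reports, and therefore which of the -gencode targets from the build info above its kernels come from (a small, PyTorch-only sketch):

import torch

device = torch.device('cuda:0')
major, minor = torch.cuda.get_device_capability(device)
print(f"{torch.cuda.get_device_name(device)}: sm_{major}{minor}")
# A V100 reports sm_70, so with the flags shown above it would run the
# arch=compute_70,code=sm_70 build of the kernels, not sm_75.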

pkufool commented 2 years ago

I can reproduce the issue with @yuekaizhang's k2 version. Debugging...