harvardnlp / genbmm

CUDA kernels for generalized matrix-multiplication in PyTorch

RuntimeError: CUDA error: an illegal memory access was encountered #12

Closed: haozheji closed this issue 2 years ago

haozheji commented 2 years ago

Got this error after a fixed number of iterations, even with different random seeds. However, the number of iterations depends on the batch size. The GPU is not out of memory, so I suspect the bug comes from matmul_cuda_kernel.cu?

Here is my environment version:

Python: 3.7.3
PyTorch: 1.6.0+cu101
Driver Version: 418.67
CUDA Version: 10.1 (from nvidia-smi)
$CUDA_HOME: /usr/local/cuda-10.0
haozheji commented 2 years ago

When I further increase the batch size, a different error occurs, and the GPU is still not out of memory.

RuntimeError: CUDA error: invalid configuration argument

haozheji commented 2 years ago

I found that this error is raised when the batch size dimension is too large.

import torch
import genbmm

a = torch.randn(100000, 8, 8).cuda()
b = torch.randn(100000, 8, 8).cuda()
c = genbmm.logbmm(a, b)
print(c)

The following error is raised:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 153, in __repr__
    return torch._tensor_str._str(self)
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 371, in _str
    return _str_intern(self)
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 351, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 241, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 273, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 273, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 273, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 273, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/jihaozhe/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 266, in get_summarized_data
    return torch.cat((self[:PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems:]))
RuntimeError: CUDA error: invalid configuration argument

The actual batch size is not large (usually 32 or 16), but the input has additional length dimensions. Since logbmm only accepts three-dimensional inputs, I have to view() the input as a three-dimensional tensor, which makes the batch dimension very large.
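For concreteness, a minimal sketch of the reshape I mean (the shapes here are made up for illustration):

import torch

batch, length, m, k = 32, 2048, 8, 8  # hypothetical shapes; the real batch is small
a = torch.randn(batch, length, m, k).cuda()
b = torch.randn(batch, length, k, m).cuda()

# logbmm only accepts 3-D inputs, so the leading dimensions are flattened,
# inflating the effective batch dimension to batch * length = 65536 here.
a3 = a.view(batch * length, m, k)
b3 = b.view(batch * length, k, m)
# c = genbmm.logbmm(a3, b3)  # this flattened call is the one that fails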

haozheji commented 2 years ago

Digging further, I found that this triggers the error:

>>> a = torch.randn(65536,8,8).cuda()
>>> b = torch.randn(65536,8,8).cuda()
>>> c = genbmm.logbmm(a, b)
>>> print(c)
...
RuntimeError: CUDA error: invalid configuration argument

65535 is just fine:

>>> a = torch.randn(65535,8,8).cuda()
>>> b = torch.randn(65535,8,8).cuda()
>>> c = genbmm.logbmm(a, b)
>>> print(c)
tensor([[[2.3122, 2.8272, 2.3992,  ..., 1.7824, 2.2578, 2.4881],
         ...
         [3.2658, 3.6260, 2.9816,  ..., 2.3903, 1.8778, 2.1133]]],
       device='cuda:0')

Seems like something exceeds the 16-bit limit?

haozheji commented 2 years ago

The problem is caused by CUDA's limit on the grid size in dim y and dim z: gridDim.y and gridDim.z are capped at 65535, while gridDim.x can go up to 2^31 - 1. [Reference image: table of CUDA grid dimension limits]

However, switching the batch dimension from dim z to dim x in the CUDA kernel source code would, I suppose, sacrifice speed due to non-coalesced memory access.
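In the meantime, a host-side workaround (my own sketch, not part of genbmm) is to split the flattened batch into chunks of at most 65535 and concatenate the per-chunk results:

import torch
import genbmm

def logbmm_chunked(a, b, max_batch=65535):
    # Stay under the 65535 cap on gridDim.y/z by running logbmm on
    # batch slices and stitching the outputs back together.
    outs = [genbmm.logbmm(a[i:i + max_batch], b[i:i + max_batch])
            for i in range(0, a.shape[0], max_batch)]
    return torch.cat(outs, dim=0)

a = torch.randn(100000, 8, 8).cuda()
b = torch.randn(100000, 8, 8).cuda()
c = logbmm_chunked(a, b)  # succeeds where the direct call errors out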