Michelleable closed this issue 7 months ago.
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
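(Acting on that hint; a minimal sketch, assuming the variable is set before torch initializes its CUDA context:)

# CUDA_LAUNCH_BLOCKING must be set before torch touches the GPU,
# so set it at the very top of the script, before importing torch.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Kernel launches are now synchronous, so any CUDA error surfaces
# at the exact call that caused it instead of at a later API call.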
I found that the initial inference runs without problems, but a CUDA error is reported on subsequent inferences. How can I solve this?
Hi, could you share more details about your platform and environment, such as your CUDA and PyTorch versions and your operating system?
Thanks for your time.
torch 2.1.0a0+b5021ba, Driver Version: 470.199.02, CUDA Version: 12.1, Ubuntu 22.04.2 LTS
I tried both the triton==3.0.0 and the torch attention implementations, but both still failed. (It seems the error occurs in the attention step, after the memory-unit operations.)
I noticed that you're using a driver version lower than what CUDA 12 requires. I'm not deeply familiar with CUDA compatibility, but PyTorch compiled with CUDA 12 might then hit operations your driver doesn't support. You could run this simple test code:
import torch

# Allocate a bfloat16 tensor on the GPU.
data = torch.randn((32, 128, 128), dtype=torch.bfloat16, device='cuda')
torch.cuda.synchronize()

# 1. Plain blocking device-to-host copy.
cpu_data = data.to("cpu")
torch.cuda.synchronize()

# 2. Asynchronous (non-blocking) device-to-host copy.
non_blocking_cpu_data = data.to("cpu", non_blocking=True)
torch.cuda.synchronize()

# 3. Pin (page-lock) the host memory of the blocking copy.
pin_data = cpu_data.pin_memory()
torch.cuda.synchronize()

# 4. Asynchronous copy followed by pinning.
non_blocking_pin_data = data.to("cpu", non_blocking=True).pin_memory()
torch.cuda.synchronize()
The issue might stem from non_blocking copies or pinned memory not being supported on your setup. If that's the case, modifying the code in inf_llm/attention/context_manager.py to avoid them could work, though it might be slightly slower. However, I'd still recommend using a fully supported CUDA version, such as 11.8.
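As an illustration of the kind of change meant here (a hypothetical sketch; the helper name and flag are invented, not taken from context_manager.py), the fallback would replace the non-blocking pinned copy with a plain blocking one:

import torch

def offload_to_cpu(t: torch.Tensor, use_async: bool = True) -> torch.Tensor:
    # Hypothetical helper: copy a CUDA tensor to host memory.
    if use_async:
        # Fast path: async device-to-host copy, then page-lock the
        # host buffer so later copies back to the GPU can also be async.
        return t.to("cpu", non_blocking=True).pin_memory()
    # Fallback for platforms where the above raises
    # "CUDA error: operation not supported": a synchronous copy
    # into ordinary pageable memory. Slower, but portable.
    return t.to("cpu")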
Unfortunately… the test code runs without errors. And when I tried InfLLM, the first inference works, but the error is reported from the second inference onward.
Well, I think the optimized code currently has some compatibility issues. You can try using the init branch to reproduce the results of our paper, though it may be somewhat slow. We will improve the code's compatibility later on.
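For reference, the same copy with contiguous() called first (presumably the workaround applied in context_manager.py):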
# Force a dense memory layout before the async device-to-host copy.
cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()
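A plausible explanation (an educated guess, not confirmed in this thread): contiguous() materializes the tensor into a single dense buffer, so the device-to-host transfer becomes one flat copy rather than a strided one, which may sidestep the unsupported path on older driver stacks.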