thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
MIT License

cuda error #19

Closed Michelleable closed 7 months ago

Michelleable commented 8 months ago

```python
cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()
```

Michelleable commented 8 months ago

```
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
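For reference, a minimal sketch of enabling the debugging flag the error message suggests (the env var must be set before the first CUDA call, so before torch initializes CUDA):

```python
# Debugging sketch: CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches,
# so the error is raised at the call that actually failed. Set it before torch
# initializes CUDA (at the very top of the script or in the shell).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the env var is set
```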

Michelleable commented 8 months ago

I found that the initial inference runs without any problem, but a CUDA error is reported during subsequent inferences. How can I solve this?

guyan364 commented 8 months ago

Hi, could you share more details about your platform and environment? Such as CUDA and PyTorch versions, and operating system.

Michelleable commented 8 months ago

> Hi, could you share more details about your platform and environment? Such as CUDA and PyTorch versions, and operating system.

thanks for your time.

torch 2.1.0a0+b5021ba, Driver Version: 470.199.02, CUDA Version: 12.1, Ubuntu 22.04.2 LTS

I tried using both the triton==3.0.0 and torch attention implementations, but it still failed. (It seems the attention runs after the memory-unit operations.)

guyan364 commented 8 months ago

> thanks for your time.
>
> torch 2.1.0a0+b5021ba, Driver Version: 470.199.02, CUDA Version: 12.1, Ubuntu 22.04.2 LTS
>
> I tried using both the triton==3.0.0 and torch attention implementations, but it still failed. (It seems the attention runs after the memory-unit operations.)

I noticed that you're using a driver version that's lower than what CUDA 12 requires. I'm not familiar with CUDA compatibility, but I think using PyTorch compiled with CUDA 12 might lead to some operations being incompatible. You could run this simple test:

```python
import torch

data = torch.randn((32, 128, 128), dtype=torch.bfloat16, device='cuda')
torch.cuda.synchronize()
# Plain blocking device-to-host copy.
cpu_data = data.to("cpu")
torch.cuda.synchronize()
# Asynchronous device-to-host copy.
non_blocking_cpu_data = data.to("cpu", non_blocking=True)
torch.cuda.synchronize()
# Pin an already-copied host tensor.
pin_data = cpu_data.pin_memory()
torch.cuda.synchronize()
# Asynchronous copy followed by pinning (the pattern that fails in this issue).
non_blocking_pin_data = data.to("cpu", non_blocking=True).pin_memory()
torch.cuda.synchronize()
```

The issue might stem from non_blocking copies or pinned memory not being supported. If that's the case, modifying the code in inf_llm/attention/context_manager.py could work, though it might be slightly slower. However, I'd still recommend using a fully supported CUDA version like 11.8.
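As a rough illustration of that kind of modification (a sketch only; the function name below is hypothetical and not the actual code in inf_llm/attention/context_manager.py):

```python
import torch

# Illustrative sketch only: offload a GPU tensor to the CPU, optionally
# avoiding non_blocking copies and pinned memory on platforms where they
# are not supported.
def offload_to_cpu(data: torch.Tensor, pin_memory: bool = True) -> torch.Tensor:
    if pin_memory:
        # Fast path: asynchronous device-to-host copy into pinned host memory.
        return data.contiguous().to("cpu", non_blocking=True).pin_memory()
    # Compatible path: synchronous copy into ordinary pageable memory.
    return data.contiguous().cpu()
```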

Michelleable commented 8 months ago

> I noticed that you're using a driver version that's lower than what CUDA 12 requires. [...] The issue might stem from non_blocking copies or pinned memory not being supported. [...] I'd still recommend using a fully supported CUDA version like 11.8.

Unfortunately, the test code runs fine. And when I tried InfLLM, the first inference works, but the error is reported from the second inference onward.

guyan364 commented 8 months ago

> Unfortunately, the test code runs fine. And when I tried InfLLM, the first inference works, but the error is reported from the second inference onward.

Well, I think the optimized code currently has some compatibility issues. You can try using the `init` branch to reproduce the results of our paper, though it may be somewhat slow. We will improve the code's compatibility later on.