tinygrad / open-gpu-kernel-modules

NVIDIA Linux open GPU with P2P support

Getting RuntimeError: CUDA error: an illegal memory access was encountered with 3090s #4

Open murtaza-nasir opened 2 months ago

murtaza-nasir commented 2 months ago

NVIDIA Open GPU Kernel Modules Version

The P2P-enabled driver from this repository.

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Ubuntu 22.04.4 LTS

Kernel Release

6.5.0-27-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

all (4x 3090)

Describe the bug

I installed this driver, and torch.cuda.can_device_access_peer(a, b) returns True for every GPU pair.

I get the following error when textgenwebui tries to load a model:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Aphrodite also crashes when loading any model.
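For anyone trying to reproduce outside of textgenwebui, a minimal sketch along these lines (a hypothetical cross-device copy after checking the P2P report, not the actual textgenwebui or Aphrodite code path) exercises the same kind of peer transfer that sharded model loading does:

# Illustrative sketch only: check the P2P report, then force a small
# device-to-device copy, which is roughly what sharded model loading does.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before CUDA init so errors surface synchronously

import torch

n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b:
            print(f"peer access {a} -> {b}: {torch.cuda.can_device_access_peer(a, b)}")

x = torch.randn(1024, 1024, device="cuda:0")
y = x.to("cuda:1")  # cross-device copy; an illegal access would typically surface here
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
print("copy ok:", torch.allclose(x.cpu(), y.cpu()))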

To Reproduce

I installed this driver on Ubuntu.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

dbateyko commented 2 months ago

FWIW, others are experiencing similar problems. I'm seeing the same error in text-generation-webui at inference (using the ExLlamav2 model loader) after enabling resizable BAR on two 3090s, but before installing this driver. Could it be a problem in text-generation-webui itself? In any case, I'm following for a solution.

murtaza-nasir commented 2 months ago

FWIW, others are experiencing similar problems. I'm seeing the same error in text-generation-webui at inference (using the ExLlamav2 model loader) after enabling resizable BAR on two 3090s, but before installing this driver. Could it be a problem in text-generation-webui itself? In any case, I'm following for a solution.

If you're referring to that post on y-combinator, that is me. I got this error after installing this driver.

geohot commented 2 months ago

This is only tested on 4090s, no idea if it works on anything else.

Though if you don't have large BAR on your 3090s, I can confirm it won't work.

murtaza-nasir commented 2 months ago

This is only tested on 4090s, no idea if it works on anything else.

Though if you don't have large BAR on your 3090s, I can confirm it won't work.

I did check with lspci, and all my GPUs show the 32G BAR line. Not sure why I'm getting this error; this is a fresh Ubuntu install. I don't have IOMMU enabled in the Ubuntu GRUB settings, but I don't think I've disabled it in the BIOS yet. Will try that and see if that is the problem.

Edit: I disabled IOMMU in the BIOS but still see this error.
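If it helps anyone else verify, a small sketch like this (assuming the optional nvidia-ml-py / pynvml package is installed) reports the BAR1 aperture per GPU, which should match the 32G line that lspci shows:

# Sketch: print each GPU's BAR1 aperture via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    bar1 = pynvml.nvmlDeviceGetBAR1MemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i))
    # bar1Total is in bytes; with resizable BAR a 24 GB 3090 should report roughly 32 GiB
    print(f"GPU {i}: BAR1 total {bar1.bar1Total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()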

brthor commented 1 month ago

This is working for me with 3090s.

Didn't have to do anything but enable resizable BAR in the BIOS.

Ensure you have the correct driver version installed.

The low bandwidth here is probably down to the motherboard.

nvidia-smi

$ nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:0A:00.0 Off |                  N/A |
| 42%   37C    P0            115W /  350W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:0B:00.0 Off |                  N/A |
| 39%   33C    P0            115W /  350W |       0MiB /  24576MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
$ nvidia-smi topo -p2p rw
        GPU0    GPU1    
 GPU0   X       OK      
 GPU1   OK      X     

p2pBandwidthLatencyTest

$ ./cuda-samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: a, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: b, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

...

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 829.79   6.14 
     1   6.14 831.55 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 821.94  13.18 
     1  13.18 832.81 

NCCL

$ ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 2

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1073741824     268435456     float     sum      -1   117080    9.17    9.17      0   116952    9.18    9.18      0
  2147483648     536870912     float     sum      -1   234000    9.18    9.18      0   233994    9.18    9.18      0
  4294967296    1073741824     float     sum      -1   468088    9.18    9.18      0   467922    9.18    9.18      0
dce51d9dafe1:992:992 [1] NCCL INFO comm 0x55a73639d730 rank 0 nranks 2 cudaDev 0 busId a000 - Destroy COMPLETE
dce51d9dafe1:992:992 [1] NCCL INFO comm 0x55a7363a3530 rank 1 nranks 2 cudaDev 1 busId b000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 9.17687 

vs. NCCL_P2P_DISABLE=1

$ NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 2

...

# Avg bus bandwidth    : 7.17877 
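For a quick sanity check without building cuda-samples, a rough torch sketch like this (assuming two visible GPUs; not a substitute for p2pBandwidthLatencyTest) gives a ballpark device-to-device copy bandwidth:

# Rough sketch: time repeated cuda:0 -> cuda:1 copies for a ballpark bandwidth figure.
import time
import torch

size_mb = 256
x = torch.empty(size_mb * 2**20, dtype=torch.uint8, device="cuda:0")
y = torch.empty_like(x, device="cuda:1")

for _ in range(3):  # warm-up copies
    y.copy_(x)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

iters = 20
start = time.time()
for _ in range(iters):
    y.copy_(x)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.time() - start

print(f"~{size_mb * iters / 1024 / elapsed:.1f} GB/s cuda:0 -> cuda:1")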
t13m commented 1 month ago

Hi @brthor, how did you enable large BAR1 on the 3090s? Can you share your method if you don't mind? Or are there any tutorials/instructions anywhere? Thank you!

murtaza-nasir commented 1 month ago

Hi @brthor, how did you enable large BAR1 on the 3090s? Can you share your method if you don't mind? Or are there any tutorials/instructions anywhere? Thank you!

Your GPU will have it if your motherboard supports it and you have it turned on.

t13m commented 1 month ago

Like turning it on in the motherboard BIOS? Which motherboard are you using? Does the GPU vBIOS or firmware need to be updated?

murtaza-nasir commented 1 month ago

Yes, you just turn it on in the BIOS. Make sure you have Above 4G Decoding and Resizable BAR support enabled. My TR Zenith II Extreme has it, and the GPUs show large BAR support. I also have an EPYC Supermicro H12SSLi that doesn't have ReBAR in the BIOS, so the 3090s don't show it when checked.

t13m commented 1 month ago

That helps a lot, thank you!

brthor commented 1 month ago

@t13m Resizable BAR must be supported in the GPU's vBIOS first of all; that has been the case with the 3090s I have.

If you don't have motherboard support you may be able to use https://github.com/xCuri0/ReBarUEFI

You can also try setting NVReg_EnableResizableBar=1 (do a Google search for where to set this; it goes in a modprobe.d file), but I didn't have success with that method.

scouzi1966 commented 4 weeks ago

I'm perplexed as to why this isn't more popular. Another question: could I mix a 4090 with a 3090? What would be the drawbacks? I would like to get the benefit of more memory versus more performance. Is performance the only downside to running a 3090/4090 combo?

murtaza-nasir commented 4 weeks ago

I would like to get the benefit of more memory versus more performance. Is performance the only downside to running a 3090/4090 combo?

Yes, if you have a 4090 and just want more memory, a 3090 will do that. However, you would be stuck at the 3090's performance level. I would personally prefer 3x 3090s over 1x 4090 plus 1x 3090.