tinygrad / open-gpu-kernel-modules

NVIDIA Linux open GPU with P2P support

mapping of buffer object failed #16

Open DunkHimYo opened 1 month ago

DunkHimYo commented 1 month ago

NVIDIA Open GPU Kernel Modules Version

550.90.07

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Ubuntu 22.04.4 LTS

Kernel Release

Linux Product-Name 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA GeForce RTX 3090 x2

Describe the bug

-----------------------------------------------------ERROR MESSAGE-----------------------------------------------------

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure. So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0     1
     0     1     1
     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 830.23   6.13
     1   6.14 832.89
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
Cuda failure p2pBandwidthLatencyTest.cu:189: 'mapping of buffer object failed'

It seems to detect P2P, but the test itself fails.

To Reproduce

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   52C    P8             18W /  350W |      15MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   53C    P8             16W /  350W |      15MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1927      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A      1927      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

I have checked the CUDA version, confirmed BAR support, and turned the IOMMU off, so I don't know what else to try.
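For anyone retracing the "IOMMU off" check, here is a minimal Python sketch that lists IOMMU-related parameters on the running kernel's command line. The parameter names in `IOMMU_KEYS` and the helper names are assumptions chosen for illustration; the report does not say which setting was actually used.

```python
from pathlib import Path

# Assumption: these are the kernel parameters commonly used to control the
# IOMMU; the original report does not say which one was set.
IOMMU_KEYS = ("iommu", "intel_iommu", "amd_iommu")


def iommu_params(cmdline: str) -> dict:
    """Return the IOMMU-related key=value pairs found on a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in IOMMU_KEYS:
            found[key] = value
    return found


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line; empty string if unavailable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    params = iommu_params(read_cmdline())
    if params:
        for key, value in params.items():
            print(f"{key}={value}")
    else:
        print("no IOMMU parameters found on the kernel command line")
```

If nothing is printed for a parameter you expected, the bootloader change may not have been applied to the kernel that is actually running.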

Bug Incidence

Always

nvidia-bug-report.log.gz


More Info

No response

mylesgoose commented 2 weeks ago

Hello. This means you have not enabled large BAR support in your BIOS. Go to your BIOS and enable it. My computer gave the same results. I also had to add iommu=off to the GRUB menu. Here are my test results when I disable it in the BIOS:

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1      2      3      4      5      6
     0 723.04  19.25  21.47  20.97  21.29  21.45   2.83
     1  18.39 922.88  20.83  21.36  21.49  21.42   2.84
     2  21.56  20.20 921.83  21.77  21.31  21.98   2.85
     3  21.46  21.44  21.76 920.79  21.64  21.86   2.85
     4  20.04  21.54  21.76  16.62 922.37  20.61   2.85
     5  21.44  21.59  21.80  20.15  20.67 923.46   2.85
     6   2.81   2.81   2.82   2.81   2.82   2.82 287.68
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
Cuda failure p2pBandwidthLatencyTest.cu:189: 'mapping of buffer object failed'
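One way to check whether the large BAR setting took effect is to compare each GPU's BAR1 size with its framebuffer size. The sketch below is a rough Python helper that parses the text of `nvidia-smi -q -d MEMORY`. The section labels it matches ("FB Memory Usage", "BAR1 Memory Usage") and the "BAR1 at least as large as the framebuffer" rule are assumptions about the driver's report format and a heuristic, not something confirmed in this thread.

```python
import re
import subprocess

# Assumption: the report contains "FB Memory Usage" and "BAR1 Memory Usage"
# sections, each followed by a line such as "Total : 24576 MiB".
_TOTAL = r"\s+Total\s*:\s*(\d+)\s*MiB"


def memory_totals(report: str) -> list:
    """Return one (fb_total_mib, bar1_total_mib) tuple per GPU in the report."""
    fb = [int(m) for m in re.findall(r"FB Memory Usage" + _TOTAL, report)]
    bar1 = [int(m) for m in re.findall(r"BAR1 Memory Usage" + _TOTAL, report)]
    return list(zip(fb, bar1))


def looks_like_large_bar(fb_mib: int, bar1_mib: int) -> bool:
    """Heuristic (assumption): BAR1 is at least as large as the framebuffer."""
    return bar1_mib >= fb_mib


if __name__ == "__main__":
    try:
        report = subprocess.run(
            ["nvidia-smi", "-q", "-d", "MEMORY"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        report = ""  # nvidia-smi missing or failed; nothing to report
    for index, (fb, bar1) in enumerate(memory_totals(report)):
        state = "large BAR likely on" if looks_like_large_bar(fb, bar1) else "small BAR"
        print(f"GPU {index}: FB {fb} MiB, BAR1 {bar1} MiB -> {state}")
```

A BAR1 of only a few hundred MiB on a 24 GiB card would suggest the BIOS option is still off.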

And here are my test results with it enabled, running ./p2pBandwidthLatencyTest:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 4090, pciBusID: 2c, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 4090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA GeForce RTX 4090, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA GeForce RTX 4090, pciBusID: 62, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA GeForce GTX 1660 SUPER, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CANNOT Access Peer Device=6
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CANNOT Access Peer Device=6
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CANNOT Access Peer Device=6
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CANNOT Access Peer Device=6
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CANNOT Access Peer Device=6
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CANNOT Access Peer Device=6
Device=6 CANNOT Access Peer Device=0
Device=6 CANNOT Access Peer Device=1
Device=6 CANNOT Access Peer Device=2
Device=6 CANNOT Access Peer Device=3
Device=6 CANNOT Access Peer Device=4
Device=6 CANNOT Access Peer Device=5

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure. So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0     1     2     3     4     5     6
     0     1     1     1     1     1     1     0
     1     1     1     1     1     1     1     0
     2     1     1     1     1     1     1     0
     3     1     1     1     1     1     1     0
     4     1     1     1     1     1     1     0
     5     1     1     1     1     1     1     0
     6     0     0     0     0     0     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1      2      3      4      5      6
     0 911.56  19.98  21.31  21.31  21.52  21.23   2.82
     1  20.25 922.93  21.43  21.48  21.31  21.47   2.83
     2  21.52  21.77 924.06  21.84  21.79  21.99   2.84
     3  21.44  21.63  21.84 922.29  21.65  21.77   2.82
     4  21.47  21.77  21.77  21.73 924.56  20.64   2.83
     5  21.22  21.53  21.70  21.79  20.67 922.37   2.84
     6   2.81   2.81   2.81   2.81   2.81   2.81 265.19
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D      0      1      2      3      4      5      6
     0 913.74  26.09  25.74  25.60  25.82  26.28   2.83
     1  26.10 939.57  25.64  25.45  25.51  26.12   2.82
     2  26.32  26.33 941.83  25.41  25.80  26.34   2.84
     3  25.72  26.33  25.90 938.61  25.69  26.34   2.83
     4  25.84  26.34  25.81  25.45 939.60  26.33   2.83
     5  25.85  26.33  25.91  25.20  26.34 939.06   2.83
     6   2.80   2.81   2.81   2.81   2.81   2.81 267.09
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1      2      3      4      5      6
     0 919.12  21.03  30.10  28.82  30.04  29.62   3.93
     1  20.65 923.74  30.13  29.24  29.86  29.50   3.93
     2  29.57  29.97 923.92  30.16  30.83  30.75   3.94
     3  27.97  25.97  30.42 922.37  30.78  29.60   3.93
     4  29.75  30.16  30.78  30.38 923.40  22.85   3.94
     5  29.73  29.58  30.79  30.41  22.90 923.46   3.94
     6   3.92   3.92   3.92   3.92   3.92   3.93 274.99
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D      0      1      2      3      4      5      6
     0 918.58  50.44  50.18  49.97  50.36  51.00   3.93
     1  51.02 921.54  50.17  50.09  50.04  51.08   3.91
     2  51.02  51.12 923.11  50.30  50.11  51.12   3.94
     3  50.46  51.12  50.40 920.47  50.39  51.09   3.93
     4  50.13  51.12  50.53  50.00 922.10  51.12   3.94
     5  50.45  51.09  50.46  50.47  51.09 920.95   3.94
     6   3.92   3.92   3.93   3.92   3.93   3.93 278.45
P2P=Disabled Latency Matrix (us)
   GPU      0      1      2      3      4      5      6
     0   1.51  11.52  11.51  11.51  15.04  11.60  15.30
     1  11.91   1.46  11.53  11.34  11.35  11.44  15.54
     2  11.78  11.78   1.37  20.18  11.64  11.34  16.35
     3  11.33  11.58  20.09   1.52  11.36  11.51  16.32
     4  11.52  11.59  11.34  11.34   1.55  20.00  15.72
     5  16.51  11.52  11.59  20.09  11.82   1.48  15.91
     6  16.11  15.91  16.29  16.76  16.28  16.20   1.24

   CPU      0      1      2      3      4      5      6
     0   2.82  17.00  17.01   9.00  49.89  17.39   8.97
     1   9.07   2.90  16.86   8.64  16.89  25.06  17.02
     2  17.12   8.96   2.85   8.78  16.99  16.87   8.92
     3  16.88   8.71  16.80   2.70   8.56  32.37   8.85
     4   8.78   8.68  33.99  37.31   2.78   8.59  24.49
     5   8.74   8.65  16.45   8.44  26.98   2.77  21.66
     6   8.84  29.51   8.96  24.81   8.69  16.66   2.73
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0      1      2      3      4      5      6
     0   1.52   1.00   1.23   1.23   1.26   1.26  16.49
     1   1.04   1.46   1.24   1.23   1.23   1.23  16.31
     2   1.18   1.16   1.44   1.18   1.17   1.19  15.82
     3   1.23   1.24   1.27   1.52   1.24   1.23  15.29
     4   1.16   1.18   1.18   1.18   1.48   1.02  15.78
     5   1.28   1.19   1.14   1.16   1.04   1.47  15.54
     6  15.60  15.57  15.38  16.46  15.23  16.29   1.25

   CPU      0      1      2      3      4      5      6
     0   2.89  10.48   2.40   2.47   2.54   2.43   8.98
     1   2.57   2.83   2.48  18.27   2.46   2.49   9.25
     2   2.54   2.46   2.87   2.45  16.38   2.45  21.76
     3   2.47   2.37  10.50   2.86   2.41   2.35   8.83
     4   2.41   2.37   2.41   2.43   2.76   2.32   8.90
     5  10.55   2.33   2.30   2.35   2.36   2.74  24.83
     6  30.80   8.93  17.37  17.04   8.71  24.96  11.09

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

So enable it.
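The matrices in this thread are easier to compare once each is reduced to a single number, such as the mean of its off-diagonal (GPU-to-GPU) entries. The Python sketch below rebuilds a matrix from whitespace-flattened p2pBandwidthLatencyTest output and computes that mean. The function names are made up for illustration, and the parser assumes a square matrix whose header is one label token followed by the column indices.

```python
def parse_matrix(flat: str, n: int) -> list:
    """Rebuild an n x n matrix from whitespace-flattened test output.

    Assumes the header comes first (one label token plus n column indices),
    followed by n rows, each a row index and then n values.
    """
    tokens = flat.split()[n + 1:]  # drop the "D\D 0 1 ..." header
    rows = []
    for r in range(n):
        chunk = tokens[r * (n + 1):(r + 1) * (n + 1)]
        rows.append([float(x) for x in chunk[1:]])  # chunk[0] is the row index
    return rows


def mean_off_diagonal(matrix: list) -> float:
    """Average of all entries where source and destination GPU differ."""
    values = [v for i, row in enumerate(matrix)
              for j, v in enumerate(row) if i != j]
    return sum(values) / len(values)


# Two-GPU "Unidirectional P2P=Disabled" matrix quoted from the original report.
disabled = parse_matrix(r"D\D 0 1 0 830.23 6.13 1 6.14 832.89", 2)
print(f"mean GPU-to-GPU bandwidth, P2P disabled: {mean_off_diagonal(disabled):.2f} GB/s")
```

Running the same two functions on a P2P=Enabled matrix gives a like-for-like figure, keeping in mind that the sample itself warns against treating these numbers as benchmarks.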