Closed zjin-lcf closed 1 year ago
I assume you mean theoretical peak. I don't know since the peak of the memory itself isn't always the peak achievable by the processor. In some cases, it cannot be determined with public information. Things like load queue depth and mesh latencies can limit achievable peak.
I measure achievable peak with nstream from https://github.com/ParRes/Kernels/, which is obviously a lower bound on what the hardware can do. I use GEMM the same way for peak FMA rate.
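For readers unfamiliar with nstream, here is a minimal host-side sketch of the idea in plain C. It is not the ParRes code. The function name `nstream_bandwidth_gbs`, the `check` output parameter, and the use of C11 `timespec_get` for timing are assumptions made for illustration. The update `A[i] += B[i] + scalar * C[i]` moves three loads and one store per element, which is why the byte count uses a factor of 4.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Hypothetical helper: runs A[i] += B[i] + scalar*C[i] `iterations` times
 * over n doubles and returns the best observed bandwidth in GB/s,
 * or -1.0 if allocation fails. The final A[0] is written to *check so the
 * caller can verify the arithmetic (and the loop is not optimized away). */
double nstream_bandwidth_gbs(size_t n, int iterations, double *check)
{
    double *A = malloc(n * sizeof *A);
    double *B = malloc(n * sizeof *B);
    double *C = malloc(n * sizeof *C);
    if (!A || !B || !C) {
        free(A); free(B); free(C);
        return -1.0;
    }

    for (size_t i = 0; i < n; i++) {
        A[i] = 0.0;
        B[i] = 2.0;
        C[i] = 2.0;
    }

    const double scalar = 3.0;
    /* 3 loads (A, B, C) + 1 store (A) per element */
    const double gbytes = 4.0 * (double)n * sizeof(double) * 1.0e-9;
    double best = 0.0;

    for (int k = 0; k < iterations; k++) {
        double t0 = wall_seconds();
        for (size_t i = 0; i < n; i++) {
            A[i] += B[i] + scalar * C[i];
        }
        double dt = wall_seconds() - t0;
        if (dt > 0.0 && gbytes / dt > best) {
            best = gbytes / dt;
        }
    }

    *check = A[0];   /* each iteration adds 2 + 3*2 = 8 */
    free(A); free(B); free(C);
    return best;
}
```

Whatever number this returns is a measured lower bound for the machine it runs on. A GPU version would run the same loop as a kernel and time it with device-side events.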
gpu-detect.cu (https://github.com/jeffhammond/HPCInfo/blob/master/cuda/gpu-detect.cu) computes DDR-based theoretical peak memory bandwidth. Could the program also compute HBM-based peak memory bandwidth? Thanks.
It already does. See below for RTX 2060 and A100-80G output. As far as I know, the reported peak bandwidth is correct.
$ ./gpu-detect
============= GPU number 0 =============
GPU name = NVIDIA GeForce RTX 2060.
Compute Capability (CC) = 7.5
memoryClockRate (CUDA device query) = 5.501 GHz
memoryBusWidth (CUDA device query) = 192 bits
peak bandwidth = 264.0 GB/s
totalGlobalMem = 5925 MiB
multiProcessorCount = 30
warpSize = 32
clockRate (CUDA device query) = 1.560 GHz
FP64 FMA/clock per SM = 2
FP32 FMA/clock per SM = 64
GigaFP64/second per GPU = 187.2
GigaFP32/second per GPU = 5990.4
unifiedAddressing = 1
managedMemory = 1
pageableMemoryAccess = 0
pageableMemoryAccessUsesHostPageTables = 0
concurrentManagedAccess = 1
canMapHostMemory = 1
$ ./gpu-detect
============= GPU number 0 =============
GPU name = NVIDIA A100 80GB PCIe.
Compute Capability (CC) = 8.0
memoryClockRate (CUDA device query) = 1.512 GHz
memoryBusWidth (CUDA device query) = 5120 bits
peak bandwidth = 1935.4 GB/s
totalGlobalMem = 81069 MiB
multiProcessorCount = 108
warpSize = 32
clockRate (CUDA device query) = 1.410 GHz
FP64 FMA/clock per SM = 32
FP32 FMA/clock per SM = 64
GigaFP64/second per GPU = 9745.9
GigaFP32/second per GPU = 19491.8
unifiedAddressing = 1
managedMemory = 1
pageableMemoryAccess = 0
pageableMemoryAccessUsesHostPageTables = 0
concurrentManagedAccess = 1
canMapHostMemory = 1
I'm not sure where you got the idea that this code computes DDR-based theoretical peak memory bandwidth. I put code in there to handle the Xavier and Orin AGX platforms correctly, because those have a single LPDDRx memory shared by the CPU and GPU.
I got it from the code below. I assume that the theoretical peak memory bandwidth of a GPU is related to the type of memory (DDR or HBM). Is that right?
// 2 for Dual Data Rate (https://en.wikipedia.org/wiki/Double_data_rate)
// 1/8 = 0.125 for bit to byte
printf("peak bandwidth = %.1f GB/s\n", 2 * memoryClock*1.e-6 * memoryBusWidth * 0.125);
That's there because I needed a factor of 2 to make the numbers correct.
I think the memory bus width implies the type of memory a GPU is connected to. Thanks.
https://en.wikipedia.org/wiki/High_Bandwidth_Memory says that HBM is a DDR memory, which confirms the above.
Each channel interface maintains a 128‑bit data bus operating at double data rate (DDR). HBM supports transfer rates of 1 GT/s per pin (transferring 1 bit), yielding an overall package bandwidth of 128 GB/s.
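As a sanity check on the formula quoted above, the following C sketch wraps it in a function and can be evaluated against the two outputs posted earlier. The function name `peak_bandwidth_gbs` is hypothetical. The inputs are assumed to be in the units that `cudaDeviceProp::memoryClockRate` (kHz) and `cudaDeviceProp::memoryBusWidth` (bits) report, and the exact kHz values used in the usage note (5501000 and 1512000) are inferred from the GHz figures printed in the output, not read from a device.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper mirroring the printf expression in gpu-detect.cu.
 * memory_clock_khz: memory clock in kHz (as in cudaDeviceProp::memoryClockRate)
 * bus_width_bits:   memory bus width in bits (as in cudaDeviceProp::memoryBusWidth)
 * Returns the theoretical peak bandwidth in GB/s. */
double peak_bandwidth_gbs(int memory_clock_khz, int bus_width_bits)
{
    assert(memory_clock_khz > 0 && bus_width_bits > 0);

    const double clock_ghz = memory_clock_khz * 1.0e-6;       /* kHz -> GHz */
    const double bytes_per_transfer = bus_width_bits * 0.125; /* bits -> bytes */

    /* Factor of 2: two transfers per clock (double data rate). */
    return 2.0 * clock_ghz * bytes_per_transfer;
}
```

Calling `peak_bandwidth_gbs(5501000, 192)` gives about 264.0 GB/s for the GDDR6 RTX 2060, and `peak_bandwidth_gbs(1512000, 5120)` gives about 1935.4 GB/s for the HBM-based A100, so the same expression reproduces both outputs without a separate HBM branch.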
In our lab, the Xavier platform shows:
Device 0: "Xavier"
CUDA Driver Version / Runtime Version 11.4 / 11.4
CUDA Capability Major/Minor version number: 7.2
Total amount of global memory: 14907 MBytes (15631482880 bytes)
(008) Multiprocessors, (064) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 1377 MHz (1.38 GHz)
Memory Clock rate: 1377 Mhz
However, you override the value.
// Xavier AGX override
if (major==7 && minor==2) {
memoryClock = 2133000; // 2.133 GHz in Khz
printf("memoryClockRate (Xavier AGX LPDDR4x) = %.3f GHz\n", memoryClock*1.e-6);
}
Yes, I did that because CUDA device query is not accurate for AGX systems.
https://info.nvidia.com/rs/156-OFN-742/images/Jetson_AGX_Xavier_New_Era_Autonomous_Machines.pdf says: "16GB 256-bit LPDDR4x @ 2133MHz 137 GB/s."
I have both a Xavier AGX and an Orin AGX system in my home lab, and I can measure bandwidth higher than would be possible if the memory clock were 1.377 GHz.
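The argument above can be made concrete with a small C sketch: compute the upper bound implied by each candidate clock on a 256-bit bus, then test whether a measurement exceeds it. The function names are hypothetical, and the 100.0 GB/s figure in the usage note is a made-up placeholder, not a measurement from this thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: theoretical peak in GB/s for a double-data-rate
 * memory with the given clock (GHz) and bus width (bits). */
double ddr_upper_bound_gbs(double clock_ghz, int bus_width_bits)
{
    assert(clock_ghz > 0.0 && bus_width_bits > 0);
    return 2.0 * clock_ghz * (bus_width_bits / 8.0);
}

/* Returns 1 if a measured bandwidth is higher than the theoretical peak
 * for the assumed clock, which means the assumed clock must be too low. */
int measurement_exceeds_bound(double measured_gbs, double clock_ghz,
                              int bus_width_bits)
{
    return measured_gbs > ddr_upper_bound_gbs(clock_ghz, bus_width_bits);
}
```

With a 256-bit bus, a 1.377 GHz clock bounds the bandwidth at about 88.1 GB/s, while 2.133 GHz gives about 136.5 GB/s, close to the 137 GB/s in NVIDIA's document. A hypothetical measurement of 100.0 GB/s would therefore rule out 1.377 GHz (`measurement_exceeds_bound(100.0, 1.377, 256)` returns 1) but is consistent with 2.133 GHz.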
Thank you for finding them. I hope that CUDA device query will show the accurate results for these platforms.
I wouldn't count on that. Just use my code, because I wrote it specifically so that I could write a query that would give the right answers on all of the systems I care about.
Okay.
Do you have any ideas for computing peak memory bandwidth in a program when the memory type is HBM? Thanks.