jeffhammond / HPCInfo

Information about many aspects of high-performance computing. Wiki content moved to ~/docs.
https://github.com/jeffhammond/HPCInfo/tree/master/docs
MIT License

question about gpu detect #8

Closed zjin-lcf closed 1 year ago

zjin-lcf commented 1 year ago

Do you have any ideas on how to compute the theoretical peak memory bandwidth in a program when the memory type is HBM? Thanks.

jeffhammond commented 1 year ago

I assume you mean theoretical peak. I don't know since the peak of the memory itself isn't always the peak achievable by the processor. In some cases, it cannot be determined with public information. Things like load queue depth and mesh latencies can limit achievable peak.

The nstream kernel from https://github.com/ParRes/Kernels/ is how I measure achievable peak, which is obviously a lower bound on what the hardware can do. I use GEMM the same way for the peak FMA rate.

zjin-lcf commented 1 year ago

(https://github.com/jeffhammond/HPCInfo/blob/master/cuda/gpu-detect.cu) computes the DDR-based theoretical peak memory bandwidth. Could the program compute the HBM-based peak memory bandwidth? Thanks.

jeffhammond commented 1 year ago

It already does. See below for output from an RTX 2060 and an A100-80G. As far as I know, the peak bandwidth values are correct.

$ ./gpu-detect
============= GPU number 0 =============
GPU name                                = NVIDIA GeForce RTX 2060.
Compute Capability (CC)                 = 7.5
memoryClockRate (CUDA device query)     = 5.501 GHz
memoryBusWidth (CUDA device query)      = 192 bits
peak bandwidth                          = 264.0 GB/s
totalGlobalMem                          = 5925 MiB
multiProcessorCount                     = 30
warpSize                                = 32
clockRate (CUDA device query)           = 1.560 GHz
FP64 FMA/clock per SM                   = 2
FP32 FMA/clock per SM                   = 64
GigaFP64/second per GPU                 = 187.2
GigaFP32/second per GPU                 = 5990.4
unifiedAddressing                       = 1
managedMemory                           = 1
pageableMemoryAccess                    = 0
pageableMemoryAccessUsesHostPageTables  = 0
concurrentManagedAccess                 = 1
canMapHostMemory                        = 1
$ ./gpu-detect
============= GPU number 0 =============
GPU name                                = NVIDIA A100 80GB PCIe.
Compute Capability (CC)                 = 8.0
memoryClockRate (CUDA device query)     = 1.512 GHz
memoryBusWidth (CUDA device query)      = 5120 bits
peak bandwidth                          = 1935.4 GB/s
totalGlobalMem                          = 81069 MiB
multiProcessorCount                     = 108
warpSize                                = 32
clockRate (CUDA device query)           = 1.410 GHz
FP64 FMA/clock per SM                   = 32
FP32 FMA/clock per SM                   = 64
GigaFP64/second per GPU                 = 9745.9
GigaFP32/second per GPU                 = 19491.8
unifiedAddressing                       = 1
managedMemory                           = 1
pageableMemoryAccess                    = 0
pageableMemoryAccessUsesHostPageTables  = 0
concurrentManagedAccess                 = 1
canMapHostMemory                        = 1
jeffhammond commented 1 year ago

I'm not sure where you got the idea that this code computes a DDR-based theoretical peak memory bandwidth. I put code in there to handle the Xavier and Orin AGX platforms correctly, because those have a single LPDDRx memory shared by the CPU and GPU.

zjin-lcf commented 1 year ago

I got it from the code below. I assume that the theoretical peak memory bandwidth of a GPU is related to the type of memory (DDR or HBM). Is that right?

        // 2 for Double Data Rate (https://en.wikipedia.org/wiki/Double_data_rate)
        // 1/8 = 0.125 for bit to byte
        printf("peak bandwidth                          = %.1f GB/s\n", 2 * memoryClock*1.e-6 * memoryBusWidth * 0.125);
jeffhammond commented 1 year ago

That's there because I needed a factor of 2 to make the numbers correct.

zjin-lcf commented 1 year ago

I think the memory bus width implies the type of memory a GPU is connected to. Thanks.

jeffhammond commented 1 year ago

https://en.wikipedia.org/wiki/High_Bandwidth_Memory says that HBM is a DDR memory, which confirms the above.

Each channel interface maintains a 128-bit data bus operating at double data rate (DDR). HBM supports transfer rates of 1 GT/s per pin (transferring 1 bit), yielding an overall package bandwidth of 128 GB/s.

zjin-lcf commented 1 year ago

In our lab, the Xavier platform shows:

Device 0: "Xavier"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    7.2
  Total amount of global memory:                 14907 MBytes (15631482880 bytes)
  (008) Multiprocessors, (064) CUDA Cores/MP:    512 CUDA Cores
  GPU Max Clock rate:                            1377 MHz (1.38 GHz)
  Memory Clock rate:                             1377 Mhz

However, you override the value.

        // Xavier AGX override
        if (major==7 && minor==2) {
            memoryClock = 2133000; // 2.133 GHz in kHz
            printf("memoryClockRate (Xavier AGX LPDDR4x)    = %.3f GHz\n",  memoryClock*1.e-6);
        }
jeffhammond commented 1 year ago

Yes, I did that because CUDA device query is not accurate for AGX systems.

https://info.nvidia.com/rs/156-OFN-742/images/Jetson_AGX_Xavier_New_Era_Autonomous_Machines.pdf says: "16GB 256-bit LPDDR4x @ 2133MHz 137 GB/s."

I have both a Xavier AGX and an Orin AGX system in my home lab, and I can measure bandwidth higher than would be possible if 1.377 GHz were the memory clock.

zjin-lcf commented 1 year ago

Thank you for finding these references. I hope that CUDA device query will eventually show accurate results for these platforms.

jeffhammond commented 1 year ago

I wouldn't count on that. Just use my code; I wrote it specifically so that the query gives the right answers on all of the systems I care about.

zjin-lcf commented 1 year ago

Okay.