te42kyfo / gpu-benches

collection of benchmarks to measure basic GPU capabilities
GNU General Public License v3.0

What's the difference between cuda-l2-cache and gpu-cache benchmarks? #3

Open beginlner opened 1 year ago

te42kyfo commented 1 year ago

I wrote cuda-l2-cache specifically to benchmark the L2 cache bandwidth only. It simulates a scenario where data is read repeatedly by thread blocks on SMs all over the chip. The variable blockRun is set to the total number of simultaneously running thread blocks. Each thread reads N pieces of data in a grid-stride loop, and each piece of data is read by 10000 different thread blocks (see line 57, int blockCount = blockRun * 10000;). By adjusting N, the total data volume (N * blockSize * blockRun * 8 bytes) can be varied, which determines whether the data fits in cache.

Think of it as a grid-stride loop over some data volume, executed by as many thread blocks as can run simultaneously (a 'wave'). Afterwards, 10000 more waves do the same thing, except that each time the assignment of thread blocks to data is different. This benchmark was written for this paper (see Figure 2). The peculiar way in which data is repeatedly read by different thread blocks is due to A100's segmented L2 cache: if the same data were repeatedly read by the same thread block, the measurement would show a higher apparent L2 capacity, because there would be no duplication across segments. With this scheme, the data has to be duplicated, because the reads come from SMs attached to different L2 cache segments.
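In rough pseudo-CUDA, the access pattern looks like this (a sketch only, not the actual benchmark code; the kernel name and the wave-shifted tile mapping are illustrative, while blockRun and N follow the description above):

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch of the cuda-l2-cache access pattern.
// blockRun * 10000 blocks are launched, but only blockRun distinct
// data tiles exist: each tile is re-read by 10000 different thread
// blocks scheduled on SMs all over the chip.
__global__ void l2ReadKernel(const double* __restrict__ data,
                             double* dummySink, int N, int blockRun) {
  int wave = blockIdx.x / blockRun;           // which of the 10000 waves
  int tile = (blockIdx.x + wave) % blockRun;  // tile mapping shifts per wave
  const double* tileData = data + (size_t)tile * blockDim.x * N;

  double sum = 0.0;
  for (int i = 0; i < N; i++)                 // N reads per thread
    sum += tileData[(size_t)i * blockDim.x + threadIdx.x];

  if (sum == 123.456) *dummySink = sum;       // keep the reads alive
}

// Launch: total volume = N * blockSize * blockRun * 8 bytes.
// l2ReadKernel<<<blockRun * 10000, blockSize>>>(data, sink, N, blockRun);
```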

The gpu-cache benchmark is a general cache benchmark for both the L1 and the L2 cache. Because each thread block reads the same data as all the others, the data never falls out of the L2 cache. Even if the data volume exceeds the L2 cache capacity, there is still reuse in the L2 cache across different thread blocks.
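For contrast, a minimal sketch of that shared-data pattern (again illustrative, not the actual gpu-cache source):

```cuda
// Every thread block strides over the *same* buffer, so the working
// set stays resident in L2 (or L1, when it fits) no matter how many
// blocks run.
__global__ void sharedDataKernel(const double* __restrict__ data,
                                 double* dummySink, size_t N) {
  double sum = 0.0;
  for (size_t i = threadIdx.x; i < N; i += blockDim.x)
    sum += data[i];                      // identical addresses in every block
  if (sum == 123.456) *dummySink = sum;  // defeat dead-code elimination
}
```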

guohaoqiang commented 11 months ago

> (quoting te42kyfo's explanation above)

When I run gpu-l2-cache on an H100 PCIe, the bandwidth column looks odd (the L2 bandwidth should be around 7500 GB/s). Do I need to change the code?

(screenshot of gpu-l2-cache output on H100 PCIe, 2023-11-26)
te42kyfo commented 11 months ago

Your results look absolutely in line with what I had measured myself before. Regarding the very high numbers at the beginning: for the first few dataset sizes, there is still some coverage by the 256 kB L1 cache. For example, the 2048 kB data point consists of 8 blocks of 256 kB, so there is a 1-in-8 chance that a thread block runs on an SM where the previous, just-exited thread block worked on the same block of data, which then still resides in the L1 cache. The curve eventually settles at around 6700 GB/s, which is the pure L2 bandwidth.
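To make that reasoning concrete, here is a quick back-of-the-envelope calculation (my own illustration, not code from the benchmark):

```cuda
#include <cstdio>

// With a 256 kB per-block tile, the chance that a fresh thread block
// finds its tile left in L1 by the previous block on that SM is
// roughly tile size / dataset size.
int main() {
  const double tileKB = 256.0;
  const double datasetsKB[] = {512, 1024, 2048, 4096, 8192, 16384};
  for (double d : datasetsKB)
    std::printf("dataset %6.0f kB: P(L1 reuse) ~ 1/%.0f = %.3f\n",
                d, d / tileKB, tileKB / d);
}
```

The reuse probability halves with every doubling of the dataset, which is why the inflated values fade after the first few points.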

For the data in the plot and the included results, I changed the per-thread-block dataset from 256 kB to 512 kB for exactly this reason. That reduces the effect but doesn't eliminate it, so you still should not use the first few values. Instead, use the values right before the dataset drops out of the L2 cache into memory. With 512 kB per thread block, I get 7 TB/s.

te42kyfo commented 11 months ago

The parameters used so far (256 kB) had been fine before, but they don't work as well with the increased L1 cache of the H100. The cache line replacement strategy might also have changed.
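If you want to adapt the sizing yourself, the underlying rule is simple; a hedged sketch (the variable names are my assumptions, not the actual knobs in the source):

```cuda
// Sizing rule behind the 256 kB -> 512 kB change: make the per-block
// tile clearly larger than H100's 256 kB L1, so leftovers from a
// previous block cannot serve a meaningful share of the reads.
const int    blockSize = 1024;                              // threads per block (assumed)
const size_t l1Bytes   = 256 * 1024;                        // H100 L1 per SM
const size_t tileBytes = 2 * l1Bytes;                       // 512 kB per block
const int    N = tileBytes / (blockSize * sizeof(double));  // 64 reads per thread
```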