sebbbi / perftest

GPU texture/buffer performance tester
MIT License

Awesome project #1

Closed boxerab closed 7 years ago

boxerab commented 7 years ago

Thanks for making this available. Very helpful to see hard data backing up (or not) my instinct about memory speed for different image/buffer configurations.

I have a pair of RX 470s if you're interested in data for Polaris arch.

Aaron

sebbbi commented 7 years ago

GCN4 (Polaris) is missing from the list. Please download the newest version and send me the test results :)

boxerab commented 7 years ago

Here you go:

I chose results from the end of the output, to make sure the caches were warmed up.

```
Adapters found:
 0: Radeon (TM) RX 470 Graphics
 1: Microsoft Basic Render Driver
Using adapter 0

Load R8 invariant: 0.443ms
Load R8 linear: 0.448ms
Load R8 random: 1.734ms
Load RG8 invariant: 1.950ms
Load RG8 linear: 1.963ms
Load RG8 random: 1.950ms
Load RGBA8 invariant: 1.734ms
Load RGBA8 linear: 1.761ms
Load RGBA8 random: 1.735ms
Load R16f invariant: 0.439ms
Load R16f linear: 0.439ms
Load R16f random: 1.732ms
Load RG16f invariant: 1.947ms
Load RG16f linear: 1.947ms
Load RG16f random: 1.947ms
Load RGBA16f invariant: 1.731ms
Load RGBA16f linear: 1.734ms
Load RGBA16f random: 2.214ms
Load R32f invariant: 0.438ms
Load R32f linear: 0.440ms
Load R32f random: 1.734ms
Load RG32f invariant: 1.950ms
Load RG32f linear: 1.950ms
Load RG32f random: 1.950ms
Load RGBA32f invariant: 1.734ms
Load RGBA32f linear: 1.735ms
Load RGBA32f random: 1.948ms
Load1 raw32 invariant: 0.538ms
Load1 raw32 linear: 0.449ms
Load1 raw32 random: 1.735ms
Load2 raw32 invariant: 0.851ms
Load2 raw32 linear: 1.948ms
Load2 raw32 random: 1.950ms
Load3 raw32 invariant: 0.665ms
Load3 raw32 linear: 1.735ms
Load3 raw32 random: 3.462ms
Load4 raw32 invariant: 0.881ms
Load4 raw32 linear: 1.735ms
Load4 raw32 random: 1.938ms
Load2 raw32 unaligned invariant: 0.541ms
Load2 raw32 unaligned linear: 1.947ms
Load2 raw32 unaligned random: 2.257ms
Load4 raw32 unaligned invariant: 0.879ms
Load4 raw32 unaligned linear: 1.733ms
Load4 raw32 unaligned random: 2.140ms
Tex2D load R8 invariant: 1.736ms
Tex2D load R8 linear: 1.734ms
Tex2D load R8 random: 2.597ms
Tex2D load RG8 invariant: 1.948ms
Tex2D load RG8 linear: 1.950ms
Tex2D load RG8 random: 2.691ms
Tex2D load RGBA8 invariant: 2.233ms
Tex2D load RGBA8 linear: 1.732ms
Tex2D load RGBA8 random: 2.382ms
Tex2D load R16F invariant: 1.734ms
Tex2D load R16F linear: 1.734ms
Tex2D load R16F random: 2.598ms
Tex2D load RG16F invariant: 1.950ms
Tex2D load RG16F linear: 1.950ms
Tex2D load RG16F random: 2.382ms
Tex2D load RGBA16F invariant: 1.734ms
Tex2D load RGBA16F linear: 1.734ms
Tex2D load RGBA16F random: 3.570ms
Tex2D load R32F invariant: 1.734ms
Tex2D load R32F linear: 1.734ms
Tex2D load R32F random: 2.381ms
Tex2D load RG32F invariant: 2.223ms
Tex2D load RG32F linear: 1.948ms
Tex2D load RG32F random: 3.570ms
Tex2D load RGBA32F invariant: 1.735ms
Tex2D load RGBA32F linear: 2.598ms
Tex2D load RGBA32F random: 3.463ms
```

Surprisingly, Tex2D RGBA32F linear is significantly slower than buffer RGBA32f linear. My rule of thumb has always been to prefer textures over buffers, but it doesn't seem to hold for this card.

boxerab commented 7 years ago

By the way, I work in compute, where there is a lot of data movement, so loading non-cached data is the norm. For compute, these numbers would be quite different, I think. Would it be difficult to run perftest for a compute workflow where the data is never in the L1 cache?

sebbbi commented 7 years ago

Tex2D and Buffer might produce different memory access patterns depending on how the GPU assigns the GroupThreadID to waves/warps. I use 1d thread groups (256,1,1) for linear buffer reads; access is simply buffer[GroupThreadID.x]. For linear texture reads I use (16,16,1) thread groups; access is texture2d[GroupThreadID.xy]. AFAIK AMD assigns threads inside a 2d group to waves in scanline order, meaning that the first 4 rows (4*16 lanes = 64 lanes = one wave) are the first wave, the next 4 rows are the second wave, etc. This is most likely not ideal for memory access patterns.

You can improve the 2d access pattern by taking SV_GroupIndex and morton swizzling its bits to create a 2d coordinate. However, there's no guarantee that threads are assigned to waves/warps in linear scanline order in the first place; apparently some GPUs internally map the threads differently (to improve the access pattern). This doesn't matter much with small 2d thread groups, but larger 2d groups + wide loads most likely hit more L1 cache lines than would be optimal.

I could add a morton swizzled test case to inspect this further.
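
A minimal sketch of what such a morton-swizzled variant could look like (the resource names and the 8x8 tile size are illustrative assumptions, not actual perftest code):

```hlsl
// Hypothetical sketch: derive a 2d coordinate by morton swizzling the bits of
// SV_GroupIndex, so the 64 lanes of a wave cover an 8x8 texel tile instead of
// 16-wide scanline rows.
Texture2D<float4> inputTex;
RWTexture2D<float4> outputTex;

uint2 MortonDecode8x8(uint i)
{
    // Deinterleave the low 6 bits: even bits form x, odd bits form y.
    uint x = (i & 1u)        | ((i >> 1) & 2u) | ((i >> 2) & 4u);
    uint y = ((i >> 1) & 1u) | ((i >> 2) & 2u) | ((i >> 3) & 4u);
    return uint2(x, y);
}

[numthreads(64, 1, 1)]
void CSMain(uint3 groupId : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    // Dispatched as a 2d grid of groups; each group covers one 8x8 texel tile.
    uint2 coord = groupId.xy * 8 + MortonDecode8x8(groupIndex);
    outputTex[coord] = inputTex[coord];
}
```

Whether this helps depends on how the hardware actually packs SV_GroupIndex into waves, which is exactly the uncertainty described above.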

Not all compute shaders behave like bulk memory copies. It's true that some shaders (optimized prefix sums, radix sorters, etc.) work like this, but grid (or quad/octree) based compute algorithms, such as particle/fluid solvers, have high L1 cache utilization. For example, fetching neighbors from a 3d grid is 3x3x3 = 27 loads from the grid; 26 of these fetch neighbors and only one fetches the current cell, which is a ~96.3% L1 cache hit ratio. Octree based algorithms also read a chain of parents that will very likely match neighbor threads' accesses = near 100% cache hit ratio.
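
As a rough illustration of that access pattern (a sketch with assumed resource names, not perftest code):

```hlsl
// Each thread gathers its full 3x3x3 neighborhood from a 3d grid. Adjacent
// threads request almost the same 27 cells, so nearly all loads hit L1.
Texture3D<float> grid;
RWTexture3D<float> result;

[numthreads(4, 4, 4)]
void CSMain(uint3 cell : SV_DispatchThreadID)
{
    float sum = 0.0;
    [unroll] for (int z = -1; z <= 1; ++z)
    [unroll] for (int y = -1; y <= 1; ++y)
    [unroll] for (int x = -1; x <= 1; ++x)
    {
        // 27 loads per thread; out-of-range loads return 0 in D3D.
        sum += grid[uint3(int3(cell) + int3(x, y, z))];
    }
    result[cell] = sum / 27.0;
}
```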

Optimized linear access pattern compute workloads are uninteresting to benchmark, as they are 100% memory bandwidth bound; every GPU handles that case well. The only thing worth investigating is whether narrow loads (R8, RG8, R16) are enough to saturate the GPU bandwidth. Narrow loads are likely bottlenecked by the load instruction issue rate, and wider loads are needed to saturate the bandwidth completely. I think it could be worth adding a single test case to inspect this behavior as well.
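
Something along these lines, for example (a hypothetical test case with assumed resource names, not something that exists in perftest today):

```hlsl
// Stream the same amount of data with narrow vs wide loads, to see whether
// one-channel loads can keep the memory system busy or whether the shader
// becomes limited by load instruction issue rate.
Buffer<float>  narrowInput;   // R32F: one channel per load instruction
Buffer<float4> wideInput;     // RGBA32F: four channels per load instruction
RWStructuredBuffer<float> output;

[numthreads(256, 1, 1)]
void NarrowLoads(uint3 groupId : SV_GroupID, uint3 tid : SV_GroupThreadID)
{
    // Four load instructions per 16 bytes per thread. In each term the 256
    // threads read 256 consecutive floats, so the pattern stays linear.
    uint base = groupId.x * 1024 + tid.x;
    float sum = narrowInput[base] + narrowInput[base + 256]
              + narrowInput[base + 512] + narrowInput[base + 768];
    output[groupId.x * 256 + tid.x] = sum;
}

[numthreads(256, 1, 1)]
void WideLoads(uint3 id : SV_DispatchThreadID)
{
    // One load instruction per 16 bytes: more likely limited by bandwidth.
    float4 v = wideInput[id.x];
    output[id.x] = v.x + v.y + v.z + v.w;
}
```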

I have lately been optimizing a compute workload that writes to a huge R8 volume texture. It's not possible to reach maximum GPU memory bandwidth with R8 memory stores (even with a perfectly linear pattern). You would need coalescing, but AMD doesn't support coalescing for texture loads/stores. It's the same problem on NV and Intel: it's impossible to saturate memory bandwidth with R8 loads/stores.

sebbbi commented 7 years ago

Thanks for the results. I will add them to the main page soon.

boxerab commented 7 years ago

Thanks a lot for the detailed explanation. Reading up on morton swizzling now - was not familiar with that :)

Re: wider loads, yes, it would be interesting to see how bandwidth changes as the width increases, and what the ideal width is. For my own application (a video encoder) on GCN, I've found that 64-bit linear loads are best, but this is mostly intuition from playing around with my kernels, not backed by systematic tests.

As for the lack of texture coalescing, I suppose that since there is a texture cache, the cost of the initial read is offset by subsequent cached reads, but it would certainly be nice to have both a texture cache and coalesced reads.

sebbbi commented 7 years ago

GCN uses the same L1 cache for both buffers and textures. There's no separate texture cache.

There's a special fast path for linear buffer loads and stores: if all 64 threads in a wave access consecutive addresses in memory, you get 4x faster throughput. See this AMD presentation: http://gpuopen.com/gcn-memory-coalescing/

Coalescing of course doesn't make bandwidth 4x faster, but linear access patterns are also good for bandwidth (cache lines get fully read or written, so no wasted bytes are loaded into the caches).

Coalescing also helps with reaching peak memory bandwidth when you use narrow single-channel loads/stores (R8, R16, R16F), which would otherwise be bottlenecked by issue rate instead of bandwidth.
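
To make the addressing part concrete, a hedged sketch (illustrative names, not perftest code) of a pattern that hits the fast path and one that does not:

```hlsl
// The coalescing fast path needs the 64 lanes of a wave to read consecutive
// addresses, as in LinearAccess. StridedAccess touches the same amount of
// data, but every lane lands in a different cache line.
Buffer<uint> input;
RWBuffer<uint> output;

[numthreads(64, 1, 1)]
void LinearAccess(uint3 id : SV_DispatchThreadID)
{
    // Lane N of a wave reads element base + N: consecutive, coalesced.
    output[id.x] = input[id.x];
}

[numthreads(64, 1, 1)]
void StridedAccess(uint3 id : SV_DispatchThreadID)
{
    // Lane N reads element N * 64: addresses 256 bytes apart, not coalesced.
    output[id.x] = input[id.x * 64];
}
```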

boxerab commented 7 years ago

Thanks! Good to know. Someone should really write a book on "The Art of GCN Optimization". I would buy it.

I am about to start writing a kernel where there is not enough LDS memory per work item, so I will need to use global memory as a cache and swap data in and out between LDS and global memory. Using a linear access pattern for that staging traffic should help keep performance from being too terrible.
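
Roughly the idea, as a sketch (all names and sizes here are placeholders, not an actual implementation):

```hlsl
// Stage one tile at a time from global memory into groupshared (LDS), work on
// it there, and write it back using the same linear, wave-coalesced pattern.
#define TILE_SIZE 1024      // elements per tile; assumed to fit in LDS

ByteAddressBuffer   srcData;   // working set too large for LDS as a whole
RWByteAddressBuffer dstData;
groupshared uint tile[TILE_SIZE];

[numthreads(256, 1, 1)]
void CSMain(uint3 groupId : SV_GroupID, uint3 tid : SV_GroupThreadID)
{
    uint tileBase = groupId.x * TILE_SIZE;

    // Coalesced load: in each iteration the 256 threads read 256 consecutive
    // 32-bit values from global memory.
    for (uint i = tid.x; i < TILE_SIZE; i += 256)
        tile[i] = srcData.Load((tileBase + i) * 4);
    GroupMemoryBarrierWithGroupSync();

    // ... process the tile in LDS here ...
    GroupMemoryBarrierWithGroupSync();

    // Coalesced store back to global memory.
    for (uint i = tid.x; i < TILE_SIZE; i += 256)
        dstData.Store((tileBase + i) * 4, tile[i]);
}
```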

boxerab commented 7 years ago

Thanks again!