sebbbi / perftest

GPU texture/buffer performance tester
MIT License

Some performance bottlenecks for UAV loads #3

Open ash3D opened 7 years ago

ash3D commented 7 years ago

Thanks for the useful tool. I added support for UAV loads in a fork: https://github.com/ash3D/perftest/tree/UAV_load (branch UAV_load). The results turned out to be somewhat slower than SRV loads on NVIDIA Kepler (GeForce GTX 760M). I had previously obtained higher UAV performance compared to SRV under certain conditions in a similar benchmark, so I started experimenting with the shaders and eventually arrived at about a 2X speedup. The things I tried:

The modifications I mentioned also affected SRV performance to some extent, but UAV performance was much more sensitive.

The results ultimately became close to the expected theoretical peak rates of the Kepler architecture. NVIDIA GPUs implement SRV loads in the read-only TMU pipeline, so performance differs from CUDA, which uses the read/write LSU pipeline. It also differs significantly from AMD GCN: on GCN, all 4 of the 32-bit fetch units used for bilinear texture sampling can be utilized for buffer accesses (for wide loads/stores or coalesced 1d access). NVIDIA TMU fetch units are 64-bit beginning with Fermi (it can filter 64-bit RGBA16F textures at full rate), but apparently only 1 of the 4 is used for buffer reads. I observed similar behavior before with GT200, except its fetch units are 32-bit.

UAV accesses are served by the LSU pipeline on NVIDIA GPUs. Kepler has a 2:1 LD/ST-to-TMU ratio, but UAV data is cached in L2 only. Initially, UAV loads were slower than SRV in the benchmark, but after the shader modifications I described above, UAV performance became faster than SRV for invariant loads. The ratio is still not 2X, but close to it. Linear and random UAV load performance varied over a wide range (probably due to the increased L2 access rate) and can be much faster or slower than SRV in different cases. SRV performance is very stable (it is the same for invariant/linear/random reads).

I also tested NVIDIA Fermi2 (GeForce GTX 460) a little. Fermi has an L1 cache for the LSU pipeline (combined with shared memory), so UAV performance turned out to be better: invariant UAV reads are 2X faster than SRV ones. Linear and random UAV read performance is still not as stable as SRV, but much better than on Kepler. Also, Fermi is not subject to the big performance drop for 3d/4d UAV raw loads with a long unrolled loop.

sebbbi commented 7 years ago

Thanks for the detailed info.

CUDA has a different memory model than DirectX/OpenGL/Vulkan (raw data vs. data fetched through resource descriptors). It makes sense that Nvidia has different hardware paths for these. I am just used to AMD's architecture, which is more generic.

The loops should definitely be slightly unrolled, because most GPUs benefit from being able to issue multiple loads at once and then wait for them together. With a plain loop, you issue each load separately and then wait for it. Because this loop doesn't have any ALU work to hide the load latency, it would definitely be better to unroll at least by 2x or 4x. But that obviously increases register pressure, so the results are highly GPU and compiler specific.
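A minimal sketch of what such an unroll might look like (hypothetical HLSL, not the exact perftest kernel; `sourceBuffer`, the iteration count, and the accumulation scheme are assumptions):

```hlsl
// 4x manually unrolled load loop: four independent loads are issued
// back-to-back, then all four results are consumed. The GPU can overlap
// the four memory latencies instead of serializing them one per iteration.
Buffer<float> sourceBuffer;

float unrolledLoadLoop(uint baseAddress)
{
    float accum = 0.0f;
    [loop]
    for (uint i = 0; i < 256; i += 4)
    {
        // Issue all four loads before consuming any result.
        float a = sourceBuffer[baseAddress + i + 0];
        float b = sourceBuffer[baseAddress + i + 1];
        float c = sourceBuffer[baseAddress + i + 2];
        float d = sourceBuffer[baseAddress + i + 3];
        accum += a + b + c + d;
    }
    return accum;
}
```

The register-pressure trade-off is visible here: the temporaries `a`..`d` are live simultaneously, so each doubling of the unroll factor costs additional registers.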

The start address and the address mask are there to prevent the compiler from merging multiple 1d loads into wider 4d loads. This is because I want to benchmark 1d load performance and wide (2d and 4d) load performance separately. If the compiler were allowed to merge them, all linear tests would become wide 4d load tests. This is obviously not the intention.
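A sketch of the masking idea (an assumed form; the actual perftest shaders may differ in names and details): each address is combined with a mask read from a constant buffer, so the compiler cannot prove that consecutive 1d loads are adjacent and therefore must not fuse them into one wide load.

```hlsl
// startAddress and addressMask come from a constant buffer, so their
// values are unknown at compile time. Masking each address prevents the
// compiler from proving that loads i and i+1 touch adjacent locations,
// which would otherwise let it merge four 1d loads into one 4d load.
cbuffer LoadConstants
{
    uint startAddress;
    uint addressMask;
};

ByteAddressBuffer sourceBuffer;

float maskedLoadLoop()
{
    float accum = 0.0f;
    [loop]
    for (uint i = 0; i < 256; ++i)
    {
        uint address = (startAddress + i * 4) & addressMask;
        accum += asfloat(sourceBuffer.Load(address));
    }
    return accum;
}
```

At runtime the mask is set to all ones, so the access pattern is unchanged; only the compiler's ability to reason about it is restricted.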

The big perf changes caused by different masking operators are very strange indeed. ALU should be irrelevant here, because the L1$ latency is much higher than any ALU latency. The memory access pattern doesn't change at all with the masking changes either, so it must be that the compiler optimizes the loops differently based on the different masking. Maybe it partially unrolls some cases, because the loop length (256) is known at compile time. If this test app were DX12 based, we could inspect the Nvidia shader compiler output in the new DX12 PIX to see what's happening.

The length of the loop should not matter for performance (unless it becomes very short), since the test case is designed to fit fully into the compute unit's L1 cache. Also, the group should be wide enough to fulfill all possible coalescing requirements. This is another strange result (for Fermi). Obviously, in real workloads a larger loop size would often mean a larger working set for each group, potentially thrashing the L1$ (depending of course on whether there's more data locality in program order vs. neighboring groups).

DX11 has loose default guarantees of data visibility. Data written by one thread group is not guaranteed to be visible to other thread groups during the same dispatch, unless the "globallycoherent" buffer attribute is used together with a memory barrier instruction. This allows AMD to safely use the L1$ for UAV reads in the common (not globallycoherent) case. This works fine, since a single group always runs fully on the same CU (= single L1 cache). However, UAV writes are a bit more complicated, especially since there's a separate K$ for the scalar units that is not coherent with the L1$. I would assume that the AMD shader compiler marks UAVs that are written, and the driver then makes some pessimistic assumptions. GCN asm also has tags on individual load/store instructions that allow changing the cache protocol.
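The visibility rules described above surface in HLSL roughly like this (illustrative only; buffer names and layout are made up):

```hlsl
// Default UAV: writes are only guaranteed visible within the writing
// thread group, so the driver is free to serve reads from a per-CU L1$.
RWByteAddressBuffer normalUAV;

// globallycoherent UAV: writes must become visible to all groups of the
// dispatch (after a barrier), which forces coherence across CUs (e.g. L2).
globallycoherent RWByteAddressBuffer sharedUAV;

[numthreads(64, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    sharedUAV.Store(tid.x * 4, tid.x);
    // Make the UAV writes visible device-wide before dependent reads.
    DeviceMemoryBarrierWithGroupSync();
    uint value = sharedUAV.Load(tid.x * 4);
    normalUAV.Store(tid.x * 4, value);
}
```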

ash3D commented 7 years ago

GPUs can potentially hide fetch latency even within loops: if the results for the current warp/wavefront are not yet ready, the GPU can switch to another one (AMD GCN is able to keep up to 10 wavefronts in flight per CU). But this is limited by available GPU resources, primarily by GPR count. I am not sure whether that is enough to completely hide the latency in such a synthetic shader; it would be interesting to find out for different GPUs. Real-world shaders with fetch loops can behave differently, because other parts of the shader can require many GPRs, which limits occupancy and thus the GPU's ability to hide latency in the fetch-intensive part.

Aside from fetch latency, there is another potential performance limiter for looped fetches: loop instruction overhead. Modern GPUs have much higher ALU throughput relative to fetch units, so it is less likely to become the bottleneck, but it is worth not ruling it out beforehand, especially considering the presence of the mask operations. A loop also requires a counter, which increases GPR pressure. AMD GCN, though, can use its scalar hardware for loops: it has a dedicated scheduler, ALU and GPRs for scalar operations, so loops should presumably have less impact on GCN.

Results with several loop unroll factors on Kepler:

| test | loop | unroll 2x | unroll 4x | unroll 8x | unroll 16x | unroll 32x | unroll 64x | unroll 128x |
|---|---|---|---|---|---|---|---|---|
| Load1 raw32 SRV invariant | 6.373ms | 8.366ms | 7.591ms | 7.184ms | 6.690ms | 6.525ms | 6.674ms | 6.535ms |
| Load1 raw32 SRV linear | 6.382ms | 7.681ms | 7.246ms | 6.914ms | 6.590ms | 6.604ms | 6.577ms | 6.630ms |
| Load1 raw32 SRV random | 6.380ms | 7.918ms | 7.222ms | 6.957ms | 6.568ms | 6.654ms | 6.680ms | 6.642ms |
| Load2 raw32 SRV invariant | 12.816ms | 14.778ms | 13.632ms | 12.783ms | 12.685ms | 12.825ms | 12.703ms | 12.714ms |
| Load2 raw32 SRV linear | 12.837ms | 13.010ms | 13.432ms | 12.546ms | 12.825ms | 12.923ms | 12.695ms | 12.894ms |
| Load2 raw32 SRV random | 12.856ms | 13.638ms | 12.783ms | 12.685ms | 12.919ms | 12.920ms | 12.715ms | 12.912ms |
| Load3 raw32 SRV invariant | 19.140ms | 20.302ms | 19.381ms | 19.372ms | 18.988ms | 19.043ms | 18.916ms | 19.237ms |
| Load3 raw32 SRV linear | 19.003ms | 19.814ms | 19.621ms | 19.449ms | 18.974ms | 18.926ms | 18.973ms | 19.331ms |
| Load3 raw32 SRV random | 19.104ms | 19.801ms | 19.550ms | 19.713ms | 18.969ms | 19.083ms | 18.973ms | 19.183ms |
| Load4 raw32 SRV invariant | 25.385ms | 24.999ms | 25.819ms | 25.279ms | 25.615ms | 25.077ms | 25.075ms | 25.952ms |
| Load4 raw32 SRV linear | 25.214ms | 26.428ms | 24.990ms | 25.002ms | 25.220ms | 25.143ms | 25.100ms | 25.866ms |
| Load4 raw32 SRV random | 25.450ms | 26.221ms | 25.580ms | 25.088ms | 25.144ms | 25.136ms | 24.981ms | 25.887ms |
| Load2 raw32 SRV unaligned invariant | 12.790ms | 14.737ms | 13.620ms | 12.768ms | 12.660ms | 12.799ms | 12.658ms | 12.705ms |
| Load2 raw32 SRV unaligned linear | 12.839ms | 13.017ms | 13.431ms | 12.552ms | 12.840ms | 12.924ms | 12.693ms | 12.913ms |
| Load2 raw32 SRV unaligned random | 12.872ms | 13.642ms | 12.759ms | 12.684ms | 12.906ms | 12.922ms | 12.704ms | 12.919ms |
| Load4 raw32 SRV unaligned invariant | 25.377ms | 24.984ms | 25.822ms | 25.247ms | 25.629ms | 25.074ms | 25.048ms | 25.937ms |
| Load4 raw32 SRV unaligned linear | 25.228ms | 26.420ms | 24.999ms | 25.025ms | 25.222ms | 25.138ms | 25.097ms | 25.826ms |
| Load4 raw32 SRV unaligned random | 25.446ms | 26.223ms | 25.584ms | 25.058ms | 25.145ms | 25.117ms | 24.979ms | 26.002ms |
| Load1 raw32 UAV invariant | 7.426ms | 15.748ms | 8.515ms | 9.142ms | 8.729ms | 8.225ms | 7.753ms | 7.354ms |
| Load1 raw32 UAV linear | 8.580ms | 13.297ms | 9.067ms | 9.012ms | 9.031ms | 7.998ms | 8.054ms | 8.688ms |
| Load1 raw32 UAV random | 8.142ms | 11.089ms | 8.578ms | 8.653ms | 8.850ms | 8.017ms | 7.784ms | 7.871ms |
| Load2 raw32 UAV invariant | 12.652ms | 15.405ms | 13.237ms | 13.021ms | 12.040ms | 11.824ms | 11.635ms | 12.344ms |
| Load2 raw32 UAV linear | 19.151ms | 22.428ms | 18.591ms | 19.376ms | 19.463ms | 19.624ms | 19.026ms | 19.295ms |
| Load2 raw32 UAV random | 14.627ms | 15.899ms | 13.663ms | 15.008ms | 14.653ms | 14.937ms | 13.770ms | 14.473ms |
| Load3 raw32 UAV invariant | 16.093ms | 18.822ms | 19.332ms | 15.598ms | 17.608ms | 16.624ms | 17.486ms | 16.825ms |
| Load3 raw32 UAV linear | 31.417ms | 31.759ms | 30.891ms | 34.139ms | 32.422ms | 31.837ms | 31.848ms | 33.753ms |
| Load3 raw32 UAV random | 24.703ms | 24.644ms | 24.559ms | 26.153ms | 24.781ms | 24.649ms | 24.844ms | 25.823ms |
| Load4 raw32 UAV invariant | 20.992ms | 22.776ms | 23.374ms | 20.036ms | 22.344ms | 20.589ms | 21.439ms | 21.325ms |
| Load4 raw32 UAV linear | 38.415ms | 41.822ms | 39.110ms | 39.276ms | 39.965ms | 37.856ms | 37.838ms | 43.763ms |
| Load4 raw32 UAV random | 36.847ms | 35.885ms | 40.019ms | 39.353ms | 38.120ms | 37.326ms | 37.838ms | 39.280ms |
| Load2 raw32 UAV unaligned invariant | 12.642ms | 15.420ms | 13.265ms | 13.028ms | 12.024ms | 11.809ms | 11.629ms | 12.349ms |
| Load2 raw32 UAV unaligned linear | 19.148ms | 22.409ms | 18.587ms | 19.379ms | 19.478ms | 19.587ms | 19.021ms | 19.281ms |
| Load2 raw32 UAV unaligned random | 14.633ms | 15.949ms | 13.728ms | 15.169ms | 14.669ms | 14.924ms | 13.776ms | 14.468ms |
| Load4 raw32 UAV unaligned invariant | 21.006ms | 22.771ms | 23.357ms | 20.039ms | 22.323ms | 20.572ms | 21.416ms | 21.302ms |
| Load4 raw32 UAV unaligned linear | 38.396ms | 41.807ms | 39.128ms | 39.139ms | 39.841ms | 37.941ms | 37.765ms | 43.759ms |
| Load4 raw32 UAV unaligned random | 36.810ms | 35.923ms | 40.066ms | 39.351ms | 38.079ms | 37.335ms | 36.161ms | 39.251ms |

I understand the purpose of the address masks, but they seem to slow down performance even for 3d/4d loads in some cases. At the same time, I noticed other cases where the mask is required to prevent merging of scalar loads (SRV texture loads). So for now I have not found a single solution that works well for all cases. The start address offset, which is used to test unaligned loads, is much simpler. Reading it from a cbuffer also slows down performance in some cases, but it is possible to hardcode it in the shader, which solves the problem.
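The hardcoding workaround might look roughly like this (a sketch under assumed names; the real shaders may structure this differently):

```hlsl
ByteAddressBuffer sourceBuffer;

// Hardcoded start offset: the compiler sees a literal constant, so no
// cbuffer read is needed and the addressing can be folded at compile time.
// Testing a different (un)alignment means changing this literal and
// recompiling the shader, instead of feeding the offset in at runtime.
static const uint startAddress = 4;

float hardcodedOffsetLoop()
{
    float accum = 0.0f;
    [loop]
    for (uint i = 0; i < 256; ++i)
    {
        accum += asfloat(sourceBuffer.Load(startAddress + i * 4));
    }
    return accum;
}
```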

I suppose the performance dependency on loop length on Kepler is not related to the access pattern. I initially started experimenting with loop length after I had fully unrolled the loop and noticed poor performance for 3d/4d raw UAV loads. I suspected the reason was instruction cache overflow and tried to reduce the iteration count. But later I observed that loop length also affects performance for the loop without unrolling. Maybe the compiler chooses different optimization strategies regarding loop unrolling or register allocation. This behavior is specific to Kepler; other GPUs behave more predictably. The address mask options also affect how the different loop techniques perform relative to each other.

Some results for raw buffer loads on Kepler:

| test | 256 iterations | 128 iterations | 256 iterations (full unroll) | 128 iterations (full unroll) | 256 iterations (full unroll, no address mask, hardcoded start address) | 128 iterations (full unroll, no address mask, hardcoded start address) |
|---|---|---|---|---|---|---|
| Load1 raw32 SRV invariant | 6.400ms | 3.250ms | 6.375ms | 3.235ms | 6.220ms | 3.160ms |
| Load1 raw32 SRV linear | 6.416ms | 3.231ms | 6.406ms | 3.226ms | 6.415ms | 3.227ms |
| Load1 raw32 SRV random | 6.408ms | 3.257ms | 6.404ms | 3.241ms | 6.393ms | 3.237ms |
| Load2 raw32 SRV invariant | 12.819ms | 6.273ms | 12.836ms | 6.285ms | 12.317ms | 6.198ms |
| Load2 raw32 SRV linear | 12.864ms | 6.298ms | 12.951ms | 6.286ms | 12.609ms | 6.287ms |
| Load2 raw32 SRV random | 12.893ms | 6.296ms | 12.795ms | 6.273ms | 12.634ms | 6.285ms |
| Load3 raw32 SRV invariant | 19.167ms | 9.337ms | 21.096ms | 9.379ms | 18.410ms | 9.228ms |
| Load3 raw32 SRV linear | 19.019ms | 9.425ms | 21.681ms | 9.401ms | 20.362ms | 9.301ms |
| Load3 raw32 SRV random | 19.104ms | 9.420ms | 21.929ms | 9.392ms | 20.259ms | 9.332ms |
| Load4 raw32 SRV invariant | 25.410ms | 12.590ms | 25.907ms | 12.524ms | 25.241ms | 12.279ms |
| Load4 raw32 SRV linear | 25.268ms | 12.602ms | 25.872ms | 12.549ms | 29.409ms | 12.508ms |
| Load4 raw32 SRV random | 25.462ms | 12.662ms | 25.826ms | 12.587ms | 29.306ms | 12.465ms |
| Load2 raw32 SRV unaligned invariant | 12.827ms | 6.280ms | 12.984ms | 6.287ms | 12.312ms | 6.178ms |
| Load2 raw32 SRV unaligned linear | 12.865ms | 6.289ms | 13.026ms | 6.296ms | 12.562ms | 6.291ms |
| Load2 raw32 SRV unaligned random | 12.896ms | 6.299ms | 12.882ms | 6.277ms | 12.557ms | 6.312ms |
| Load4 raw32 SRV unaligned invariant | 25.438ms | 12.569ms | 25.783ms | 12.578ms | 25.328ms | 12.307ms |
| Load4 raw32 SRV unaligned linear | 25.262ms | 12.598ms | 25.662ms | 12.629ms | 29.815ms | 12.499ms |
| Load4 raw32 SRV unaligned random | 25.456ms | 12.648ms | 25.699ms | 12.571ms | 29.230ms | 12.575ms |
| Load1 raw32 UAV invariant | 7.445ms | 2.929ms | 5.877ms | 2.965ms | 3.835ms | 1.897ms |
| Load1 raw32 UAV linear | 8.571ms | 3.659ms | 6.922ms | 3.609ms | 7.002ms | 3.606ms |
| Load1 raw32 UAV random | 8.152ms | 3.247ms | 6.504ms | 3.288ms | 5.810ms | 2.978ms |
| Load2 raw32 UAV invariant | 12.635ms | 6.349ms | 10.008ms | 4.927ms | 8.001ms | 3.738ms |
| Load2 raw32 UAV linear | 19.169ms | 9.749ms | 20.133ms | 10.181ms | 13.367ms | 5.036ms |
| Load2 raw32 UAV random | 14.647ms | 7.167ms | 14.906ms | 7.492ms | 10.048ms | 4.049ms |
| Load3 raw32 UAV invariant | 16.063ms | 8.272ms | 28.251ms | 7.995ms | 12.016ms | 5.590ms |
| Load3 raw32 UAV linear | 31.307ms | 18.860ms | 39.848ms | 18.180ms | 32.767ms | 18.305ms |
| Load3 raw32 UAV random | 24.728ms | 13.587ms | 32.018ms | 12.095ms | 25.325ms | 11.909ms |
| Load4 raw32 UAV invariant | 21.034ms | 10.630ms | 34.702ms | 10.193ms | 17.560ms | 7.844ms |
| Load4 raw32 UAV linear | 38.511ms | 25.450ms | 56.849ms | 25.428ms | 40.164ms | 16.478ms |
| Load4 raw32 UAV random | 36.839ms | 18.898ms | 44.861ms | 20.109ms | 32.801ms | 13.220ms |
| Load2 raw32 UAV unaligned invariant | 12.660ms | 6.347ms | 10.019ms | 4.930ms | 7.834ms | 3.567ms |
| Load2 raw32 UAV unaligned linear | 19.183ms | 9.598ms | 20.152ms | 10.184ms | 13.394ms | 4.907ms |
| Load2 raw32 UAV unaligned random | 14.646ms | 7.184ms | 14.911ms | 7.480ms | 14.328ms | 6.906ms |
| Load4 raw32 UAV unaligned invariant | 21.020ms | 10.793ms | 34.542ms | 10.205ms | 17.393ms | 7.949ms |
| Load4 raw32 UAV unaligned linear | 38.540ms | 25.279ms | 57.208ms | 25.609ms | 40.402ms | 16.881ms |
| Load4 raw32 UAV unaligned random | 36.862ms | 18.877ms | 44.867ms | 19.949ms | 39.201ms | 18.490ms |