sebbbi / perftest

GPU texture/buffer performance tester
MIT License

Some performance bottlenecks for UAV loads #3

Open ash3D opened 7 years ago

ash3D commented 7 years ago

Thanks for the useful tool. I added support for UAV loads in a fork: https://github.com/ash3D/perftest/tree/UAV_load (branch UAV_load). The results turned out to be somewhat slower than SRV loads on NVIDIA Kepler (GeForce GTX 760M). I had previously obtained higher UAV performance compared to SRV under certain conditions in a similar benchmark, so I started experimenting with the shaders and eventually arrived at about a 2X speedup. The things I tried:

The modifications I mentioned also affected SRV performance to some extent, but UAV performance was much more sensitive.

The results ultimately became close to the expected theoretical peak rates of the Kepler architecture. NVIDIA GPUs implement SRV loads in the read-only TMU pipeline, so performance differs from CUDA, which uses the read/write LSU pipeline. It also differs significantly from AMD GCN: on GCN, all 4 of the 32-bit fetch units used for bilinear texture sampling can be utilized for buffer accesses (for wide loads/stores or coalesced 1d access). NVIDIA TMU fetch units are 64-bit beginning with Fermi (it can filter 64-bit RGBA16F textures at full rate), but apparently only 1 of the 4 is used for buffer reads. I observed similar behavior before with GT200, except its fetch units are 32-bit.

UAV accesses are served by the LSU pipeline on NVIDIA GPUs. Kepler has a 2:1 LD/ST-to-TMU ratio, but UAV data is cached in L2 only. Initially, UAV loads were slower than SRV in the benchmark, but after the shader modifications I described above, UAV performance became faster than SRV for invariant loads. The ratio is still not 2X, but close to it. Linear and random UAV load performance varied over a wide range (probably due to the increased L2 access rate) and can be much faster or slower than SRV in different cases. SRV performance is very stable (it is the same for invariant/linear/random reads).

I also tested NVIDIA Fermi2 (GeForce GTX 460) a little. Fermi has an L1 cache for the LSU pipeline (combined with shared memory), so UAV performance turned out to be better: invariant UAV reads are 2X faster than SRV ones. Linear and random UAV read performance is still not as stable as SRV, but much better than on Kepler. Also, Fermi is not subject to the big performance drop for 3d/4d UAV raw loads with a long unrolled loop.

sebbbi commented 7 years ago

Thanks for the detailed info.

CUDA has a different memory model than DirectX/OpenGL/Vulkan (raw data vs. data fetched through resource descriptors). It makes sense that Nvidia has different hardware paths for these. I am just used to AMD's architecture, which is more generic.

The loops should definitely be slightly unrolled, because most GPUs benefit from being able to issue multiple loads at once and then wait for them together. With a plain loop, you issue each load separately and then wait for it. Because this loop doesn't have any ALU work to hide the load latency, it would definitely be better to unroll at least by 2x or 4x. But that obviously increases register pressure, so the results are highly GPU and compiler specific.
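A minimal sketch of what such an unroll might look like (hypothetical HLSL, not the exact perftest kernel; `sourceBuffer`, the iteration count, and the accumulation scheme are assumptions):

```hlsl
// 4x manually unrolled load loop: four independent loads are issued
// back-to-back, then all four results are consumed. The GPU can overlap
// the four memory latencies instead of serializing them one per iteration.
Buffer<float> sourceBuffer;

float unrolledLoadLoop(uint baseAddress)
{
    float accum = 0.0f;
    [loop]
    for (uint i = 0; i < 256; i += 4)
    {
        // Issue all four loads before consuming any result.
        float a = sourceBuffer[baseAddress + i + 0];
        float b = sourceBuffer[baseAddress + i + 1];
        float c = sourceBuffer[baseAddress + i + 2];
        float d = sourceBuffer[baseAddress + i + 3];
        accum += a + b + c + d;
    }
    return accum;
}
```

The register-pressure trade-off is visible here: the temporaries `a`..`d` are live simultaneously, so each doubling of the unroll factor costs additional registers.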

The start address and the address mask are there to prevent the compiler from merging multiple 1d loads into wider 4d loads. This is because I want to benchmark 1d load performance and wide (2d and 4d) load performance separately. If the compiler were allowed to merge them, all linear tests would become wide 4d load tests. This is obviously not the intention.
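A sketch of the masking idea (an assumed form; the actual perftest shaders may differ in names and details): each address is combined with a mask read from a constant buffer, so the compiler cannot prove that consecutive 1d loads are adjacent and therefore must not fuse them into one wide load.

```hlsl
// startAddress and addressMask come from a constant buffer, so their
// values are unknown at compile time. Masking each address prevents the
// compiler from proving that loads i and i+1 touch adjacent locations,
// which would otherwise let it merge four 1d loads into one 4d load.
cbuffer LoadConstants
{
    uint startAddress;
    uint addressMask;
};

ByteAddressBuffer sourceBuffer;

float maskedLoadLoop()
{
    float accum = 0.0f;
    [loop]
    for (uint i = 0; i < 256; ++i)
    {
        uint address = (startAddress + i * 4) & addressMask;
        accum += asfloat(sourceBuffer.Load(address));
    }
    return accum;
}
```

At runtime the mask is set to all ones, so the access pattern is unchanged; only the compiler's ability to reason about it is restricted.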

The big perf changes caused by different masking operators are very strange indeed. ALU should be irrelevant here, because the L1$ latency is much higher than any ALU latency. The memory access pattern doesn't change at all with the masking changes either, so it must be that the compiler optimizes the loops differently based on the different masking. Maybe it partially unrolls some cases, because the loop length (256) is known at compile time. If this test app were DX12 based, we could inspect the Nvidia shader compiler output in the new DX12 PIX to see what's happening.

The length of the loop should not matter for performance (unless it becomes very short), since the test case is designed to fit fully into the compute unit's L1 cache. Also, the group should be wide enough to fulfill all possible coalescing requirements. This is another strange result (for Fermi). Obviously, in real workloads a larger loop size would often mean a larger working set for each group, potentially thrashing the L1$ (depending of course on whether there's more data locality in program order vs. neighboring groups).

DX11 has loose default guarantees of data visibility. Data written by one thread group is not guaranteed to be visible to other thread groups during the same dispatch, unless the "globallycoherent" buffer attribute is used together with a memory barrier instruction. This allows AMD to safely use the L1$ for UAV reads in the common (not globallycoherent) case. This works fine, since a single group always runs fully on the same CU (= single L1 cache). However, UAV writes are a bit more complicated, especially since there's a separate K$ for the scalar units that is not coherent with the L1$. I would assume that the AMD shader compiler marks UAVs that are written, and the driver then makes some pessimistic assumptions. GCN asm also has tags on individual load/store instructions that allow changing the cache protocol.
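The visibility rules described above surface in HLSL roughly like this (illustrative only; buffer names and layout are made up):

```hlsl
// Default UAV: writes are only guaranteed visible within the writing
// thread group, so the driver is free to serve reads from a per-CU L1$.
RWByteAddressBuffer normalUAV;

// globallycoherent UAV: writes must become visible to all groups of the
// dispatch (after a barrier), which forces coherence across CUs (e.g. L2).
globallycoherent RWByteAddressBuffer sharedUAV;

[numthreads(64, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    sharedUAV.Store(tid.x * 4, tid.x);
    // Make the UAV writes visible device-wide before dependent reads.
    DeviceMemoryBarrierWithGroupSync();
    uint value = sharedUAV.Load(tid.x * 4);
    normalUAV.Store(tid.x * 4, value);
}
```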

ash3D commented 7 years ago

GPUs can potentially hide fetch latency even within loops: if the results for the current warp/wavefront are not yet ready, the GPU can switch to another one (AMD GCN is able to keep up to 10 wavefronts in flight per CU). But this is limited by available GPU resources, primarily by GPR count. I am not sure whether that is enough to completely hide the latency in such a synthetic shader; it would be interesting to find out for different GPUs. Real-world shaders with fetch loops can behave differently, because other parts of the shader can require many GPRs, which limits occupancy and thus the GPU's ability to hide latency in the fetch-intensive part.

Aside from fetch latency, there is another potential performance limiter for looped fetches: loop instruction overhead. Modern GPUs have much higher ALU throughput relative to fetch units, so it is less likely to become the bottleneck, but it is worth not ruling it out beforehand, especially considering the presence of the mask operations. A loop also requires a counter, which increases GPR pressure. AMD GCN, though, can use its scalar hardware for loops: it has a dedicated scheduler, ALU and GPRs for scalar operations, so loops should presumably have less impact on GCN.

Results with several loop unroll factors on Kepler:

| test | loop | unroll 2x | unroll 4x | unroll 8x | unroll 16x | unroll 32x | unroll 64x | unroll 128x |
|---|---|---|---|---|---|---|---|---|
| Load1 raw32 SRV invariant | 6.373ms | 8.366ms | 7.591ms | 7.184ms | 6.690ms | 6.525ms | 6.674ms | 6.535ms |
| Load1 raw32 SRV linear | 6.382ms | 7.681ms | 7.246ms | 6.914ms | 6.590ms | 6.604ms | 6.577ms | 6.630ms |
| Load1 raw32 SRV random | 6.380ms | 7.918ms | 7.222ms | 6.957ms | 6.568ms | 6.654ms | 6.680ms | 6.642ms |
| Load2 raw32 SRV invariant | 12.816ms | 14.778ms | 13.632ms | 12.783ms | 12.685ms | 12.825ms | 12.703ms | 12.714ms |
| Load2 raw32 SRV linear | 12.837ms | 13.010ms | 13.432ms | 12.546ms | 12.825ms | 12.923ms | 12.695ms | 12.894ms |
| Load2 raw32 SRV random | 12.856ms | 13.638ms | 12.783ms | 12.685ms | 12.919ms | 12.920ms | 12.715ms | 12.912ms |
| Load3 raw32 SRV invariant | 19.140ms | 20.302ms | 19.381ms | 19.372ms | 18.988ms | 19.043ms | 18.916ms | 19.237ms |
| Load3 raw32 SRV linear | 19.003ms | 19.814ms | 19.621ms | 19.449ms | 18.974ms | 18.926ms | 18.973ms | 19.331ms |
| Load3 raw32 SRV random | 19.104ms | 19.801ms | 19.550ms | 19.713ms | 18.969ms | 19.083ms | 18.973ms | 19.183ms |
| Load4 raw32 SRV invariant | 25.385ms | 24.999ms | 25.819ms | 25.279ms | 25.615ms | 25.077ms | 25.075ms | 25.952ms |
| Load4 raw32 SRV linear | 25.214ms | 26.428ms | 24.990ms | 25.002ms | 25.220ms | 25.143ms | 25.100ms | 25.866ms |
| Load4 raw32 SRV random | 25.450ms | 26.221ms | 25.580ms | 25.088ms | 25.144ms | 25.136ms | 24.981ms | 25.887ms |
| Load2 raw32 SRV unaligned invariant | 12.790ms | 14.737ms | 13.620ms | 12.768ms | 12.660ms | 12.799ms | 12.658ms | 12.705ms |
| Load2 raw32 SRV unaligned linear | 12.839ms | 13.017ms | 13.431ms | 12.552ms | 12.840ms | 12.924ms | 12.693ms | 12.913ms |
| Load2 raw32 SRV unaligned random | 12.872ms | 13.642ms | 12.759ms | 12.684ms | 12.906ms | 12.922ms | 12.704ms | 12.919ms |
| Load4 raw32 SRV unaligned invariant | 25.377ms | 24.984ms | 25.822ms | 25.247ms | 25.629ms | 25.074ms | 25.048ms | 25.937ms |
| Load4 raw32 SRV unaligned linear | 25.228ms | 26.420ms | 24.999ms | 25.025ms | 25.222ms | 25.138ms | 25.097ms | 25.826ms |
| Load4 raw32 SRV unaligned random | 25.446ms | 26.223ms | 25.584ms | 25.058ms | 25.145ms | 25.117ms | 24.979ms | 26.002ms |
| Load1 raw32 UAV invariant | 7.426ms | 15.748ms | 8.515ms | 9.142ms | 8.729ms | 8.225ms | 7.753ms | 7.354ms |
| Load1 raw32 UAV linear | 8.580ms | 13.297ms | 9.067ms | 9.012ms | 9.031ms | 7.998ms | 8.054ms | 8.688ms |
| Load1 raw32 UAV random | 8.142ms | 11.089ms | 8.578ms | 8.653ms | 8.850ms | 8.017ms | 7.784ms | 7.871ms |
| Load2 raw32 UAV invariant | 12.652ms | 15.405ms | 13.237ms | 13.021ms | 12.040ms | 11.824ms | 11.635ms | 12.344ms |
| Load2 raw32 UAV linear | 19.151ms | 22.428ms | 18.591ms | 19.376ms | 19.463ms | 19.624ms | 19.026ms | 19.295ms |
| Load2 raw32 UAV random | 14.627ms | 15.899ms | 13.663ms | 15.008ms | 14.653ms | 14.937ms | 13.770ms | 14.473ms |
| Load3 raw32 UAV invariant | 16.093ms | 18.822ms | 19.332ms | 15.598ms | 17.608ms | 16.624ms | 17.486ms | 16.825ms |
| Load3 raw32 UAV linear | 31.417ms | 31.759ms | 30.891ms | 34.139ms | 32.422ms | 31.837ms | 31.848ms | 33.753ms |
| Load3 raw32 UAV random | 24.703ms | 24.644ms | 24.559ms | 26.153ms | 24.781ms | 24.649ms | 24.844ms | 25.823ms |
| Load4 raw32 UAV invariant | 20.992ms | 22.776ms | 23.374ms | 20.036ms | 22.344ms | 20.589ms | 21.439ms | 21.325ms |
| Load4 raw32 UAV linear | 38.415ms | 41.822ms | 39.110ms | 39.276ms | 39.965ms | 37.856ms | 37.838ms | 43.763ms |
| Load4 raw32 UAV random | 36.847ms | 35.885ms | 40.019ms | 39.353ms | 38.120ms | 37.326ms | 37.838ms | 39.280ms |
| Load2 raw32 UAV unaligned invariant | 12.642ms | 15.420ms | 13.265ms | 13.028ms | 12.024ms | 11.809ms | 11.629ms | 12.349ms |
| Load2 raw32 UAV unaligned linear | 19.148ms | 22.409ms | 18.587ms | 19.379ms | 19.478ms | 19.587ms | 19.021ms | 19.281ms |
| Load2 raw32 UAV unaligned random | 14.633ms | 15.949ms | 13.728ms | 15.169ms | 14.669ms | 14.924ms | 13.776ms | 14.468ms |
| Load4 raw32 UAV unaligned invariant | 21.006ms | 22.771ms | 23.357ms | 20.039ms | 22.323ms | 20.572ms | 21.416ms | 21.302ms |
| Load4 raw32 UAV unaligned linear | 38.396ms | 41.807ms | 39.128ms | 39.139ms | 39.841ms | 37.941ms | 37.765ms | 43.759ms |
| Load4 raw32 UAV unaligned random | 36.810ms | 35.923ms | 40.066ms | 39.351ms | 38.079ms | 37.335ms | 36.161ms | 39.251ms |

I understand the purpose of the address masks, but they seem to slow down performance even for 3d/4d loads in some cases. At the same time, I noticed other cases where the mask is required to prevent merging of scalar loads (SRV texture loads). So for now I have not found a single solution that works well for all cases. The start address offset, which is used to test unaligned loads, is much simpler. Reading it from a cbuffer also slows down performance in some cases, but it is possible to hardcode it in the shader, which solves the problem.
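The hardcoding workaround might look roughly like this (a sketch under assumed names; the real shaders may structure this differently):

```hlsl
ByteAddressBuffer sourceBuffer;

// Hardcoded start offset: the compiler sees a literal constant, so no
// cbuffer read is needed and the addressing can be folded at compile time.
// Testing a different (un)alignment means changing this literal and
// recompiling the shader, instead of feeding the offset in at runtime.
static const uint startAddress = 4;

float hardcodedOffsetLoop()
{
    float accum = 0.0f;
    [loop]
    for (uint i = 0; i < 256; ++i)
    {
        accum += asfloat(sourceBuffer.Load(startAddress + i * 4));
    }
    return accum;
}
```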

I suppose the performance dependency on loop length on Kepler is not related to the access pattern. I initially started experimenting with loop length after I had fully unrolled the loop and noticed poor performance for 3d/4d raw UAV loads. I suspected the reason was instruction cache overflow and tried to reduce the iteration count. But later I observed that loop length also affects performance for the loop without unrolling. Maybe the compiler chooses different optimization strategies regarding loop unrolling or register allocation. This behavior is specific to Kepler; other GPUs behave more predictably. The address mask options also affect how the different loop techniques perform relative to each other.

Some results for raw buffer loads on Kepler:

| test | 256 iterations | 128 iterations | 256 iterations (full unroll) | 128 iterations (full unroll) | 256 iterations (full unroll, no address mask, hardcoded start address) | 128 iterations (full unroll, no address mask, hardcoded start address) |
|---|---|---|---|---|---|---|
| Load1 raw32 SRV invariant | 6.400ms | 3.250ms | 6.375ms | 3.235ms | 6.220ms | 3.160ms |
| Load1 raw32 SRV linear | 6.416ms | 3.231ms | 6.406ms | 3.226ms | 6.415ms | 3.227ms |
| Load1 raw32 SRV random | 6.408ms | 3.257ms | 6.404ms | 3.241ms | 6.393ms | 3.237ms |
| Load2 raw32 SRV invariant | 12.819ms | 6.273ms | 12.836ms | 6.285ms | 12.317ms | 6.198ms |
| Load2 raw32 SRV linear | 12.864ms | 6.298ms | 12.951ms | 6.286ms | 12.609ms | 6.287ms |
| Load2 raw32 SRV random | 12.893ms | 6.296ms | 12.795ms | 6.273ms | 12.634ms | 6.285ms |
| Load3 raw32 SRV invariant | 19.167ms | 9.337ms | 21.096ms | 9.379ms | 18.410ms | 9.228ms |
| Load3 raw32 SRV linear | 19.019ms | 9.425ms | 21.681ms | 9.401ms | 20.362ms | 9.301ms |
| Load3 raw32 SRV random | 19.104ms | 9.420ms | 21.929ms | 9.392ms | 20.259ms | 9.332ms |
| Load4 raw32 SRV invariant | 25.410ms | 12.590ms | 25.907ms | 12.524ms | 25.241ms | 12.279ms |
| Load4 raw32 SRV linear | 25.268ms | 12.602ms | 25.872ms | 12.549ms | 29.409ms | 12.508ms |
| Load4 raw32 SRV random | 25.462ms | 12.662ms | 25.826ms | 12.587ms | 29.306ms | 12.465ms |
| Load2 raw32 SRV unaligned invariant | 12.827ms | 6.280ms | 12.984ms | 6.287ms | 12.312ms | 6.178ms |
| Load2 raw32 SRV unaligned linear | 12.865ms | 6.289ms | 13.026ms | 6.296ms | 12.562ms | 6.291ms |
| Load2 raw32 SRV unaligned random | 12.896ms | 6.299ms | 12.882ms | 6.277ms | 12.557ms | 6.312ms |
| Load4 raw32 SRV unaligned invariant | 25.438ms | 12.569ms | 25.783ms | 12.578ms | 25.328ms | 12.307ms |
| Load4 raw32 SRV unaligned linear | 25.262ms | 12.598ms | 25.662ms | 12.629ms | 29.815ms | 12.499ms |
| Load4 raw32 SRV unaligned random | 25.456ms | 12.648ms | 25.699ms | 12.571ms | 29.230ms | 12.575ms |
| Load1 raw32 UAV invariant | 7.445ms | 2.929ms | 5.877ms | 2.965ms | 3.835ms | 1.897ms |
| Load1 raw32 UAV linear | 8.571ms | 3.659ms | 6.922ms | 3.609ms | 7.002ms | 3.606ms |
| Load1 raw32 UAV random | 8.152ms | 3.247ms | 6.504ms | 3.288ms | 5.810ms | 2.978ms |
| Load2 raw32 UAV invariant | 12.635ms | 6.349ms | 10.008ms | 4.927ms | 8.001ms | 3.738ms |
| Load2 raw32 UAV linear | 19.169ms | 9.749ms | 20.133ms | 10.181ms | 13.367ms | 5.036ms |
| Load2 raw32 UAV random | 14.647ms | 7.167ms | 14.906ms | 7.492ms | 10.048ms | 4.049ms |
| Load3 raw32 UAV invariant | 16.063ms | 8.272ms | 28.251ms | 7.995ms | 12.016ms | 5.590ms |
| Load3 raw32 UAV linear | 31.307ms | 18.860ms | 39.848ms | 18.180ms | 32.767ms | 18.305ms |
| Load3 raw32 UAV random | 24.728ms | 13.587ms | 32.018ms | 12.095ms | 25.325ms | 11.909ms |
| Load4 raw32 UAV invariant | 21.034ms | 10.630ms | 34.702ms | 10.193ms | 17.560ms | 7.844ms |
| Load4 raw32 UAV linear | 38.511ms | 25.450ms | 56.849ms | 25.428ms | 40.164ms | 16.478ms |
| Load4 raw32 UAV random | 36.839ms | 18.898ms | 44.861ms | 20.109ms | 32.801ms | 13.220ms |
| Load2 raw32 UAV unaligned invariant | 12.660ms | 6.347ms | 10.019ms | 4.930ms | 7.834ms | 3.567ms |
| Load2 raw32 UAV unaligned linear | 19.183ms | 9.598ms | 20.152ms | 10.184ms | 13.394ms | 4.907ms |
| Load2 raw32 UAV unaligned random | 14.646ms | 7.184ms | 14.911ms | 7.480ms | 14.328ms | 6.906ms |
| Load4 raw32 UAV unaligned invariant | 21.020ms | 10.793ms | 34.542ms | 10.205ms | 17.393ms | 7.949ms |
| Load4 raw32 UAV unaligned linear | 38.540ms | 25.279ms | 57.208ms | 25.609ms | 40.402ms | 16.881ms |
| Load4 raw32 UAV unaligned random | 36.862ms | 18.877ms | 44.867ms | 19.949ms | 39.201ms | 18.490ms |