
Code sinking pessimizes common GPU code patterns #43130

Open cwabbott0 opened 5 years ago

cwabbott0 commented 5 years ago
Bugzilla Link 43785
Version unspecified
OS Linux
CC @hfinkel,@jayfoad,@arsenm,@nhaehnle,@trenouf

Extended Description

In GPU shaders used by games, it's pretty common to average a bunch of texture samples in a loop in the fragment shader, e.g. for blurring an image:

    vec4 sample = vec4(0);
    for (int i = 0; i < NUM_SAMPLES; i++)
        sample += texture(...);

    ...
    ... = sample;

The problem happens after we unroll this loop, especially when doing very aggressive unrolling (as is usually beneficial on GPUs) and NUM_SAMPLES is very large (say, 8 to 16). If there's any intervening control flow, then LLVM's code sinking pass will try to pull the entire unrolled loop downwards, but since texture() is convergent, it can't, so it leaves just the texture operations:

    sample1 = texture(...);
    sample2 = texture(...);
    ...
    sampleN = texture(...);

    ...
    ... = sample1 + sample2 + ... + sampleN;

And now all of the samples are live at the same time, which results in a huge increase in register pressure. We noticed this when switching RadeonSI to NIR: once loop unrolling started happening before LLVM, a few shaders in Civilization: Beyond Earth began to spill.
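For concreteness, a minimal GLSL sketch of the kind of shader being described (the sampler, offsets, sample count, and branch are invented for illustration):

    #version 450
    layout(binding = 0) uniform sampler2D tex;
    layout(location = 0) in vec2 uv;
    layout(location = 0) out vec4 color;

    const int NUM_SAMPLES = 16;

    void main() {
        vec4 sample_sum = vec4(0);
        for (int i = 0; i < NUM_SAMPLES; i++)
            sample_sum += texture(tex, uv + vec2(float(i) * 0.01, 0.0));

        // Intervening control flow between the loop and the final use.
        // After aggressive unrolling, code sinking moves the adds below this
        // branch toward the use, but the convergent texture() calls stay put,
        // so all NUM_SAMPLES results are live across the branch.
        float scale = 1.0;
        if (uv.y > 0.5)
            scale = 0.5;

        color = sample_sum * scale / float(NUM_SAMPLES);
    }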

I'm not entirely sure what to do here. Code sinking can definitely be beneficial in some cases for register pressure, and disabling it is largely a wash. We could just disable it in LLVM for AMDGPU/Mesa and do something much less aggressive in our stack, although this is a problem that likely affects other non-Mesa GPU stacks using LLVM.

jayfoad commented 5 years ago

The default machine scheduler does consider register pressure and latency and tries to cluster memory operations, so it has all the information it needs to do a good job here

Sorry, I take that back. I have only just noticed that "add" instructions are sunk into a different basic block, so there's no way a per-basic-block scheduler can move them back again.

jayfoad commented 5 years ago

These sorts of trade-offs are normally made in the scheduler

The default machine scheduler does consider register pressure and latency and tries to cluster memory operations, so it has all the information it needs to do a good job here, but I don't think it's sophisticated enough to make really good trade-offs. My gut feeling is that this is because it's a list scheduler, so it's always making local greedy decisions, and doesn't have a good way to balance their impact over a whole (large) basic block like yours.

Have you tried any alternative schedulers, e.g. with -mattr=si-scheduler?

nhaehnle commented 5 years ago

Leaving the convergent issue aside, there could be other factors such as memory aliasing preventing the texture loads from being moved below the branch.

So it does seem that either the earlier code sinking should be register-pressure aware, or (if we consider the sinking to be a kind of canonicalization) we need some cross-basic-block scheduling later which helps clean this situation up again.

We have run into examples in the past where being able to move a memory instruction across if-statements would have been beneficial to get more latency hiding...

cwabbott0 commented 5 years ago

This is a really interesting problem; take this with a grain of salt because I'm just basing this on your description, but in theory, I think that you want some mechanism to "spill into a reduction", where you recognize that, when spilling, the result is just going into a reduction, and perform the part of the reduction at the spill point instead of actually saving the value itself.

So, it's a bit more complicated than that. First off, this can cause problems even when it isn't so bad that you start spilling, since on AMDGPU and probably other GPU architectures, using more registers means that you have fewer threads running in parallel.

That's correct. On most GPUs, you have some number of registers per thread that can be used at maximum occupancy, and beyond that, using more registers hurts occupancy. The trade-offs here are complicated, and we might be able to do a better job generally, but that seems independent of this problem.

Of course, scheduling memory reads in parallel also helps, but only up to a point. These sorts of trade-offs are normally made in the scheduler, so having another component also try to make them might not be a great idea, or might be pretty hard to pull off.

I don't understand what you mean, exactly, but if you're saying that for this to work then the register-pressure heuristic in the scheduler would need to be aware of it, then that makes sense to me. It's not clear to me that it's hard relative to other solutions here, but it would be some work. The good news is, however, that live ranges that end in reductions are easy to identify locally.

By "live ranges that end in reductions" you mean instructions where all the sources are killed? We could certainly try to hoist those up, and such a pass would probably help with this. And yes, what I meant is that the scheduler should be aware that it's possible to have both the original, low-register-pressure schedule and a more aggressive schedule where we execute more texture sample instructions in parallel, so that it can make the choice between them based off its own heuristics. If this hypothetical hoisting pass sees something like:

    foo = texture(...)
    bar = texture(...)
    ...
    // in a different basic block
    baz = foo + bar

and doesn't hoist the addition up, then, because the scheduler doesn't work across basic blocks, it forces RA to allocate different registers for foo and bar. In effect it has decided (in this instance) to prioritize latency hiding over register pressure, without the sort of information the scheduler has to decide whether that's a good idea. We can't do it as part of spilling for the same reason.

Okay. What's the property of the sample intrinsic that allows the sinking even if it is convergent?

Sorry, I think I misunderstood your original proposal the first time, but for the sake of completeness: I was just referring to the fact that, for convergent operations, it's still possible to sink them across uniform branches. That is, for something like:

    foo = texture(...)
    if (cond) {
        ... = foo;
    }

it's possible to sink the texture down to its use if (and only if) "cond" is the same for all threads. We probably can't start to do things like this until something like D68994 lands and the semantics of "convergent" are put on a more solid footing though.
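As a rough GLSL sketch of the legal, sunk form when the condition is uniform (the sampler, the uniform block, and cond are invented for illustration):

    #version 450
    layout(binding = 0) uniform sampler2D tex;
    layout(binding = 1) uniform Params { bool cond; };  // uniform, so the branch is non-divergent
    layout(location = 0) in vec2 uv;
    layout(location = 0) out vec4 color;

    void main() {
        color = vec4(0);
        if (cond) {
            // The convergent texture() has been sunk into the branch; this is
            // only valid because every invocation evaluates cond identically.
            color = texture(tex, uv);
        }
    }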

hfinkel commented 5 years ago

This is a really interesting problem; take this with a grain of salt because I'm just basing this on your description, but in theory, I think that you want some mechanism to "spill into a reduction", where you recognize that, when spilling, the result is just going into a reduction, and perform the part of the reduction at the spill point instead of actually saving the value itself.

So, it's a bit more complicated than that. First off, this can cause problems even when it isn't so bad that you start spilling, since on AMDGPU and probably other GPU architectures, using more registers means that you have fewer threads running in parallel.

That's correct. On most GPUs, you have some number of registers per thread that can be used at maximum occupancy, and beyond that, using more registers hurts occupancy. The trade-offs here are complicated, and we might be able to do a better job generally, but that seems independent of this problem.

Of course, scheduling memory reads in parallel also helps, but only up to a point. These sorts of trade-offs are normally made in the scheduler, so having another component also try to make them might not be a great idea, or might be pretty hard to pull off.

I don't understand what you mean, exactly, but if you're saying that for this to work then the register-pressure heuristic in the scheduler would need to be aware of it, then that makes sense to me. It's not clear to me that it's hard relative to other solutions here, but it would be some work. The good news is, however, that live ranges that end in reductions are easy to identify locally.

Secondly, what you're describing might not be possible. In my example, code sinking is being conservative and refusing to sink the sample intrinsics, but if the use is inside divergent control flow, then you legitimately can't sink them down. And if we add the infrastructure to make it possible to determine when we can rematerialize the texture sample wrt convergence, then code sinking could use the same logic to determine when it can be sunk, and then we could fix those cases by making sinking more aggressive anyways.

Okay. What's the property of the sample intrinsic that allows the sinking even if it is convergent?

cwabbott0 commented 5 years ago

This is a really interesting problem; take this with a grain of salt because I'm just basing this on your description, but in theory, I think that you want some mechanism to "spill into a reduction", where you recognize that, when spilling, the result is just going into a reduction, and perform the part of the reduction at the spill point instead of actually saving the value itself.

So, it's a bit more complicated than that. First off, this can cause problems even when it isn't so bad that you start spilling, since on AMDGPU and probably other GPU architectures, using more registers means that you have fewer threads running in parallel. Of course, scheduling memory reads in parallel also helps, but only up to a point. These sorts of trade-offs are normally made in the scheduler, so having another component also try to make them might not be a great idea, or might be pretty hard to pull off.

Secondly, what you're describing might not be possible. In my example, code sinking is being conservative and refusing to sink the sample intrinsics, but if the use is inside divergent control flow, then you legitimately can't sink them down. And if we add the infrastructure to make it possible to determine when we can rematerialize the texture sample wrt convergence, then code sinking could use the same logic to determine when it can be sunk, and then we could fix those cases by making sinking more aggressive anyways.

hfinkel commented 5 years ago

This is a really interesting problem; take this with a grain of salt because I'm just basing this on your description, but in theory, I think that you want some mechanism to "spill into a reduction", where you recognize that, when spilling, the result is just going into a reduction, and perform the part of the reduction at the spill point instead of actually saving the value itself.
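At the source level, the effect of such a rewrite would be roughly the following (a sketch with invented names; the real transformation would happen on the unrolled IR at spill points rather than in GLSL):

    #version 450
    layout(binding = 0) uniform sampler2D tex;
    layout(location = 0) in vec2 uv;
    layout(location = 0) out vec4 color;

    void main() {
        // Post-sinking shape: every sample result stays live until the final sum.
        vec4 s0 = texture(tex, uv);
        vec4 s1 = texture(tex, uv + vec2(0.01, 0.0));
        vec4 s2 = texture(tex, uv + vec2(0.02, 0.0));
        vec4 wide = s0 + s1 + s2;

        // "Spill into a reduction" shape: a value that would otherwise be
        // spilled is folded into the accumulator right away, so only the
        // accumulator stays live.
        vec4 acc = texture(tex, uv);
        acc += texture(tex, uv + vec2(0.01, 0.0));
        acc += texture(tex, uv + vec2(0.02, 0.0));

        color = (wide + acc) * 0.5;  // both forms compute the same value
    }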

cwabbott0 commented 5 years ago

LLVM file

cwabbott0 commented 5 years ago

original GLSL source

I made a simplified version of the shader that demonstrates the spilling on AMDGPU and attached both the GLSL source and the .ll file (with the loop already unrolled). This was the easiest for me to do, but if someone wants to, they can probably rewrite the .ll file to make LLVM unroll it instead.

nhaehnle commented 5 years ago

That sounds plausible. Do you have an .ll example that shows the problem?