llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.32k stars 12.11k forks source link

Missed optimization opportunity for loads with a dynamic index with a known small value range (e.g 0,1). #51734

Open Artem-B opened 3 years ago

Artem-B commented 3 years ago
Bugzilla Link 52392
Version trunk
OS All
CC @alinas,@hfinkel,@RKSimon,@nikic,@rotateright

Extended Description

Loads from an read-only array of known values with a dynamic index which is known to have only few values (e.g 0 and 1), results in an unnecessary read from memory.

E.g: https://godbolt.org/z/4r675d6dM

// This keeps 'a' on stack and reads return value from there.
int slow(int num, int v1, int v2) {
    int a[2] = {v1, v2};
    return a[num&1];
}
//slow (int, int, int):
//        mov     dword ptr [rsp - 8], esi
//        mov     dword ptr [rsp - 4], edx
//        and     edi, 1
//        mov     eax, dword ptr [rsp + 4*rdi - 8]
//        ret
// If we add more code to manually fetch values from 'a', we end up 
// with returning one v1/v1 directly and no longer need 'a'.
int fast(int num, int v1, int v2) {
    int a[2] = {v1, v2};
    int i0 = a[0];
    int i1 = a[1];
    return (num&1) ? i1:i0;
}
// fast (int, int, int):
//        mov     eax, esi
//        test    dil, 1
//        cmovne  eax, edx
//        ret

Being able to get rid of local variables is rather important for performance on GPUs (NVPTX, AMDGPU) and picking one of the few items in an array is a fairly common pattern.

Artem-B commented 3 years ago

I could use some advice here.

I've implemented a fairly brute-force optimization to deal with that set-of-constants GEP indices: https://reviews.llvm.org/differential/diff/387439/

It transforms such an index into a chain of select(icmp(index, N), GEP(N), select(icmp(index N+1), GEP(N+1).... This produces rather verbose, but simple IR: https://alive2.llvm.org/ce/z/78oC-U Linear comparison could be optimized into a binary search later.

Subsequent EarlyCSE eliminates always-false selects and DCE cleans up unused compares: https://godbolt.org/z/acYT7Kjeq

Remaining select of GEPs with constant offsets improves SROA's ability to deal with allocas that we index into dynamically. E.g: https://godbolt.org/z/7G5qh6EcE

The problem is that so far I've found no better way to determine specific values for the dynamic index other than enumerate all possible values within small enough ConstantRange and rely on earlyCSE to delete the ones that are never used.

KnownBits/ConstantRange track values by the bit ranges, so my patch can only deal with the constants that are close to each other and do not flip a lot of bits. I.e. it can handle {1024,1025}, but not {1023,1024} and is much better at dealing with small indices in general.

I wonder if there's a better way to identify the possible constant values. EarlyCSA can tell whether the index can ever be equal to particular value, so I could use something similar to skip some compare/select/gep elements.

The question is -- is it worth bothering with, considering that another existing path already knows how to deal with this?

Another concern is whether there may be cases where EarlyCSE may not be able to see through the icmp and will not remove unneeded selects.

Are there any advantages/disadvantages in using selects vs  single switch/phy ? Creating switch/phy was more complicated, so I went with select for now. Switch/phy may be better suited for larger ranges, but it may be more expensive if it gets lowered into conditional jumps. Or it may be OK, if we end up with predicated execution instead.

Thoughts? Suggestions?

Artem-B commented 3 years ago

This optimization appears to fit SeparateConstOffsetFromGEP pass quite well. I've got a WIP patch that seems to do the job. The pass runs pretty late, so it should not interfere with the canonical select-of-indices form mentioned in llvm/llvm-project#49527 .

For now I handle only 0/1 values, but I plan to experiment and check whether it's worth handling 2- or 3-bit indices as well. Local loads/stores on GPU are expensive enough to be worth quite a few extra instructions.

I'll send the patch for review next week.

Artem-B commented 3 years ago

SROA uses PtrUseVisitor.h to traverse pointer uses and to track the offset. In this case, it considers the offset to be unknown and the pointer escapes on load.

One way to deal with this would be to enhance PtrUseVisitor to track {use, offset} instead of just the use itself.

In a way, proposed IR transformation does the same, as it materializes pointer uses with specific index values and those can then be tracked by PtrUseVisitor.

Enhancing PtrUseVisitor may be a better approach, compared to canonicalizing gep use patterns, as it would improve LLVM's ability to analyze the IR in general.

IR transformation to select of geps, on the other hand, makes IR easier to analyze for some existing passes, but may also regress other passes. It's unclear whether select of geps is universally better than select of indices, or vice versa.

Artem-B commented 3 years ago

We might want to distinguish the "few values" case from exactly {0,1}.

Makes sense. {0,1} maps nicely to select.

Other cases are less straightforward. They could be generalized as a switch/phy, but it's not clear whether that would be an improvement in general. That said, it still does result in much better GPU code: https://godbolt.org/z/57nhfYr3q

If we choose 'tgt', the other transforms may already be in place to get us the fast code.

It would definitely help SROA which GPU back-ends heavily depend on.

rotateright commented 3 years ago

We might want to distinguish the "few values" case from exactly {0,1}. That is, we need to decide if a select-of-gep is canonical:

define i32* @src(i32 %x, [2 x i32]* %p) {
  %a = and i32 %x, 1
  %z = zext i32 %a to i64
  %g = getelementptr [2 x i32], [2 x i32]* %p, i64 0, i64 %z
  ret i32* %g
}

define i32* @tgt(i32 %x, [2 x i32]* %p) {
  %t = trunc i32 %x to i1
  %g0 = getelementptr [2 x i32], [2 x i32]* %p, i64 0, i64 0
  %g1 = getelementptr [2 x i32], [2 x i32]* %p, i64 0, i64 1
  %r = select i1 %t, i32* %g1, i32* %g0
  ret i32* %r
}

https://alive2.llvm.org/ce/z/9uxVWk

Currently, we don't transform either way. If we choose 'tgt', the other transforms may already be in place to get us the fast code.

Artem-B commented 3 years ago

assigned to @Artem-B