llvm / llvm-project


warp id computation patterns not leveraged to use SGPRs afterwards in AMDGPU backend #54010

Open Epliz opened 2 years ago

Epliz commented 2 years ago

Hi,

I am writing HIP kernels and compiling with the LLVM packaged in ROCm 5.0.1 (HIP version: 5.0.13601-bb16828d, AMD clang version 14.0.0, https://github.com/RadeonOpenCompute/llvm-project roc-5.0.1 22051 235b6880e2e515507478181ec11a20c1ec87945b).

I noticed that the typical CUDA patterns for computing a warp ID, used in many CUDA applications, are not leveraged to put the warp ID in SGPRs, so subsequent loads/stores that use it as an index do not use scalar operations.

I believe this is a very frequent pattern in CUDA.

Here is a test case:

__global__ void warpIdTest(float* warp_scales, float* lane_offsets, float* a, float* res, int N) {
  // test kernel to show whether the warp ID/ lane ID patterns get optimized to use
  // SGPRs for the warp ID

  // doesn't work with 1D indexing
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // doesn't work with 1D indexing with 2D grid
  //int tid = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
  // doesn't work with 1D indexing with 3D grid
  //int tid = blockIdx.x * blockDim.x * blockDim.y * blockDim.z
  //          + threadIdx.z * blockDim.y * blockDim.x
  //          + threadIdx.y * blockDim.x + threadIdx.x;

  // wid is the "warp ID" and could be put in a SGPR
  int wid = tid / warpSize;
  int lid = tid % warpSize;
  if (tid < N) {
    // warp_scales[wid] could be loaded with a scalar load
    // the fma should have a scalar register as operand
    res[tid] = lane_offsets[lid] + warp_scales[wid] * a[tid];
  }
}

Would it be possible to get some optimization for that?

Thanks in advance

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-amdgpu

Epliz commented 2 years ago

@arsenm, sorry to disturb you, but you seem to be a pretty prolific contributor to the AMDGPU backend. Could you indicate whether such an optimization is theoretically possible? If not, I would be happy to close the issue.

arsenm commented 2 years ago

I think it is possible but would need to think about what the exact condition is. Something like (blockIdx * blockDim + threadId) / warpsize -> (blockIdx * blockDim) / warpsize and (blockIdx * blockDim + threadId) % warpsize -> threadId seems right.
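For illustration, here is the re-association in question spelled out as a kernel-body fragment. This is only a sketch under the assumption blockDim.x % warpSize == 0 (the very condition discussed below), with illustrative variable names:

  // valid only when blockDim.x % warpSize == 0, so that
  // blockIdx.x * blockDim.x is itself a multiple of warpSize
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int wid = tid / warpSize;                              // as written in the kernel above
  int wid_split = (blockIdx.x * blockDim.x) / warpSize   // uniform, block-level part
                + threadIdx.x / warpSize;                // wave index within the block
  int lid = tid % warpSize;                              // simplifies to threadIdx.x % warpSize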

Epliz commented 2 years ago

@arsenm, @bcahoon, I just saw that the rocWMMA library makes some use of the wave/lane IDs (see https://github.com/ROCmSoftwarePlatform/rocWMMA/blob/develop/library/include/rocwmma/internal/mapping_util_impl.hpp ). Do those functions lead to optimized assembly that uses SGPRs for the wave ID?

arsenm commented 2 years ago

There's nothing magic in that header.

I looked again and I'm not sure I see how this is possible. The divergence of the threadIdx doesn't go away on re-association of the division (which you can only do if the block size is a multiple of the warp size). If you want a warp index, you want to divide just the block part here.

Epliz commented 2 years ago

@arsenm, I am not sure I understand your reply, especially the "divergence" and "re-association" parts. Let me describe my assumptions about the hardware, my goal, and then why I think what I want to do is safe under those assumptions. Feel free to tell me where I am wrong; you certainly know much more about the hardware than I do.

My assumptions about the hardware are:

1) There is no per-thread divergence, i.e. no per-thread program counter like in post-V100 NVIDIA GPUs. If there were such a thing, this would certainly not be safe. (But I think the hardware most likely could not have that feature and SGPRs anyway.)
2) In a wavefront, all threads come from the same block.
3) If a block has a size that is not a multiple of the SIMD size, it is padded with inactive threads.
4) Blocks are scheduled on a single compute unit, and can be broken down into several wavefronts that can run on several SIMD units within the same compute unit.

Let me know if those assumptions are wrong.

My goal is basically to get an identifier that can help me synchronize several wavefronts from the same block (each wavefront hopefully running concurrently on the different SIMDs of the compute unit), e.g. to do per-compute-unit reductions. In the process, I would hope that some loads/stores become SGPR loads/stores. I know that values derived from block indices already go to SGPRs; that makes sense and it is good that it already happens. I just want to go one step further and have values derived from effectively wavefront-homogeneous indices use SGPRs too.

Any correctly linearized thread index (threadIdx.x for 1D, threadIdx.y * blockDim.x + threadIdx.x for 2D, etc.), divided by a multiple of the warp size, should give a wavefront-homogeneous value as far as I understand. I am currently using inline ASM with v_readfirstlane (which seems perfect for this) to read the computed warp ID out of the VGPRs. Everything seems fine so far when doing that.
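For reference, a minimal sketch of the workaround described above. It assumes the clang builtin __builtin_amdgcn_readfirstlane is available in the HIP toolchain in use (inline asm with v_readfirstlane_b32 would be the equivalent); the helper name is made up for illustration:

  __device__ int waveIdInBlock() {
    // linearized local thread ID (2D example), divided by the wave size
    int ltid = threadIdx.y * blockDim.x + threadIdx.x;
    int wid = ltid / warpSize;
    // force the wave-uniform value into an SGPR
    return __builtin_amdgcn_readfirstlane(wid);
  }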

arsenm commented 2 years ago

The problem here is that the block size needs to be a multiple of the wave size to do anything with this, which is not guaranteed. There's no existing optimization hint to assume it either (maybe we could figure out a __builtin_assume that reports this?). The thread ID value you're computing doesn't depend on how odd-sized workgroups are organized.
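Purely as a hypothetical illustration of the kind of hint being alluded to (clang's generic __builtin_assume exists, but nothing currently turns an assumption like this into the uniformity fact the backend would need; the kernel name is made up):

  __global__ void kernel(float *out) {
    // hypothetical hint: promise that the launch uses a wave-multiple block size
    __builtin_assume(blockDim.x % warpSize == 0);
    int wid = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    out[threadIdx.x] = wid;  // wid would be wave-uniform under the assumption
  }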

If you want the warp ID, why not just divide that part? i.e. wid = blockIdx.x * blockDim.x / waveSize?

Epliz commented 2 years ago

@arsenm, I don't really see why this depends on the block size, as long as my assumption 3) is correct. For example, with a block size of warpSize + 1, I assume the hardware launches two wavefronts: one full of active threads, and one with just a single active thread. Still, the threads in the first wavefront can get wave ID 0 and the ones in the second can get wave ID 1.

What you propose does not meet my goal of being able to assign a different ID to the different wavefronts of the same block; they would all have the same ID.

arsenm commented 2 years ago

I feel like I'm missing something obvious here. Your proposal seems to make sense at first glance but I don't see what I can do with this when I look at it closer. The API here isn't exposing how the groups are padded and is in terms of the logical group size.

I've been trying to think of this in terms of factoring out the uniform and divergent part of the expression but maybe the trick is we just need to special case the divergence analysis for the expression. You still end up with a readfirstlane though.

Epliz commented 2 years ago

I know the API does not expose this kind of detail about how blocks are formed and launched on the hardware, but as far as I can tell, you are free to implement extra optimizations in the backend for the GPUs where it would work (as far as I know, all GCN/RDNA GPUs), similarly to how you have implemented barrier elimination.

I am currently using inline ASM with readfirstlane to transfer the value from a VGPR to an SGPR, and it works. Many CUDA programs use this pattern, though, and won't use inline ASM to optimize it. I think that, as you said, what is desired is detecting a "global thread ID" computation pattern (one of the three I posted in the first message) followed by a division by a (positive) multiple of the wavefront size, and then inserting the readfirstlane. The same can be applied to "local thread ID" computation patterns (e.g. threadIdx.x, or threadIdx.y * blockDim.x + threadIdx.x) followed by a division by a multiple of the wavefront size.

arsenm commented 2 years ago

https://reviews.llvm.org/D124385 handles the simplest case for this

Epliz commented 2 years ago

Very nice! Thanks a lot for your patience. I am not sure I understood your changes correctly, but it seems you also made the remainder of the division go into an SGPR; however, the remainder is not uniform within the wave, only the result of the division is. Sorry if I confused you with my initial example on that part.

Epliz commented 2 years ago

Also, I was wrong when saying that the "global thread ID" can be used to calculate a wave ID in the general case. As you mentioned earlier, that requires the block size to be a multiple of the SIMD size. Only local linear thread IDs can be used to compute a wave ID. Sorry about that.

arsenm commented 1 year ago

Coming back to this patch, I'm once again confused. Did you mean to compute a linear ID such as the one described in https://reviews.llvm.org/D124385#3484855?

Epliz commented 1 year ago

Yes, exactly like that.

Epliz commented 1 year ago

As mentioned in the LLVM review, I guess this optimization would require a HIP kernel function attribute similar to OpenCL's reqd_work_group_size in order to determine that the block size is a multiple of the wavefront size. I hope something like that gets added too.
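For context, this is roughly what such an annotation looks like when the OpenCL spelling is reused on a HIP kernel; the test kernels later in this thread use exactly this form, and it only compiles once the restriction discussed below is lifted (kernel name is illustrative):

  // hypothetical usage: promise the kernel is always launched with a 256x1x1 block
  __attribute__((reqd_work_group_size(256, 1, 1)))
  __global__ void scaleByWave(const float* a, float* b) {
    int wid = threadIdx.x / warpSize;  // uniform, since 256 is a multiple of warpSize
    b[threadIdx.x] = a[wid];
  }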

arsenm commented 1 year ago

Yes, exactly like that.

But that's a different expression from what you have here?

Epliz commented 1 year ago

OK, let me try to summarize, for myself (I also got confused by what I wrote previously) and also for you.

In my original post, I guess I was showing computations with global IDs. But it is also possible to optimize with local IDs instead, and more easily, as it doesn't require knowing the exact block size.

For local IDs, computed as indicated in the first message:

  // 1D local ID
  int ltid = threadIdx.x;
  // 2D local ID
  int ltid = threadIdx.y * blockDim.x + threadIdx.x;
  // 3D local ID
  int ltid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;

We can always know that ltid / K (where K is a multiple of the wavefront size) is uniform as long as we use an N-D index and the block is actually N-D. (So I guess we need to know whether the kernel is launched with the correct block size dimensionality? So maybe we still need to know the block size for that?)

For global IDs, computed as indicated in the first message:

  // 1D global ID
  int blockId = blockIdx.x;
  int blockSize = blockDim.x;
  int ltid = threadIdx.x;
  int gtid = blockId * blockSize + ltid;

  // 2D global ID
  int blockId = blockIdx.y * gridDim.x + blockIdx.x;
  int blockSize = blockDim.x * blockDim.y;
  int ltid = threadIdx.y * blockDim.x + threadIdx.x;
  int gtid = blockId * blockSize + ltid;

  // 3D global ID
  int blockId = (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x;
  int blockSize = blockDim.x * blockDim.y * blockDim.z;
  int ltid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
  int gtid = blockId * blockSize + ltid;

We can only know that gtid / K_1 (where K_1 is a multiple of the wavefront size) is uniform if we know that the block size K_2 is itself a multiple of the wavefront size (K_1 and K_2 can differ, as far as I can tell), and that we use N-D indexing with an N-D block. For that we need to know the block size.
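A small counterexample showing why the block-size condition is needed (numbers assume warpSize == 64):

  // blockDim.x == 65 (not a multiple of 64), 1D indexing:
  //   block 1 starts at gtid 65, so its first hardware wave holds gtid 65..128
  //   gtid 127 -> 127 / 64 == 1
  //   gtid 128 -> 128 / 64 == 2   // two different values inside one wave: not uniform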

I hope this time I got it right...

nhaehnle commented 1 year ago

We can always know that ltid / K (where K is a multiple of the wavefront size) is uniform as long as we use an N-D index and the block is actually N-D. (So I guess we need to know whether the kernel is launched with the correct block size dimensionality? So maybe we still need to know the block size for that?)

ltid is what Vulkan calls the LocalInvocationIndex and yes, that property holds. It may make sense for us to introduce a LocalInvocationIndex intrinsic (llvm.amdgcn.workitem.index?) as a higher-level representation and leverage that. But it's a non-trivial effort to get right for various reasons.

Epliz commented 1 year ago

@nhaehnle, thanks for the reply. I totally understand that it is probably not trivial, but maybe there is a way to decompose it into smallish steps?

For example, just improving the case of the 3D local ID computation might be a good place to start? If we compute it as:

  int ltid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;

Then no matter whether the actual block is 1D, 2D, or 3D (whatever the number of dimensions), it is fine, as the ID is linear in all cases. If there is a function attribute similar to OpenCL's reqd_work_group_size, then we can check that the block size K is a multiple of the wavefront size and apply the optimization to ltid / K.

Given that reqd_work_group_size gives us the block size, when computing ltid all multiplications are probably either shifts or multiplications by a constant, so maybe what needs to be matched is something like:

  int ltid = threadIdx.z * C + threadIdx.y * B + threadIdx.x;

Where we should check B == blockDim.x and C == (blockDim.y * blockDim.x).

The next step is then to add more intelligence for the cases where blockDim.y or blockDim.z is equal to 1, in which case threadIdx.y or threadIdx.z is always 0 and therefore does not need to be part of the computation of ltid.

At that point, we actually support all cases for the local ID.

The next step after that is support for the global ID cases, which seems a moderate step: again with reqd_work_group_size applied, it should mainly consist in checking that we find a pattern gtid = ltid + blockSize * D, with D being anything, blockSize matching what reqd_work_group_size gives (or a multiple of it), and the block size being a multiple of the wavefront size.

No idea if that makes any sense, I hope it does.

Best regards, Epliz

nhaehnle commented 1 year ago

Yes, of course this can be done bit by bit. Again, I suspect it's best to introduce intrinsics to represent some of these things at a higher level. For example, having a "wave ID" intrinsic would help the instruction selection with understanding that an SGPR can be used even though some of the input values for the computation are in VGPRs. (Since we don't generically insert readfirstlane automatically in cases where a value is known to be uniform but the inputs to its calculation aren't.) Feel free to give it a try ;)

Epliz commented 1 year ago

Hi @nhaehnle , @arsenm ,

I have been giving it a try, even though I don't know much about LLVM, so I am kind of doing it by extending @arsenm's code and hoping for the best. I don't really know how to introduce an intrinsic properly, so I guess I will leave that to you if I manage to get my current approach working.

There seem to be some cases where, even though the source code uses the base patterns, by the time the backend sees them some parts are missing because the divisions and shifts were simplified using the knowledge of the block sizes (which is great in itself). For example, (threadIdx.x + blockDim.x * threadIdx.y) / warpSize might already be simplified to threadIdx.y if blockDim.x is warpSize. I guess detecting the pattern much earlier and using the intrinsic would solve that?
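A minimal sketch of the simplification meant here, assuming blockDim.x is known (e.g. via reqd_work_group_size) to equal warpSize; variable names are illustrative:

  // what the source says:
  int wid = (threadIdx.x + blockDim.x * threadIdx.y) / warpSize;
  // with blockDim.x == warpSize and threadIdx.x < warpSize, this can fold to just:
  int wid_folded = threadIdx.y;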

I also came to the realization that when we have the block size, it is possible to know when threadIdx.y and threadIdx.z are uniform, and I added something for that.

To test my changes, I basically allowed the reqd_work_group_size attribute for HIP. However, I imagine that a new attribute named reqd_block_dims might be a better name for HIP. What would be the process for adding such a thing?

Overall I am a bit surprised by how brittle the detection of uniform expressions is. I thought there would be some pass detecting which base expressions are uniform and propagating that information to the expressions using them, but there doesn't seem to be such a thing?

In any case, I can post my current patch here in the next few days, and I hope you can help me figure out if I am doing things totally wrong or not.

Best, Epliz

arsenm commented 1 year ago

As far as the attribute is concerned, if you have a patch you can just post it to Phabricator. I think just lifting the current artificial restriction is probably fine.

The regular divergence analysis finds uniform expressions through data flow, which covers most expressions. This is a special case that requires specific interpretation, so you just have to pattern match any possibly interesting expression.

nhaehnle commented 1 year ago

I don't really have an idea of how to introduce an intrinsic properly, so I guess I will leave that to you if I manage to get my current approach working.

Introducing an intrinsic properly isn't that difficult, you can follow existing examples.

What is your current approach, if you're not using new intrinsics?

I thought there would be some pass detecting what base expressions are uniform and propagate the information to expressions using them, but there doesn't seem to be such thing?

There is, but it's a standard data flow analysis fix point iteration that marks values as divergent if any of the inputs into their computation are divergent.
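A rough sketch of that kind of forward propagation, to contrast it with what is wanted here (this is not the actual LLVM implementation, just the shape of the fixed-point worklist iteration over a toy value graph):

  #include <queue>
  #include <set>
  #include <vector>

  // Values are node indices; users[v] lists the values computed from v.
  // Anything reachable from a divergent source is marked divergent;
  // everything else stays uniform.
  std::set<int> propagateDivergence(const std::vector<std::vector<int>> &users,
                                    const std::vector<int> &divergentSources) {
    std::set<int> divergent(divergentSources.begin(), divergentSources.end());
    std::queue<int> worklist;
    for (int v : divergentSources)
      worklist.push(v);
    while (!worklist.empty()) {
      int v = worklist.front();
      worklist.pop();
      for (int user : users[v])
        if (divergent.insert(user).second) // newly marked divergent
          worklist.push(user);
    }
    return divergent;
  }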

You now want to do sort of the opposite: take computations where the inputs are known to be divergent, but additional analysis can prove that the result is uniform anyway. The existing infrastructure simply isn't built for that, so an extension of some form is necessary.

That's why I'm curious about what your approach is. I tend to expect that any approach that's maintainable will use intrinsics in some form, so it makes me nervous when you say that you don't. Unless you're just inserting the existing readfirstlane intrinsic?

Epliz commented 1 year ago

Hi, here are my current commits, which do not yet pass all the test cases I have written:

1) Just allowing reqd_work_group_size for HIP
2) Detecting when threadIdx.y and threadIdx.z are uniform (basic cases, could add a bit more there)
3) Detecting the local thread ID patterns (doesn't detect expanded expressions and a few more cases)

Patches: 1)

From f62fecbda80fd0e2cf6f326ff9d80722939ea0f5 Mon Sep 17 00:00:00 2001
From: Epliz <xxx@yyy.com>
Date: Sun, 29 Jan 2023 19:12:58 +0100
Subject: [PATCH 1/3] Propagate reqd_work_group_size even for HIP

---
 clang/lib/CodeGen/CodeGenFunction.cpp | 19 +++++++++++--------
 clang/lib/Sema/SemaDeclAttr.cpp       |  9 +++++++--
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/clang/lib/CodeGen/CodeGenFunction.cpp b/clang/lib/CodeGen/CodeGenFunction.cpp
index 55464e1b1636..84fde58334bc 100644
--- a/clang/lib/CodeGen/CodeGenFunction.cpp
+++ b/clang/lib/CodeGen/CodeGenFunction.cpp
@@ -589,6 +589,17 @@ void CodeGenFunction::EmitKernelMetadata(const FunctionDecl *FD,

   CGM.GenKernelArgMetadata(Fn, FD, this);

+  if (!getLangOpts().OpenCL && !getLangOpts().HIP)
+    return;
+
+  if (const ReqdWorkGroupSizeAttr *A = FD->getAttr<ReqdWorkGroupSizeAttr>()) {
+    llvm::Metadata *AttrMDArgs[] = {
+        llvm::ConstantAsMetadata::get(Builder.getInt32(A->getXDim())),
+        llvm::ConstantAsMetadata::get(Builder.getInt32(A->getYDim())),
+        llvm::ConstantAsMetadata::get(Builder.getInt32(A->getZDim()))};
+    Fn->setMetadata("reqd_work_group_size", llvm::MDNode::get(Context, AttrMDArgs));
+  }
+
   if (!getLangOpts().OpenCL)
     return;

@@ -615,14 +626,6 @@ void CodeGenFunction::EmitKernelMetadata(const FunctionDecl *FD,
     Fn->setMetadata("work_group_size_hint", llvm::MDNode::get(Context, AttrMDArgs));
   }

-  if (const ReqdWorkGroupSizeAttr *A = FD->getAttr<ReqdWorkGroupSizeAttr>()) {
-    llvm::Metadata *AttrMDArgs[] = {
-        llvm::ConstantAsMetadata::get(Builder.getInt32(A->getXDim())),
-        llvm::ConstantAsMetadata::get(Builder.getInt32(A->getYDim())),
-        llvm::ConstantAsMetadata::get(Builder.getInt32(A->getZDim()))};
-    Fn->setMetadata("reqd_work_group_size", llvm::MDNode::get(Context, AttrMDArgs));
-  }
-
   if (const OpenCLIntelReqdSubGroupSizeAttr *A =
           FD->getAttr<OpenCLIntelReqdSubGroupSizeAttr>()) {
     llvm::Metadata *AttrMDArgs[] = {
diff --git a/clang/lib/Sema/SemaDeclAttr.cpp b/clang/lib/Sema/SemaDeclAttr.cpp
index a303c7f57280..2f4895f16938 100644
--- a/clang/lib/Sema/SemaDeclAttr.cpp
+++ b/clang/lib/Sema/SemaDeclAttr.cpp
@@ -9376,14 +9376,19 @@ void Sema::ProcessDeclAttributeList(
   // good to have a way to specify "these attributes must appear as a group",
   // for these. Additionally, it would be good to have a way to specify "these
   // attribute must never appear as a group" for attributes like cold and hot.
-  if (!D->hasAttr<OpenCLKernelAttr>()) {
+  if (!D->hasAttr<OpenCLKernelAttr>()
+      && !(LangOpts.HIP && D->getAttr<CUDAGlobalAttr>())) {
     // These attributes cannot be applied to a non-kernel function.
     if (const auto *A = D->getAttr<ReqdWorkGroupSizeAttr>()) {
       // FIXME: This emits a different error message than
       // diag::err_attribute_wrong_decl_type + ExpectedKernelFunction.
       Diag(D->getLocation(), diag::err_opencl_kernel_attr) << A;
       D->setInvalidDecl();
-    } else if (const auto *A = D->getAttr<WorkGroupSizeHintAttr>()) {
+    }
+  }
+  if (!D->hasAttr<OpenCLKernelAttr>()) {
+    // These attributes cannot be applied to a non-kernel function.
+    if (const auto *A = D->getAttr<WorkGroupSizeHintAttr>()) {
       Diag(D->getLocation(), diag::err_opencl_kernel_attr) << A;
       D->setInvalidDecl();
     } else if (const auto *A = D->getAttr<VecTypeHintAttr>()) {
-- 
2.34.1

2)

From b798268365d50239ffc4d158e99816d2a408d11b Mon Sep 17 00:00:00 2001
From: Epliz <<xxx@yyy.com>>
Date: Sun, 12 Feb 2023 17:37:29 +0100
Subject: [PATCH 2/3] Detect when threadIdx.y and threadIdx.z are uniform due
 to a known work group size

---
 llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp    |  2 +-
 llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h      |  4 ++
 .../AMDGPU/AMDGPUTargetTransformInfo.cpp      | 65 ++++++++++++++++++-
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index 7af0a7c9e045..d4ad4540b108 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -430,7 +430,7 @@ std::pair<unsigned, unsigned> AMDGPUSubtarget::getWavesPerEU(
   return Requested;
 }

-static unsigned getReqdWorkGroupSize(const Function &Kernel, unsigned Dim) {
+unsigned AMDGPUSubtarget::getReqdWorkGroupSize(const Function &Kernel, unsigned Dim) const {
   auto Node = Kernel.getMetadata("reqd_work_group_size");
   if (Node && Node->getNumOperands() == 3)
     return mdconst::extract<ConstantInt>(Node->getOperand(Dim))->getZExtValue();
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
index 7400c81effd0..fad9c14760ef 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
@@ -254,6 +254,10 @@ public:
   /// subtarget without any kind of limitation.
   unsigned getMaxWavesPerEU() const { return MaxWavesPerEU; }

+  ///  \returns the required group size (if known thanks to the
+  /// reqd_work_group_size attribute) for the given (0, 1, 2) dimension.
+  unsigned getReqdWorkGroupSize(const Function &Kernel, unsigned Dim) const;
+
   /// Return the maximum workitem ID value in the function, for the given (0, 1,
   /// 2) dimension.
   unsigned getMaxWorkitemID(const Function &Kernel, unsigned Dimension) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 0c3324f84b25..2be5dbd2b0da 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -26,6 +26,7 @@
 #include <optional>

 using namespace llvm;
+using namespace PatternMatch;

 #define DEBUG_TYPE "AMDGPUtti"

@@ -927,11 +928,72 @@ bool GCNTTIImpl::isSourceOfDivergence(const Value *V) const {
   return false;
 }

+static const Instruction* unwrapUniformityNeutralInstructions(const Instruction *Inst) {
+  // some instructions don't change whether the result is uniform
+  // in any circumstances, so unwrap them
+  bool unwrappedSomething;
+  do {
+    unwrappedSomething = false;
+    Instruction *NextInst;
+    if (match(Inst, m_ZExt(m_Instruction(NextInst))) ||
+        match(Inst, m_Trunc(m_Instruction(NextInst)))) {
+      Inst = NextInst;
+      unwrappedSomething = true;
+    }
+  } while(unwrappedSomething);
+
+  return Inst;
+}
+
 bool GCNTTIImpl::isAlwaysUniform(const Value *V) const {
+  bool uniformWorkGroupSize = false;
+
+  unsigned ReqdXSize = std::numeric_limits<unsigned>::max();
+  unsigned ReqdYSize = std::numeric_limits<unsigned>::max();
+  unsigned ReqdZSize = std::numeric_limits<unsigned>::max();
+
+  bool hasReqdXSize = false;
+  bool hasReqdYSize = false;
+
+  const Instruction *Inst = dyn_cast<Instruction>(V);
+
+  if (Inst) {
+    const Function *F = Inst->getFunction();
+
+    uniformWorkGroupSize = F->hasFnAttribute("uniform-work-group-size")
+      && F->getFnAttribute("uniform-work-group-size").getValueAsBool();
+
+    ReqdXSize = ST->getReqdWorkGroupSize(*F, 0);
+    ReqdYSize = ST->getReqdWorkGroupSize(*F, 1);
+    ReqdZSize = ST->getReqdWorkGroupSize(*F, 2);
+
+    hasReqdXSize = ReqdXSize != std::numeric_limits<unsigned>::max();
+    hasReqdYSize = ReqdYSize != std::numeric_limits<unsigned>::max();
+  }
+
+  if (Inst) {
+    // some instructions don't change whether the result is uniform
+    // in any circumstances, so unwrap them
+    Inst = unwrapUniformityNeutralInstructions(Inst);
+
+    // replace V with Inst in the rest of the function
+    V = Inst;
+  }
+
   if (const IntrinsicInst *Intrinsic = dyn_cast<IntrinsicInst>(V)) {
     switch (Intrinsic->getIntrinsicID()) {
     default:
       return false;
+    case Intrinsic::amdgcn_workitem_id_y: {
+      return uniformWorkGroupSize // do we really need the work groups to be of uniform size?
+        // when blockDim.x is a multiple of the wavefront size, threadIdx.y is uniform
+        && (hasReqdXSize && (ReqdXSize % ST->getWavefrontSize()) == 0);
+    }
+    case Intrinsic::amdgcn_workitem_id_z: {
+      return uniformWorkGroupSize // do we really need the work groups to be of uniform size?
+        // when blockDim.x*blockDim.y is a multiple of the wavefront size, threadIdx.z is uniform
+        && (hasReqdXSize && hasReqdYSize && ((ReqdXSize * ReqdYSize) % ST->getWavefrontSize()) == 0);
+    }
     case Intrinsic::amdgcn_readfirstlane:
     case Intrinsic::amdgcn_readlane:
     case Intrinsic::amdgcn_icmp:
@@ -950,7 +1012,7 @@ bool GCNTTIImpl::isAlwaysUniform(const Value *V) const {

   // In most cases TID / wavefrontsize is uniform.
   //
-  // However, if a kernel has uneven dimesions we can have a value of
+  // However, if a kernel has uneven dimensions we can have a value of
   // workitem-id-x divided by the wavefrontsize non-uniform. For example
   // dimensions (65, 2) will have workitems with address (64, 0) and (0, 1)
   // packed into a same wave which gives 1 and 0 after the division by 64
@@ -959,7 +1021,6 @@ bool GCNTTIImpl::isAlwaysUniform(const Value *V) const {
   // FIXME: limit it to 1D kernels only, although that shall be possible
   // to perform this optimization is the size of the X dimension is a power
   // of 2, we just do not currently have infrastructure to query it.
-  using namespace llvm::PatternMatch;
   uint64_t C;
   if (match(V, m_LShr(m_Intrinsic<Intrinsic::amdgcn_workitem_id_x>(),
                       m_ConstantInt(C))) ||
-- 
2.34.1

3)

From ef188b3dad35fee3dfc03c93270c156d5a989190 Mon Sep 17 00:00:00 2001
From: Epliz <<xxx@yyy.com>>
Date: Mon, 6 Feb 2023 18:39:26 +0100
Subject: [PATCH 3/3] Detect warp ID computations from local thread ID
 computations as uniforms

---
 .../Target/AMDGPU/AMDGPUCodeGenPrepare.cpp    |  25 +-
 .../AMDGPU/AMDGPUTargetTransformInfo.cpp      | 279 ++++++++++++++++++
 2 files changed, 299 insertions(+), 5 deletions(-)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index 08b29641d14a..a4824d83604b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -1063,17 +1063,32 @@ Value *AMDGPUCodeGenPrepare::expandDivRem32(IRBuilder<> &Builder,
                                             BinaryOperator &I, Value *X,
                                             Value *Y) const {
   Instruction::BinaryOps Opc = I.getOpcode();
-  assert(Opc == Instruction::URem || Opc == Instruction::UDiv ||
-         Opc == Instruction::SRem || Opc == Instruction::SDiv);
+  bool IsDiv = Opc == Instruction::UDiv || Opc == Instruction::SDiv;
+  assert(IsDiv || Opc == Instruction::URem || Opc == Instruction::SRem);

   FastMathFlags FMF;
   FMF.setFast();
   Builder.setFastMathFlags(FMF);

-  if (divHasSpecialOptimization(I, X, Y))
+  if (divHasSpecialOptimization(I, X, Y)) {
+    // Special case for wave ID expression. The divide result is known wave
+    // uniform, but the source numerator is not. Hack in a readfirstlane to
+    // inform codegen of this.
+    //
+    // FIXME: Ideally the codegen divergence analysis would recognize the same
+    // pattern.
+    if (IsDiv && DA->isUniform(&I) && !DA->isUniform(I.getOperand(0))) {
+      Function *Readfirstlane =
+          Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_readfirstlane);
+
+      // Clone the operation to simplify the value replacement.
+      Value *NewOp = Builder.CreateBinOp(I.getOpcode(), X, Y, I.getName());
+      cast<Instruction>(NewOp)->copyIRFlags(&I);
+
+      return Builder.CreateCall(Readfirstlane, {NewOp});
+    }
     return nullptr;  // Keep it for later optimization.
-
-  bool IsDiv = Opc == Instruction::UDiv || Opc == Instruction::SDiv;
+  }
   bool IsSigned = Opc == Instruction::SRem || Opc == Instruction::SDiv;

   Type *Ty = X->getType();
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 2be5dbd2b0da..42066170d1f2 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -945,6 +945,156 @@ static const Instruction* unwrapUniformityNeutralInstructions(const Instruction
   return Inst;
 }

+// Determine if the constant provided is the required
+// group size
+static bool isGroupSize(uint64_t C, unsigned ReqdSize) {
+  if (ReqdSize == std::numeric_limits<unsigned>::max()) {
+    // the kernel doesn't have the attribute giving us the block size
+    // we can't know if the constant matches the expected value
+    return false;
+  }
+  
+  return (ReqdSize == C);
+}
+
+// Determine if the constant provided is the log2 of the required
+// group size
+static bool isGroupSizeLog2(uint64_t C, unsigned ReqdSize) {
+  if (ReqdSize == std::numeric_limits<unsigned>::max()) {
+    // the kernel doesn't have the attribute giving us the block size
+    // we can't know if the constant matches the expected value
+    return false;
+  }
+  
+  return (ReqdSize == (1ul << C));
+}
+
+/// Recognize a load from the offsets where workgroup sizes are stored in the
+/// dispatch packet.
+static bool isLoadFromGroupSize(const DataLayout &DL, Value *V,
+                                unsigned GroupIdx) {
+  Value *ZextSrc = nullptr;
+  if (!match(V, m_ZExt(m_Value(ZextSrc))) ||
+      !ZextSrc->getType()->isIntegerTy(16))
+    return false;
+
+  LoadInst *LI = dyn_cast<LoadInst>(ZextSrc);
+  if (!LI)
+    return false;
+
+  int64_t Offset = 0;
+  Value *PtrBase =
+      GetPointerBaseWithConstantOffset(LI->getPointerOperand(), Offset, DL);
+  return match(PtrBase, m_Intrinsic<Intrinsic::amdgcn_dispatch_ptr>()) &&
+         Offset == 4 + 2 * GroupIdx;
+}
+
+static bool matchMulGroupSize(
+    const DataLayout &DL,
+    const Instruction *Num,
+    Instruction*& NextNum,
+    unsigned ReqdXSize,
+    unsigned dim
+  ) {
+  Value* MulLHS;
+  uint64_t C;
+  return (match(Num, m_c_Mul(m_Instruction(NextNum), m_ConstantInt(C)))
+        && isGroupSize(C, ReqdXSize))
+        || (match(Num, m_Shl(m_Instruction(NextNum), m_ConstantInt(C)))
+        && isGroupSizeLog2(C, ReqdXSize))
+        // TODO: do we need to try matching both orders explicitly?
+        || (match(Num, m_c_Mul(m_Instruction(NextNum), m_Value(MulLHS)))
+        && isLoadFromGroupSize(DL, MulLHS, dim));
+}
+
+static bool matchLocalThreadIdPattern(
+    unsigned currentDepth,
+    unsigned countMultipliedByDimX,
+    unsigned countMultipliedByDimY,
+    const DataLayout &DL,
+    const Instruction *Num,
+    unsigned ReqdXSize,
+    unsigned ReqdYSize,
+    unsigned ReqdZSize,
+    unsigned& tidxMatchCount,
+    unsigned& tidyMatchCount,
+    unsigned& tidzMatchCount
+) {
+
+  if (currentDepth >= 3) {
+    return false; // expression is too deep, matching failed
+  }
+
+  Num = unwrapUniformityNeutralInstructions(Num);
+
+  auto ItemIDXIntrin = m_Intrinsic<Intrinsic::amdgcn_workitem_id_x>();
+  auto ItemIDYIntrin = m_Intrinsic<Intrinsic::amdgcn_workitem_id_y>();
+  auto ItemIDZIntrin = m_Intrinsic<Intrinsic::amdgcn_workitem_id_z>();
+
+  Instruction *NextNum;
+  if (match(Num, ItemIDXIntrin)) {
+    if (countMultipliedByDimX == 0
+        && countMultipliedByDimY == 0) {
+      tidxMatchCount++;
+      return true; // can't destructure further
+    }
+  } else if (match(Num, ItemIDYIntrin)) {
+    if (countMultipliedByDimX == 1
+        && countMultipliedByDimY == 0) { // TODO or optional multiplier?
+      tidyMatchCount++;
+      return true; // can't destructure further
+    }
+  } else if (match(Num, ItemIDZIntrin)) {
+    if (countMultipliedByDimX == 1
+        && countMultipliedByDimY == 1) {  // TODO or optional multiplier?
+      tidzMatchCount++;
+      return true; // can't destructure further
+    }
+  } else if (match(Num, m_c_Add(m_Instruction(NextNum), ItemIDXIntrin))) {
+    if (countMultipliedByDimX == 0
+        && countMultipliedByDimY == 0) {
+      tidxMatchCount++;
+    }
+    //
+    Num = NextNum;
+    // TODO support add when the expression is not factorized
+    if (matchMulGroupSize(DL, Num, NextNum, ReqdXSize, 0)) {
+        // I Was here
+      // TODO: maybe we should not match the multiplicatiom commutatively?
+      return matchLocalThreadIdPattern(
+          currentDepth + 1,
+          countMultipliedByDimX + 1,
+          countMultipliedByDimY,
+          DL,
+          NextNum,
+          ReqdXSize, ReqdYSize, ReqdZSize,
+          tidxMatchCount, tidyMatchCount, tidzMatchCount
+        );
+    }
+
+  } else if (match(Num, m_c_Add(m_Instruction(NextNum), ItemIDYIntrin))) {
+    if (countMultipliedByDimX == 1
+        && countMultipliedByDimY == 0) {
+      tidyMatchCount++;
+    }
+    //
+    Num = NextNum;
+    // TODO support add when the expression is not factorized
+    if (matchMulGroupSize(DL, Num, NextNum, ReqdYSize, 1)) {
+      return matchLocalThreadIdPattern(
+          currentDepth + 1,
+          countMultipliedByDimX,
+          countMultipliedByDimY + 1,
+          DL,
+          NextNum,
+          ReqdXSize, ReqdYSize, ReqdZSize,
+          tidxMatchCount, tidyMatchCount, tidzMatchCount
+        );
+    }
+  }
+  return false; // matching failed
+}
+
 bool GCNTTIImpl::isAlwaysUniform(const Value *V) const {
   bool uniformWorkGroupSize = false;

@@ -1041,6 +1191,135 @@ bool GCNTTIImpl::isAlwaysUniform(const Value *V) const {
            ST->getMaxWorkitemID(*F, 1) == 0 && ST->getMaxWorkitemID(*F, 2) == 0;
   }

+  if (uniformWorkGroupSize) {
+    if (Inst) {
+      switch (Inst->getOpcode()) {
+      case Instruction::AShr:
+      case Instruction::LShr:
+      case Instruction::SDiv:
+      case Instruction::UDiv:
+      case Instruction::And: {
+
+        // Special case for pattern used for wave indexing calculations such as:
+        // 
+        // Local thread ID patterns:
+        // 3D local thread ID:
+        // ((threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadId.x) / (K * wavesize)
+        // 2D local thread ID:
+        // (threadIdx.y * blockDim.x + threadId.x) / (K * wavesize)
+        // 1D local thread ID:
+        // threadId.x / (K * wavesize)
+        //
+        //
+        // In those cases, the division result is uniform if K>=1 .
+        // In the case where K is a power of two, the division corresponds to some right shift
+        // so we need to pay attention to that.
+
+        bool divisionByWavesizeMultiple = false;
+
+        Instruction *Num;
+ 
+        if (match(Inst, m_LShr(m_Instruction(Num), m_ConstantInt(C))) ||
+            match(Inst, m_AShr(m_Instruction(Num), m_ConstantInt(C)))) {
+          // when the division is by a power-of-two multiple of the wavefront size,
+          // we should get shifts
+          bool shiftBiggerThanWavesizeLog2 = (C >= ST->getWavefrontSizeLog2());
+          if (shiftBiggerThanWavesizeLog2) {
+            divisionByWavesizeMultiple = true;
+          }
+        } else if (match(Inst, m_SDiv(m_Instruction(Num), m_ConstantInt(C))) ||
+            match(Inst, m_UDiv(m_Instruction(Num), m_ConstantInt(C)))) {
+          // we can get a real division potentially if the divisor is non-power-of-two multiple of
+          // the wavefront size
+          // TODO should maybe check C > 0?
+          bool waveSizeMultiple = (C % ST->getWavefrontSize()) == 0;
+          if (waveSizeMultiple) {
+            divisionByWavesizeMultiple = true;
+          }
+        } else if (match(Inst, m_c_And(m_LShr(m_Instruction(Num), m_ConstantInt(C)), m_Value(Mask))) ||
+                   match(Inst, m_c_And(m_AShr(m_Instruction(Num), m_ConstantInt(C)), m_Value(Mask)))) {
+          // AND(SHR(x, C), m) <=> SHL(SHR(x, C + trz(m)), trz(m))
+          // TODO make sure it is correct
+          unsigned numTrailingZeroes = computeKnownBits(Mask, DL).countMinTrailingZeros();
+          bool waveSizeMultiple = (C + numTrailingZeroes) >= ST->getWavefrontSizeLog2();
+          if (waveSizeMultiple) {
+            divisionByWavesizeMultiple = true;
+          }
+        } else if (match(Inst, m_c_And(m_Shl(m_Instruction(Num), m_ConstantInt(C)), m_Value(Mask)))) {
+          // AND(SHL(x, C), m) <=> SHL(SHR(x, trz(m) - C), trz(m)), C <= trz(m)
+          // TODO make sure it is correct
+          unsigned numTrailingZeroes = computeKnownBits(Mask, DL).countMinTrailingZeros();
+          bool waveSizeMultiple = (numTrailingZeroes >= C) && ((numTrailingZeroes - C) >= ST->getWavefrontSizeLog2());
+          if (waveSizeMultiple) {
+            divisionByWavesizeMultiple = true;
+          }
+        } else if (match(Inst, m_c_And(m_Instruction(Num), m_Value(Mask)))) {
+          // AND(x, m) <=> SHL(SHR(x, trz(m)), trz(m))
+          // TODO make sure it is correct
+          // if there are enough trailing zeroes, it means the SHR was big enough to potentially make the
+          // local thread ID expression uniform
+          bool waveSizeMultiple = computeKnownBits(Mask, DL).countMinTrailingZeros() >=
+               ST->getWavefrontSizeLog2();
+          if (waveSizeMultiple) {
+            divisionByWavesizeMultiple = true;
+          }
+        }
+
+        if (divisionByWavesizeMultiple) {
+          const Function *F = Inst->getFunction();
+          const DataLayout &DL = Inst->getModule()->getDataLayout();
+
+          // Two cases here:
+          // 1) we are lucky and know the block size because we got
+          // reqd_workrgoup_size specified.
+          // In that case we can be super precise and use the info to match the
+          // actually required parts (if a block dimension is one, the associated thread index
+          // is always 0, so it doesn't need to be part of the computation)
+          //
+          //
+          // 2) we don't have it
+          // then we can only know the result is uniform if we get a 3D local ID
+          // the code below treats both cases at once
+          bool tidxIsOptional = (ST->getMaxWorkitemID(*F, 0) == 0);
+          bool tidyIsOptional = (ST->getMaxWorkitemID(*F, 1) == 0);
+          bool tidzIsOptional = (ST->getMaxWorkitemID(*F, 2) == 0);
+
+          unsigned tidxMatchCount = 0;
+          unsigned tidyMatchCount = 0;
+          unsigned tidzMatchCount = 0;
+          bool matchingWorked = false;
+          // TODO case where we match threadIdx.y + blockDim.y * threadIdx.z only
+          // (can happen when it is possible to remove threadIdx.x)
+
+          // try to match the pattern recursively
+          matchingWorked = matchLocalThreadIdPattern(
+            0, // current depth
+            0, // mult x
+            0, // mult y
+            DL,
+            Num,
+            ReqdXSize, ReqdYSize, ReqdZSize,
+            tidxMatchCount, tidyMatchCount, tidzMatchCount
+          );
+
+          // we can say it is going to be uniform only if we matched exactly once
+          // each thread index component, or it was optional
+          bool matchedTidxOnceOrOptional = (tidxMatchCount == 1) || (tidxIsOptional && tidxMatchCount == 0);
+          bool matchedTidyOnceOrOptional = (tidyMatchCount == 1) || (tidyIsOptional && tidyMatchCount == 0);
+          bool matchedTidzOnceOrOptional = (tidzMatchCount == 1) || (tidzIsOptional && tidzMatchCount == 0);
+          if (matchingWorked && matchedTidxOnceOrOptional && matchedTidyOnceOrOptional && matchedTidzOnceOrOptional) {
+            return true;
+          }
+        }
+
+        break;
+      }
+      default:
+        break;
+      }
+    }
+  }
+
   const ExtractValueInst *ExtValue = dyn_cast<ExtractValueInst>(V);
   if (!ExtValue)
     return false;
-- 
2.34.1

Test cases I am using (more are needed):

#if !defined(HIP_RTC_COMPILATION)
// needed for standalone compilation only
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#endif

#define KERNEL_FUNC extern "C" __global__

// Test cases for when threadIdx.x, threadIdx.y, threadIdx.z are uniform because of the known block size

// TODO:
//  - cases where some small divisions are needed to make threadIdx.y/C and threadIdx.z/C uniform
__attribute__((reqd_work_group_size(1, 4, 1)))
KERNEL_FUNC void kernel_ltid_x_uniform(const float* a, float* b) {
  // uniform because blockDim.x == 1, so threadIdx.x is always 0
  const uint ltid = threadIdx.x;
  const uint wid = ltid;
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(128, 4, 1)))
KERNEL_FUNC void kernel_ltid_y_uniform(const float* a, float* b) {
  // uniform because blockDim.x (128) is a multiple of the wavefront size
  const uint ltid = threadIdx.y;
  const uint wid = ltid;
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(64, 1, 3)))
KERNEL_FUNC void kernel_ltid_z_uniform(const float* a, float* b) {
  // uniform because blockDim.x * blockDim.y (64) is a multiple of the wavefront size
  const uint ltid = threadIdx.z;
  const uint wid = ltid;
  b[ltid] = a[wid];
}

// Tests for warp ID computation patterns

// TODO:
//  - cases where we can omit threadIdx.x
//  - non power of two block sizes (to not use shifts)
//  - with optional parameters

// With reqd_work_group_size:
//
__attribute__((reqd_work_group_size(128, 1, 1)))
KERNEL_FUNC void kernel_ltid_x(const float* a, float* b) {
  const uint ltid = threadIdx.x;
  const uint wid = ltid / warpSize;
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(256, 1, 1)))
KERNEL_FUNC void kernel_ltid_x_npot_div(const float* a, float* b) {
  const uint ltid = threadIdx.x;
  // somehow, there seems to be a "trunc" operation
  const uint wid = ltid / (3 * warpSize);
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(64, 2, 1)))
KERNEL_FUNC void kernel_ltid_xy(const float* a, float* b) {
  const uint ltid = threadIdx.y * blockDim.x + threadIdx.x;
  const uint wid = ltid / warpSize; // = threadIdx.y when blockDim.x == 64 && warpSize == 64
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(64, 5, 1)))
KERNEL_FUNC void kernel_ltid_xy_npot_div(const float* a, float* b) {
  const uint ltid = threadIdx.y * blockDim.x + threadIdx.x;
  const uint wid = ltid / (3 * warpSize); // = threadIdx.y when blockDim.x == 64 && warpSize == 64
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(32, 32, 2)))
KERNEL_FUNC void kernel_ltid_xyz(const float* a, float* b) {
  const uint ltid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
  const uint wid = ltid / warpSize;
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(32, 32, 2)))
KERNEL_FUNC void kernel_ltid_xyz_expanded(const float* a, float* b) {
  // NOT DETECTED AS UNIFORM
  const uint ltid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
  const uint wid = ltid / warpSize; // == (threadIdx.z * blockDim.y + threadIdx.y)/2  when warpSize == 64
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(32, 32, 2)))
KERNEL_FUNC void kernel_ltid_xyz_npot_div(const float* a, float* b) {
  const uint ltid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
  const uint wid = ltid / (7 * warpSize);
  b[ltid] = a[wid];
}

__attribute__((reqd_work_group_size(32, 32, 2)))
KERNEL_FUNC void kernel_ltid_xyz_npot_div_expanded(const float* a, float* b) {
  // NOT DETECTED AS UNIFORM
  const uint ltid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
  const uint wid = ltid / (7 * warpSize);
  b[ltid] = a[wid];
}

//
// Without reqd_work_group_size:
// (can only optimize for 3D IDs)
//
KERNEL_FUNC void kernel_ltid_xyz_no_reqd_block_size(const float* a, float* b) {
  // NOT DETECTED AS UNIFORM
  const uint ltid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
  const uint wid = ltid / warpSize;
  b[ltid] = a[wid];
}

KERNEL_FUNC void kernel_ltid_xyz_no_reqd_block_size_expanded(const float* a, float* b) {
  // NOT DETECTED AS UNIFORM
  const uint ltid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
  const uint wid = ltid / warpSize;
  b[ltid] = a[wid];
}

KERNEL_FUNC void kernel_ltid_xyz_npot_div_no_reqd_block_size_expanded(const float* a, float* b) {
  const uint ltid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
  const uint wid = ltid / (7 * warpSize);
  b[ltid] = a[wid];
}

KERNEL_FUNC void kernel_ltid_xyz_npot_div_no_reqd_block_size(const float* a, float* b) {
  const uint ltid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
  const uint wid = ltid / (7 * warpSize);
  b[ltid] = a[wid];
}

Let me know what you think. Best regards, Epliz

Epliz commented 1 year ago

While trying to minimize my changes, I actually saw that the propagation of uniformity seems to work; the only thing that doesn't seem to work is pushing uniformity backward, but I guess that is simply not implemented. Consider this example:

    v_lshlrev_b32_e32 v0, 2, v1
    v_readfirstlane_b32 s4, v0

The v_readfirstlane_b32 was inserted because the expression is detected as uniform (threadIdx.y is uniform since the block size is a multiple of the wavefront size). But we could push it before the shift: the shift amount is a constant, so if the result is uniform, the input is uniform too. In general, I guess the rule is that v_readfirstlane can be pushed earlier past any operation that is not a cross-lane operation and does not modify the execution mask.
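For illustration, the pushed-back form being suggested might look roughly like this (sketch only; register numbers are arbitrary):

    v_readfirstlane_b32 s4, v1     ; move the uniform input to an SGPR first
    s_lshl_b32 s4, s4, 2           ; then do the shift as a scalar operation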

nhaehnle commented 1 year ago

This isn't really the place to do a large amount of code review. However, a couple of notes:

Epliz commented 1 year ago

Hi @nhaehnle, @arsenm, do you think it would be possible for you to merge patches 1 and 2? I understand that patch 3 might not be in a great state. I feel it would be much faster if you took them over than if I had to go through all the unfamiliar processes to get there, on my sadly limited free time.

arsenm commented 1 year ago

Please post patches for review to Phabricator. I also have quite limited time, and this is not the appropriate forum for code reviews.