llvm / llvm-project

[LSV] Load Store Vectorizer failed to combine two related loads. #97715

Open cdevadas opened 3 weeks ago

cdevadas commented 3 weeks ago

The LSV pass doesn't combine the loads even though their base address remains the same. There are many instances of this in the AMDGPU codegen lit test folder. For example, in llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll (function local_atomic_fadd_v2bf16_noret), the two loads should have been combined earlier. Instead, they are merged after ISel by the target-specific Load Store Optimizer pass (si-load-store-opt).

jayfoad commented 3 weeks ago

llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll (in function local_atomic_fadd_v2bf16_noret)

I don't see that function in that file. Can you provide a link?

cdevadas commented 3 weeks ago

Looks like I have a slightly older version of the upstream compiler, and this test has since been changed. However, the same test compiled for SelectionDAG still exists, and it reproduces the problem. https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/fp-atomics-gfx940.ll#L364

llvmbot commented 3 weeks ago

@llvm/issue-subscribers-backend-amdgpu

Author: Christudasan Devadasan (cdevadas)

arsenm commented 3 days ago

I think this is just because the vectorizer doesn't handle inserting the bitcasts needed to get the types to match for vectorization:

; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -print-after=load-store-vectorizer < %s

define amdgpu_kernel void @no_vectorize_0(ptr addrspace(3) %ptr, <2 x half> %data) {
  %i1 = atomicrmw fadd ptr addrspace(3) %ptr, <2 x half> %data syncscope("agent") seq_cst, align 4
  ret void
}

define amdgpu_kernel void @no_vectorize_1(i32 %ptr.as.int, <2 x half> %data) {
  %ptr = inttoptr i32 %ptr.as.int to ptr addrspace(3)
  %i1 = atomicrmw fadd ptr addrspace(3) %ptr, <2 x half> %data syncscope("agent") seq_cst, align 4
  ret void
}

define amdgpu_kernel void @does_vectorize(i32 %ptr.as.int, i32 %data.as.int) {
  %ptr = inttoptr i32 %ptr.as.int to ptr addrspace(3)
  %data = bitcast i32 %data.as.int to <2 x half>
  %i1 = atomicrmw fadd ptr addrspace(3) %ptr, <2 x half> %data syncscope("agent") seq_cst, align 4
  ret void
}
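
To make the bitcast point concrete, here is a hand-written sketch (not compiler output) of what the two kernel-argument loads in @no_vectorize_1 might look like, and of the combined form one would hope the vectorizer could produce. It assumes the kernel arguments are lowered to loads from the kernarg segment before the vectorizer runs; the function names, value names, offsets, and alignments are illustrative assumptions, not taken from a real -print-after dump.

```llvm
; Hand-written sketch, not compiler output. Names, offsets and alignments
; are assumptions chosen for illustration.
declare ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()

; Assumed shape before vectorization: two adjacent 32-bit loads with
; different types (i32 and <2 x half>), so they are not combined.
define amdgpu_kernel void @sketch_separate_loads() {
  %kernarg = call ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
  %ptr.as.int = load i32, ptr addrspace(4) %kernarg, align 8
  %data.addr = getelementptr inbounds i8, ptr addrspace(4) %kernarg, i64 4
  %data = load <2 x half>, ptr addrspace(4) %data.addr, align 4
  %ptr = inttoptr i32 %ptr.as.int to ptr addrspace(3)
  %i1 = atomicrmw fadd ptr addrspace(3) %ptr, <2 x half> %data syncscope("agent") seq_cst, align 4
  ret void
}

; Hoped-for shape: one <2 x i32> load, with a bitcast recovering the
; <2 x half> operand from the second lane.
define amdgpu_kernel void @sketch_combined_load() {
  %kernarg = call ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
  %vec = load <2 x i32>, ptr addrspace(4) %kernarg, align 8
  %ptr.as.int = extractelement <2 x i32> %vec, i32 0
  %data.as.int = extractelement <2 x i32> %vec, i32 1
  %data = bitcast i32 %data.as.int to <2 x half>
  %ptr = inttoptr i32 %ptr.as.int to ptr addrspace(3)
  %i1 = atomicrmw fadd ptr addrspace(3) %ptr, <2 x half> %data syncscope("agent") seq_cst, align 4
  ret void
}
```

The second function has the same structure as @does_vectorize above, where the bitcast is already written out in the source and both arguments are i32.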