[NVPTX] st.global.u32 become st.u32 after update from llvm10 to llvm15

python3kgae commented 2 years ago

For llvm ir like this

target triple = "nvptx64-nvidia-cuda"

; Function Attrs: nounwind
define void @foo(i64* nocapture readonly byval(i64) %0, i64* nocapture readonly byval(i64) %itop) local_unnamed_addr #0 {
entry:
  %1 = bitcast i64* %0 to float**
  %2 = load float*, float** %1, align 8
  store float 1.000000e+00, float* %2, align 4
  %3 = load i64, i64* %itop, align 8
  %4 = inttoptr i64 %3 to float*
  store float 2.000000e+00, float* %4, align 4  
  ret void
}

attributes #0 = { nounwind }

!llvm.linker.options = !{!0, !1, !2, !3, !4}
!llvm.module.flags = !{!5, !6}
!llvm.ident = !{!7}
!nvvmir.version = !{!8}
!nvvm.annotations = !{!9, !10, !9, !11, !11, !11, !11, !12, !12, !11, !13, !14, !15}

!0 = !{!"/FAILIFMISMATCH:\22_MSC_VER=1900\22"}
!1 = !{!"/FAILIFMISMATCH:\22_ITERATOR_DEBUG_LEVEL=0\22"}
!2 = !{!"/FAILIFMISMATCH:\22RuntimeLibrary=MT_StaticRelease\22"}
!3 = !{!"/DEFAULTLIB:libcpmt.lib"}
!4 = !{!"/FAILIFMISMATCH:\22_CRT_STDIO_ISO_WIDE_SPECIFIERS=0\22"}
!5 = !{i32 1, !"wchar_size", i32 2}
!6 = !{i32 7, !"PIC Level", i32 2}
!7 = !{!"clang version 10.0.0 "}
!8 = !{i32 1, i32 4}
!9 = !{null, !"align", i32 8}
!10 = !{null, !"align", i32 8, !"align", i32 65544, !"align", i32 131080}
!11 = !{null, !"align", i32 16}
!12 = !{null, !"align", i32 16, !"align", i32 65552, !"align", i32 131088}
!13 = !{void (i64*, i64*)* @foo, !"kernel", i32 1}
!14 = !{void (i64*, i64*)* @foo, !"maxntidx", i32 1}
!15 = !{void (i64*, i64*)* @foo, !"minctasm", i32 2}

Instruction combine in llvm10 will transform

  %3 = load i64, i64* %itop, align 8
  %4 = inttoptr i64 %3 to float*
  store float 2.000000e+00, float* %4, align 4

into

  %3 = bitcast i64* %itop to float**
  %4 = load float*, float** %3, align 8
  store float 2.000000e+00, float* %4, align 4

And the final output ptx will use st.global.u32 for the store.

After update to llvm15, instruction combine will not do the transform anymore. As a result, final output ptx will use st.u32 for the store.

Is this change expected?

Thanks Xiang

Artem-B commented 2 years ago

AFAICT LLVM 10.x and @HEAD produces the same code for your example: https://godbolt.org/z/resb7zbnf Neither seems to do the inttoptr -> bitcast conversion you're suggesting were done previously. Perhaps there are some differences in your compilation pipeline that may have affected which passes were run?

If you can reproduce the issue on compiler explorer that would help a lot.

python3kgae commented 2 years ago

Forget to mention -O3 :( Here's the compiler explorer link https://godbolt.org/z/jKxafYr5E You can see opt get a different result. And this is the link for the opt result to LLC https://godbolt.org/z/PhTGWMf1s.

Thanks Xiang

Artem-B commented 2 years ago

Thank you. Whatever affects this example has started with LLVM-12. The IR transform does not appear to be target-specific (e.g. opt -mtriple=aarch64--) is affected the same way.

I believe that it's LLVM-10 that was wrong and your example does not exactly do apples-to-apples comparison.

define void @foo(i64 nocapture readonly byval(i64) %0, i64 nocapture readonly byval(i64) %itop) local_unnamed_addr #0 { entry: %1 = bitcast i64* %0 to float**

Here we do know that we're storing to a poionter that was given to the kernel as a parameter. By convention all kernel input pointers are assumed to be global, so LLVM infers the correct AS and lowers the store to st.global.

%2 = load float*, float* %1, align 8 store float 1.000000e+00, float %2, align 4

%3 = load i64, i64* %itop, align 8

Here you're passing a pointer to an integer and, once that integer is loaded that's what probably breaks the address space inference.

%4 = inttoptr i64 %3 to float store float 2.000000e+00, float %4, align 4
ret void }

I do not know what exactly prevents transforming load i64, inttoptr -> load i64*, bitcast i64*, float*, but it does the right thing here, IMO.

While in this case it would happen to be OK, I do not think that would be correct assumption in general.

If you do want compiler to generate AS-specific loads/stores, you do need to give it correct information to make it happen. You can do it explicitly via addrspacecast or implicitly by passing it as a pointer to a kernel.

In short, I think LLVM is working as intended now.

python3kgae commented 2 years ago

I see. Thanks to make it clear.

llvm / llvm-project

[NVPTX] st.global.u32 become st.u32 after update from llvm10 to llvm15 #57188