llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.9k stars 11.51k forks source link

NVPTX: Cannot select: = NVPTXISD::LoadParam<(load (s24), align 4)> #55764

Closed kjetilkjeka closed 2 years ago

kjetilkjeka commented 2 years ago

I'm currently working on improving nvptx support in rust. The rust compiler produces llvm-ir with struct return values bitcast into integer values. This is fine for most targets, but for the triple nvptx64-nvidia-cuda llc cannot select for these types. A description going deeper into the rust side of things can be found here I will try to keep the rest of this issue focused only on the llvm side.

The problem from the LLVM side of things is that well formed llvm-ir cannot be compiled into ptx. I'm not sure if it is coincidental that clang always uses the actual struct as the return type, or if it is because it is better to work around this bug than to fix it. I do think that it will be a ugly hack in rustc to special case this for the nvptx target, and is thus willing to put a little amount of work into helping out fixing this bug if it is a realistic option.

I do realize this is only tested on a rather old version of llvm. I looked at your issue tracker and do not believe it is fixed. I apologize if this is a duplicate or have already been fixed.

Simplified example code

I believe this is a LLVM backend bug, and therefore the most interesting source is perhaps a simplified .ll example. The following is generated with bugpoint:

; ModuleID = 'bugpoint-reduced-simplified.bc'
source_filename = "main_rs.99b55f54-cgu.0"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

%Foo.0.1.2.3.4.5 = type { i8, i8, i8 }

declare dso_local i24 @_ZN7main_rs6device17h13d51938dd622c57E(i8) unnamed_addr #0

define void @kernel(i8* %input, %Foo.0.1.2.3.4.5* %output) unnamed_addr #0 {
start:
  %0 = call i24 @_ZN7main_rs6device17h13d51938dd622c57E(i8 undef) #2
  unreachable
}

; Function Attrs: argmemonly nofree nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #1

attributes #0 = { "target-cpu"="sm_30" }
attributes #1 = { argmemonly nofree nounwind willreturn }
attributes #2 = { nounwind }

!llvm.module.flags = !{!0}

!0 = !{i32 7, !"PIC Level", i32 2}

Backtrace

When attempting to compile this with llc-14 from the Ubuntu 22.04 apt repo, the following is returned:

LLVM ERROR: Cannot select: 0xbebd58: i32,ch,glue = NVPTXISD::LoadParam<(load (s24), align 4)> 0xbebcf0, Constant:i32<1>, Constant:i32<0>, 0xbebcf0:1
  0xbeba80: i32 = Constant<1>
  0xbeb8e0: i32 = Constant<0>
  0xbebcf0: ch,glue = NVPTXISD::CallArgEnd 0xbebc88, Constant:i32<1>, 0xbebc88:1
    0xbeba80: i32 = Constant<1>
    0xbebc88: ch,glue = NVPTXISD::LastCallArg 0xbebc20, Constant:i32<1>, Constant:i32<0>, 0xbebc20:1
      0xbeba80: i32 = Constant<1>
      0xbeb8e0: i32 = Constant<0>
      0xbebc20: ch,glue = NVPTXISD::CallArgBegin 0xbebbb8, 0xbebbb8:1
        0xbebbb8: ch,glue = NVPTXISD::CallVoid 0xbebb50, 0xbeb740, 0xbebb50:1
          0xbeb740: i64 = NVPTXISD::Wrapper TargetGlobalAddress:i64<i24 (i8)* @_ZN7main_rs6device17h13d51938dd622c57E> 0
            0xbeb670: i64 = TargetGlobalAddress<i24 (i8)* @_ZN7main_rs6device17h13d51938dd622c57E> 0
          0xbebb50: ch,glue = NVPTXISD::PrintCallUni 0xbebae8, Constant:i32<1>, 0xbebae8:1
            0xbeba80: i32 = Constant<1>
            0xbebae8: ch,glue = NVPTXISD::DeclareRet 0xbeba18, Constant:i32<1>, Constant:i32<32>, Constant:i32<0>, 0xbeba18:1
              0xbeba80: i32 = Constant<1>
              0xbeb948: i32 = Constant<32>
              0xbeb8e0: i32 = Constant<0>
              0xbeba18: ch,glue = NVPTXISD::StoreParam<(store (s32))> 0xbeb9b0, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, 0xbeb9b0:1
                0xbeb8e0: i32 = Constant<0>
                0xbeb8e0: i32 = Constant<0>
                0xbeb8e0: i32 = Constant<0>
                0xbeb9b0: ch,glue = NVPTXISD::DeclareScalarParam 0xbeb878, Constant:i32<0>, Constant:i32<32>, Constant:i32<0>, 0xbeb878:1
                  0xbeb8e0: i32 = Constant<0>
                  0xbeb948: i32 = Constant<32>
                  0xbeb8e0: i32 = Constant<0>
                  0xbeb878: ch,glue = callseq_start 0xbb7bb8, TargetConstant:i64<0>, TargetConstant:i64<0>

In function: kernel
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: llc-14 test.ll
1.      Running pass 'Function Pass Manager' on module 'test.ll'.
2.      Running pass 'NVPTX DAG->DAG Pattern Instruction Selection' on function '@kernel'
 #0 0x00007fce2350ed01 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0xe3fd01)
 #1 0x00007fce2350ca3e llvm::sys::RunSignalHandlers() (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0xe3da3e)
 #2 0x00007fce2350f236 (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0xe40236)
 #3 0x00007fce221b6520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007fce2220aa7c __pthread_kill_implementation ./nptl/./nptl/pthread_kill.c:44:76
 #5 0x00007fce2220aa7c __pthread_kill_internal ./nptl/./nptl/pthread_kill.c:78:10
 #6 0x00007fce2220aa7c pthread_kill ./nptl/./nptl/pthread_kill.c:89:10
 #7 0x00007fce221b6476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #8 0x00007fce2219c7f3 abort ./stdlib/./stdlib/abort.c:81:7
 #9 0x00007fce23449723 (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0xd7a723)
#10 0x00007fce23d2fe61 (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0x1660e61)
#11 0x00007fce23d2f308 llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int) (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0x1660308)
#12 0x00007fce23d27eff llvm::SelectionDAGISel::DoInstructionSelection() (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0x1658eff)
#13 0x00007fce23d275c4 llvm::SelectionDAGISel::CodeGenAndEmitDAG() (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0x16585c4)
#14 0x00007fce23d269f8 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0x16579f8)
#15 0x00007fce23d24587 llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0x1655587)
#16 0x00007fce238952be llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0x11c62be)
#17 0x00007fce236493c0 llvm::FPPassManager::runOnFunction(llvm::Function&) (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0xf7a3c0)
#18 0x00007fce236509b3 llvm::FPPassManager::runOnModule(llvm::Module&) (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0xf819b3)
#19 0x00007fce23649f66 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/lib/x86_64-linux-gnu/libLLVM-14.so.1+0xf7af66)
#20 0x0000000000410223 main (/usr/lib/llvm-14/bin/llc+0x410223)
#21 0x00007fce2219dd90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#22 0x00007fce2219de40 call_init ./csu/../csu/libc-start.c:128:20
#23 0x00007fce2219de40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#24 0x000000000040b065 _start (/usr/lib/llvm-14/bin/llc+0x40b065)
Aborted (core dumped)

Meta

My exact version of llc-14 is:

$ llc-14 --version
Ubuntu LLVM version 14.0.0
Artem-B commented 2 years ago

I guess we should've promoted it up to the next larger power-of-2 sized integer.

kjetilkjeka commented 2 years ago

This seems reasonable, it should work fine for the rustc issue! Let me know if there's anything I can do to help out.

Edit: Removed a concern that was based on a misunderstanding

Artem-B commented 2 years ago

Where did s24 come from? Is that coalesced {i8, i8, i8} ? If you were to pass the aggregate directly, it would work fine, though the copy would probably be done field-by-field.

kjetilkjeka commented 2 years ago

Rustc turns aggregates smaller than the register size into an integer that can be placed in a single register when passing into or returning from a function. This optimization is only made when using the native rust abi, not the c or ptx-kernel abi. I discovered the bug when working on an image of Rgb<u8> pixels with the following generic type

struct Rgb<T> {
    r: T,
    g: T,
    b: T,
}

This bitcast from {i8, i8, i8} into i24 before returning the value is done as an optimization. A comment in the rustc src states that "using a llvm aggregate type leads to bad optimization, so we pick an appropriately sized integer type instead.". But I'm not aware of any details other than that.

A temporary workaround is to use the C (extern "C" fn) function abi. This can not unfortunately not be done in all cases and may lead to worse performance in some cases where it is possible to use.

kjetilkjeka commented 2 years ago

After looking a bit more into this today I have a couple of questions regarding the following (redacted) Rust code that is generated into the following (redacted) llvm-ir.

Rust

#[repr(C)]
#[derive(Clone, Copy)]
pub struct Foo {
    a: u8,
    b: u8,
    c: u8,
}

#[inline(never)]
fn device(v: Foo) -> Foo {
    v
}

LLVM-IR

%Foo = type { i8, i8, i8 }

; main_rs::device
; Function Attrs: noinline nounwind
define internal i24 @_ZN7main_rs6device17h534b9a1ae91055fbE(i24 %0) unnamed_addr #1 {
start:
  %1 = alloca %Foo, align 1
  %2 = alloca i24, align 4
  %v = alloca %Foo, align 1
  store i24 %0, i24* %2, align 4
  %3 = bitcast %Foo* %v to i8*
  %4 = bitcast i24* %2 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %3, i8* align 4 %4, i64 3, i1 false)
  %5 = bitcast %Foo* %1 to i8*
  %6 = bitcast %Foo* %v to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %5, i8* align 1 %6, i64 3, i1 false)
  %7 = bitcast %Foo* %1 to i24*
  %8 = load i24, i24* %7, align 1
  ret i24 %8
}

Copying between Foo and i24/i32

Rustc generates llvm-ir which uses the type Foo as the llvm-ir type %Foo = type { i8, i8, i8 } and as the llvm-ir type i24. Based on my observations, it seems like it is only bitcast into i24 when returning from or passing into functions.

If I understand correctly the reason this works is that bitcast is only being done on the pointers and not the aggregate type itself. The Foo and i24 is being converted between by doing a memcpy instead. I see that the pointer bitcast would still be compatible if we promoted i24 to i32.

My questions is: Are there any problems associated with only copying 3 bytes into the i32 that will be returned and leaving 8bits as undefined?

What (exactly) will need to be done.

I had a look in llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp but I couldn't figure out exactly how i24 is being lowered to ptx in general. My impression, mostly from looking at the methods and reading some code, is that it happens for all instructions individually. If this is correct then I assume some code needs to be appended to all the relevant methods (like LowerLOAD, LowerSTORE, LowerReturn, etc) and promote types to next level of power-of-two.

Is this understanding correct? Are you (@Artem-B) actively working on it or would it be beneficial if I took a stab at it?

Artem-B commented 2 years ago

Are you (@Artem-B) actively working on it or would it be beneficial if I took a stab at it?

It's not at the top of my todo list and I don't know when I'll get to it, so feel free to poke at it.

how i24 is being lowered to ptx in general.

It's not, which is the reason for the crash you've reported. Considering that PTX does not have 24-bit load/store variant, if you insist on operating on i24 types, it would have to be split into smaller loads/stores. E.g. a triple of i8 ops, or a pair of i8/i16 or i16/i8, depending on alignment and some glue code to split/merge those bits to/from i24. That would be relatively slow.

Performance-wise it would probably be better not to use i24 and, if you do need to pack into integers, pack into power-of-2 sized ones that can be lowered into efficient code.

I'm not quite sure what drives this hand-packing of aggregates on IR level. Is that part of the Rust ABI? Normally it's target-dependent. E.g. on x64 small aggregates are passed in registers, power-of-2 sized, at that. Something like that may make sense for the rust-generated code for NVPTX.

kjetilkjeka commented 2 years ago

I had the chance to look a bit more into this today and due to being unfamiliar with the LLVM structure I hope to get my perceived understanding confirmed and maybe also some pointers in the right direction.

It seems to me like the Legalizer exists partially for turning non-supported llvm types into supported ones already on the LLVM-IR level (before lowering). It also seems like there are no implementation of class NVPTXLegalizerInfo : public LegalizerInfo which is the reason the Cannot select: occurs for nvptx64-nvidia-cuda, where other architectures with a Legalizer (like x86_64) will have the i24 turned into something else. Is this understanding correct?

Is the solution to the problem to add the Legalizer for NVPTX target? Are there any problems by doing this? A chance of breaking existing ptx code or similar?

The implementation for Legalizer seems quite different for different targets. X86 seems like something I expected, AMDGPU seems surprisingly long and detail oriented, while RISC-V gets away with a single call getLegacyLegalizerInfo().computeTables();. At the same time it is mentioned in the header that these files should be generated by TableGen and the word Legacy occurs a lot when reading the code of existing Legalizer. What is the best approach to finding out exactly what the NVPTXLegalizerInfo should look like?

Artem-B commented 2 years ago

Legalizer does sound like the right place to handle this.

A lot of things in LLVM grow 'organically' so the new features/mechanisms are not always automatically propagated to all back-ends. In general, small reasonable changes (as in both understandable and sensible for the issue at hand) are the way to go. As long as they move LLVM in the right direction (incremental steps are fine), have tests, and don't break anyone upstreaming them should be pretty straightforward.

kjetilkjeka commented 2 years ago

I'm struggling a bit with making progress. It seems like in the NVPTXTargetLowering constructor it is possible to register combination of operations and registers as Legal, Custom, Promote, etc to get the desired behavior using targetLoweringBase::setOperationAction. The problem is that the function only accepts MVT while, if I understand correctly, types like i24 is an EVT.

Do you have a pointer in the right direction of where the corresponding EVT behavior can be found?

Artem-B commented 2 years ago

I don't really know. If the standard lowering machinery can't be used with odd-sized integers, I'd try checking what happens if the same code gets compiled by a back-end that can deal with it. Track what they do differently and that may provide enough of a hint on how to do it in NVPTX.

It may be easier to consider changing the IR generator so it does not use the types NVPTX can't handle.

nikic commented 2 years ago

Simpler reproducer:

target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

define i24 @kernel() {
  ret i24 0
}

LegalizeInfo is a GlobalISel concept that should not be relevant for NVPTX.

From a cursory look, this seems to be a simple bug in LowerReturn: https://github.com/llvm/llvm-project/blob/8c278a27811ccf9d7a32c0a460b08069c4b3b7b5/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp#L2644

I think this should check getTypeSizeInBits() rather than getAllocSizeInBits(). For i24 the type size is 24, but the alloc size is 32.

Then an i24 return value will be handled by zero-extending to i32.

kjetilkjeka commented 2 years ago

Thanks for looking into this @nikic

I don't think it is entirely that simple. rustc promotes everything with size less than a pointer size to immediates. This means that the same must also be done for i40, i48 and i56. I also assume that a solution should preferably also work for non-divisible-by-eight bit widths?

I just found about the GlobalISel and that it is not used for NVPTX and that it is instead the NVPTXISelLowering I need to look at. I think it is correct that I need to do something similar to what you link from LowerReturn but I also think that it needs to be done for LowerCall and maybe also LowerFormalArguments.

I believe I can learn something from looking at ComputePTXValueVTs as it turns i128 into (i64, i64) for the mentioned functions. When attempting something similar I break a lot of tests. I assume this is due to transforming between types of different size and that I also need to do some bit extension of types. It's not possible to do this inside ComputePTXValueVTs as I do not have access to the DAG. I'm then not sure how it should interact with the ComputePTXValueVTs but my working hypothesis right now is that it needs to run directly after?

The status is more or less that I have managed to get the simple examples to compile correctly. But I am breaking tests and are trying to learn what I can from these. I'm slowly learning about how LLVM and NVPTX lowering works and I'm very happy for ideas and pointers like the one above.

kjetilkjeka commented 2 years ago

I made some progress today and believe I'm almost done!

The type promotion seems to work and can be viewed here I've added some tests here

I've not yet ran the formating tools and friends (and will of course do that before submitting a patch) but if there are any problems to the general structure please let me know so I can fix them.

The final problem is that b48 and b56 is not turned into b64 in function signatures and I've started working on that here. But I'm unfortunately missing this promotion a few more places as my test output from the i48 test is.

; CHECK: .visible .func (.param .b64 func_retval0) callee
         ^
<stdin>:1:1: note: scanning from here
//
^
<stdin>:11:1: note: possible intended match here
.visible .func (.param .b48 func_retval0) callee(
^

Input file: <stdin>
Check file: <path>/llvm-project/llvm/test/CodeGen/NVPTX/i48-param.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
           1: // 
check:6'0     X~~ error: no match found
           2: // Generated by LLVM NVPTX Back-End 
check:6'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           3: // 
check:6'0     ~~~
           4:  
check:6'0     ~
           5: .version 3.2 
check:6'0     ~~~~~~~~~~~~~
           6: .target sm_20 
check:6'0     ~~~~~~~~~~~~~~
           7: .address_size 32 
check:6'0     ~~~~~~~~~~~~~~~~~
           8:  
check:6'0     ~
           9:  // .globl callee // -- Begin function callee 
check:6'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          10:  // @callee 
check:6'0     ~~~~~~~~~~~~
          11: .visible .func (.param .b48 func_retval0) callee( 
check:6'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:6'1     ?                                                  possible intended match
          12:  .param .b48 callee_param_0 
check:6'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          13: ) 
check:6'0     ~~
          14: { 
check:6'0     ~~
          15:  .reg .b64 %rd<5>; 
check:6'0     ~~~~~~~~~~~~~~~~~~~
          16:  
check:6'0     ~
kjetilkjeka commented 2 years ago

I submitted a patch on phabricator today. https://reviews.llvm.org/D129291

kjetilkjeka commented 2 years ago

Closing as the fix have been merged