Closed kjetilkjeka closed 2 years ago
I guess we should've promoted it up to the next larger power-of-2 sized integer.
This seems reasonable, it should work fine for the rustc issue! Let me know if there's anything I can do to help out.
Edit: Removed a concern that was based on a misunderstanding
Where did s24 come from? Is that coalesced {i8, i8, i8}
? If you were to pass the aggregate directly, it would work fine, though the copy would probably be done field-by-field.
Rustc turns aggregates smaller than the register size into an integer that can be placed in a single register when passing into or returning from a function. This optimization is only made when using the native rust abi, not the c or ptx-kernel abi. I discovered the bug when working on an image of Rgb<u8>
pixels with the following generic type
struct Rgb<T> {
r: T,
g: T,
b: T,
}
This bitcast from {i8, i8, i8}
into i24
before returning the value is done as an optimization. A comment in the rustc src states that "using a llvm aggregate type leads to bad optimization, so we pick an appropriately sized integer type instead.". But I'm not aware of any details other than that.
A temporary workaround is to use the C (extern "C" fn
) function abi. This can not unfortunately not be done in all cases and may lead to worse performance in some cases where it is possible to use.
After looking a bit more into this today I have a couple of questions regarding the following (redacted) Rust code that is generated into the following (redacted) llvm-ir.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Foo {
a: u8,
b: u8,
c: u8,
}
#[inline(never)]
fn device(v: Foo) -> Foo {
v
}
%Foo = type { i8, i8, i8 }
; main_rs::device
; Function Attrs: noinline nounwind
define internal i24 @_ZN7main_rs6device17h534b9a1ae91055fbE(i24 %0) unnamed_addr #1 {
start:
%1 = alloca %Foo, align 1
%2 = alloca i24, align 4
%v = alloca %Foo, align 1
store i24 %0, i24* %2, align 4
%3 = bitcast %Foo* %v to i8*
%4 = bitcast i24* %2 to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %3, i8* align 4 %4, i64 3, i1 false)
%5 = bitcast %Foo* %1 to i8*
%6 = bitcast %Foo* %v to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %5, i8* align 1 %6, i64 3, i1 false)
%7 = bitcast %Foo* %1 to i24*
%8 = load i24, i24* %7, align 1
ret i24 %8
}
Foo
and i24
/i32
Rustc generates llvm-ir which uses the type Foo
as the llvm-ir type %Foo = type { i8, i8, i8 }
and as the llvm-ir type i24
. Based on my observations, it seems like it is only bitcast into i24
when returning from or passing into functions.
If I understand correctly the reason this works is that bitcast
is only being done on the pointers and not the aggregate type itself. The Foo
and i24
is being converted between by doing a memcpy instead. I see that the pointer bitcast
would still be compatible if we promoted i24
to i32
.
My questions is: Are there any problems associated with only copying 3 bytes into the i32
that will be returned and leaving 8bits as undefined?
I had a look in llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
but I couldn't figure out exactly how i24
is being lowered to ptx in general. My impression, mostly from looking at the methods and reading some code, is that it happens for all instructions individually. If this is correct then I assume some code needs to be appended to all the relevant methods (like LowerLOAD
, LowerSTORE
, LowerReturn
, etc) and promote types to next level of power-of-two.
Is this understanding correct? Are you (@Artem-B) actively working on it or would it be beneficial if I took a stab at it?
Are you (@Artem-B) actively working on it or would it be beneficial if I took a stab at it?
It's not at the top of my todo list and I don't know when I'll get to it, so feel free to poke at it.
how i24 is being lowered to ptx in general.
It's not, which is the reason for the crash you've reported. Considering that PTX does not have 24-bit load/store variant, if you insist on operating on i24 types, it would have to be split into smaller loads/stores. E.g. a triple of i8 ops, or a pair of i8/i16 or i16/i8, depending on alignment and some glue code to split/merge those bits to/from i24. That would be relatively slow.
Performance-wise it would probably be better not to use i24 and, if you do need to pack into integers, pack into power-of-2 sized ones that can be lowered into efficient code.
I'm not quite sure what drives this hand-packing of aggregates on IR level. Is that part of the Rust ABI? Normally it's target-dependent. E.g. on x64 small aggregates are passed in registers, power-of-2 sized, at that. Something like that may make sense for the rust-generated code for NVPTX.
I had the chance to look a bit more into this today and due to being unfamiliar with the LLVM structure I hope to get my perceived understanding confirmed and maybe also some pointers in the right direction.
It seems to me like the Legalizer
exists partially for turning non-supported llvm types into supported ones already on the LLVM-IR level (before lowering). It also seems like there are no implementation of class NVPTXLegalizerInfo : public LegalizerInfo
which is the reason the Cannot select:
occurs for nvptx64-nvidia-cuda
, where other architectures with a Legalizer (like x86_64) will have the i24
turned into something else. Is this understanding correct?
Is the solution to the problem to add the Legalizer
for NVPTX target? Are there any problems by doing this? A chance of breaking existing ptx code or similar?
The implementation for Legalizer
seems quite different for different targets. X86
seems like something I expected, AMDGPU
seems surprisingly long and detail oriented, while RISC-V gets away with a single call getLegacyLegalizerInfo().computeTables();
. At the same time it is mentioned in the header that these files should be generated by TableGen
and the word Legacy
occurs a lot when reading the code of existing Legalizer
. What is the best approach to finding out exactly what the NVPTXLegalizerInfo
should look like?
Legalizer does sound like the right place to handle this.
A lot of things in LLVM grow 'organically' so the new features/mechanisms are not always automatically propagated to all back-ends. In general, small reasonable changes (as in both understandable and sensible for the issue at hand) are the way to go. As long as they move LLVM in the right direction (incremental steps are fine), have tests, and don't break anyone upstreaming them should be pretty straightforward.
I'm struggling a bit with making progress. It seems like in the NVPTXTargetLowering
constructor it is possible to register combination of operations and registers as Legal
, Custom
, Promote
, etc to get the desired behavior using targetLoweringBase::setOperationAction
. The problem is that the function only accepts MVT
while, if I understand correctly, types like i24
is an EVT
.
Do you have a pointer in the right direction of where the corresponding EVT
behavior can be found?
I don't really know. If the standard lowering machinery can't be used with odd-sized integers, I'd try checking what happens if the same code gets compiled by a back-end that can deal with it. Track what they do differently and that may provide enough of a hint on how to do it in NVPTX.
It may be easier to consider changing the IR generator so it does not use the types NVPTX can't handle.
Simpler reproducer:
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
define i24 @kernel() {
ret i24 0
}
LegalizeInfo is a GlobalISel concept that should not be relevant for NVPTX.
From a cursory look, this seems to be a simple bug in LowerReturn: https://github.com/llvm/llvm-project/blob/8c278a27811ccf9d7a32c0a460b08069c4b3b7b5/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp#L2644
I think this should check getTypeSizeInBits() rather than getAllocSizeInBits(). For i24
the type size is 24, but the alloc size is 32.
Then an i24
return value will be handled by zero-extending to i32
.
Thanks for looking into this @nikic
I don't think it is entirely that simple. rustc promotes everything with size less than a pointer size to immediates. This means that the same must also be done for i40
, i48
and i56
. I also assume that a solution should preferably also work for non-divisible-by-eight bit widths?
I just found about the GlobalISel and that it is not used for NVPTX and that it is instead the NVPTXISelLowering
I need to look at. I think it is correct that I need to do something similar to what you link from LowerReturn
but I also think that it needs to be done for LowerCall
and maybe also LowerFormalArguments
.
I believe I can learn something from looking at ComputePTXValueVTs
as it turns i128
into (i64, i64)
for the mentioned functions. When attempting something similar I break a lot of tests. I assume this is due to transforming between types of different size and that I also need to do some bit extension of types. It's not possible to do this inside ComputePTXValueVTs
as I do not have access to the DAG
. I'm then not sure how it should interact with the ComputePTXValueVTs
but my working hypothesis right now is that it needs to run directly after?
The status is more or less that I have managed to get the simple examples to compile correctly. But I am breaking tests and are trying to learn what I can from these. I'm slowly learning about how LLVM and NVPTX lowering works and I'm very happy for ideas and pointers like the one above.
I made some progress today and believe I'm almost done!
The type promotion seems to work and can be viewed here I've added some tests here
I've not yet ran the formating tools and friends (and will of course do that before submitting a patch) but if there are any problems to the general structure please let me know so I can fix them.
The final problem is that b48
and b56
is not turned into b64
in function signatures and I've started working on that here. But I'm unfortunately missing this promotion a few more places as my test output from the i48
test is.
; CHECK: .visible .func (.param .b64 func_retval0) callee
^
<stdin>:1:1: note: scanning from here
//
^
<stdin>:11:1: note: possible intended match here
.visible .func (.param .b48 func_retval0) callee(
^
Input file: <stdin>
Check file: <path>/llvm-project/llvm/test/CodeGen/NVPTX/i48-param.ll
-dump-input=help explains the following input dump.
Input was:
<<<<<<
1: //
check:6'0 X~~ error: no match found
2: // Generated by LLVM NVPTX Back-End
check:6'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3: //
check:6'0 ~~~
4:
check:6'0 ~
5: .version 3.2
check:6'0 ~~~~~~~~~~~~~
6: .target sm_20
check:6'0 ~~~~~~~~~~~~~~
7: .address_size 32
check:6'0 ~~~~~~~~~~~~~~~~~
8:
check:6'0 ~
9: // .globl callee // -- Begin function callee
check:6'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10: // @callee
check:6'0 ~~~~~~~~~~~~
11: .visible .func (.param .b48 func_retval0) callee(
check:6'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:6'1 ? possible intended match
12: .param .b48 callee_param_0
check:6'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13: )
check:6'0 ~~
14: {
check:6'0 ~~
15: .reg .b64 %rd<5>;
check:6'0 ~~~~~~~~~~~~~~~~~~~
16:
check:6'0 ~
I submitted a patch on phabricator today. https://reviews.llvm.org/D129291
Closing as the fix have been merged
I'm currently working on improving nvptx support in rust. The rust compiler produces llvm-ir with struct return values bitcast into integer values. This is fine for most targets, but for the triple
nvptx64-nvidia-cuda
llc
cannot select for these types. A description going deeper into the rust side of things can be found here I will try to keep the rest of this issue focused only on the llvm side.The problem from the LLVM side of things is that well formed llvm-ir cannot be compiled into ptx. I'm not sure if it is coincidental that clang always uses the actual struct as the return type, or if it is because it is better to work around this bug than to fix it. I do think that it will be a ugly hack in rustc to special case this for the nvptx target, and is thus willing to put a little amount of work into helping out fixing this bug if it is a realistic option.
I do realize this is only tested on a rather old version of llvm. I looked at your issue tracker and do not believe it is fixed. I apologize if this is a duplicate or have already been fixed.
Simplified example code
I believe this is a LLVM backend bug, and therefore the most interesting source is perhaps a simplified .ll example. The following is generated with bugpoint:
Backtrace
When attempting to compile this with
llc-14
from the Ubuntu 22.04 apt repo, the following is returned:Meta
My exact version of
llc-14
is: