Incorrect DXIL bitcasts generated for bool matrices in ray payloads

jasilvanus commented 10 months ago

Description For bool matrices in ray payloads, DXC generates arrays of <n x i1> vectors and assumes an incorrect bit layout for these vectors, casting the <n x i1> vectors to <n x i32> vectors which use a different bit layout. Bit vectors are always packed and do not respect element alignment, so <n x i1> uses n bits, whereas <n x i32> uses 32n bits.

Steps to Reproduce HLSL source:

struct MyPayload
{
    bool1x2 input;
    int     output;
};

[shader("callable")]
void OuterCallable(inout MyPayload payload)
{
   payload.output = payload.input[0][1];
}

Actual Behavior

Compiling with dxc -T lib_6_6 yields:

target datalayout = "e-m:e-p:32:32-i1:32-i8:32-i16:32-i32:32-i64:64-f16:32-f32:32-f64:64-n8:16:32:64"
target triple = "dxil-ms-dx"

%struct.MyPayload = type { %class.matrix.bool.1.2, i32 }
%class.matrix.bool.1.2 = type { [1 x <2 x i1>] }

; Function Attrs: nounwind
define void @"\01?OuterCallable@@YAXUMyPayload@@@Z"(%struct.MyPayload* noalias nocapture %payload) #0 {
  %1 = getelementptr inbounds %struct.MyPayload, %struct.MyPayload* %payload, i32 0, i32 0
  %2 = bitcast %class.matrix.bool.1.2* %1 to <2 x i32>*
  %3 = load <2 x i32>, <2 x i32>* %2, align 4
  %4 = extractelement <2 x i32> %3, i32 1
  %5 = icmp ne i32 %4, 0
  %6 = zext i1 %5 to i32
  %7 = getelementptr inbounds %struct.MyPayload, %struct.MyPayload* %payload, i32 0, i32 1
  store i32 %6, i32* %7, align 4, !tbaa !13
  ret void
}

The problematic sequence is the following:

  %1 = getelementptr inbounds %struct.MyPayload, %struct.MyPayload* %payload, i32 0, i32 0
  %2 = bitcast %class.matrix.bool.1.2* %1 to <2 x i32>*
  %3 = load <2 x i32>, <2 x i32>* %2, align 4

This assumes 32-bit alignment of the i1 elements within the <2 x i1> vector. The data layout does specify an alignment of 32 bits for i1, but this is irrelevant for elements of vectors, which are always bit-packed, see https://llvm.org/docs/LangRef.html#t-vector.

Note that HLSL bool vectors (instead of matrices) in payloads are replaced by i32 arrays, which avoids this issue. Maybe the same should be done for matrices?

Environment

DXC version 1.8(dev;4371-fc0ecb83)
Host Operating System Ubuntu 22.04

llvm-beanz commented 10 months ago

The size of a bool in LLVM is language and target-defined. In HLSL bool is 32-bits. This is for legacy reasons, but as a result it cannot be bit-packed.

We don't have great documentation on this in the places you might expect, but it is documented in the Buffer Packing docs.

jasilvanus commented 10 months ago

The issue is not that bools are represented using 32 bits (which is fine!), but rather that DXC assumes an incorrect bit layout of the <2x i1> vector. Vectors are bit-packed in LLVM, so the two i1 in <2 x i1> use a single bit only, but the pointer bitcast to <2 x i32>* and subsequent load of <2 x i32> assumes 32 bits per value.

This causes incorrect values being read.

llvm-beanz commented 10 months ago

So... This is probably only a problem because you're interpreting the DXIL as an LLVM module, which is technically not how it is supposed to be interpreted. LLVM doesn't have the ability to express different data layouts per address space. The buffer packing rules in HLSL require that device memory be packed differently and we make assumptions that the packing is consistent.

In this case you're saying that the payload is somehow getting bit-packed in device memory (which violates the DXIL packing rules), and then we're reading it as if it was not bit-packed.

I think we probably need @tex3d to weigh in here.

jasilvanus commented 10 months ago

Indeed this issue is exposed by interpreting DXIL as modern LLVM IR. My main goal is to get a common understanding of the situation, and avoid issues in drivers that rely on importing DXIL modules into modern LLVM.

If your assessment is that this is valid DXIL but invalid LLVM IR and requires drivers that rely on modern LLVM to fixup DXIL before processing, we'll have to accept that. While modern LLVM is very clear that vectors are always bit packed, this fact at least is not explicitly documented in the LLVM 3.7 LangRef, so one might argue this code pattern to be potentially valid in DXIL.

(Note that pedantically speaking, DXIL using the legacy data layout is already invalid IR with respect to modern LLVM due to invalid i8 alignment.)

The issue is not just that drivers have to ensure the layout in memory matches what DXIL expects, but that using invalid LLVM IR can break in transforms. For example, consider the following slightly modified HLSL example:

struct MyPayload
{
    // Ensure accessing matrix requires a GEP.
    // Folding that GEP with GEPs into matrix breaks.
    int dummy;
    bool1x2 matrix;
};

[shader("callable")]
void OuterCallable(inout MyPayload outerPayload)
{
   MyPayload localPayload = outerPayload;
   outerPayload = localPayload;
}

Compiling with dxc -T lib_6_6 yields:

%struct.MyPayload = type { i32, %class.matrix.bool.1.2 }
%class.matrix.bool.1.2 = type { [1 x <2 x i1>] }

; Function Attrs: nounwind
define void @"\01?OuterCallable@@YAXUMyPayload@@@Z"(%struct.MyPayload* noalias nocapture %outerPayload) #0 {
  %1 = getelementptr inbounds %struct.MyPayload, %struct.MyPayload* %outerPayload, i32 0, i32 1
  %2 = bitcast %class.matrix.bool.1.2* %1 to <2 x i32>*
  %3 = load <2 x i32>, <2 x i32>* %2, align 4
  %4 = extractelement <2 x i32> %3, i32 0
  %5 = icmp ne i32 %4, 0
  %6 = extractelement <2 x i32> %3, i32 1
  %7 = icmp ne i32 %6, 0
  %8 = zext i1 %5 to i32
  %9 = zext i1 %7 to i32
  %10 = insertelement <2 x i32> undef, i32 %8, i32 0
  %11 = insertelement <2 x i32> %10, i32 %9, i32 1
  store <2 x i32> %11, <2 x i32>* %2, align 4
  ret void
}

Fixing i8 alignment and processing with opt -passes="vector-combine,instcombine" -S using LLVM ( bc265bd66) results in:

define void @"\01?OuterCallable@@YAXUMyPayload@@@Z"(ptr noalias nocapture %outerPayload) #0 {
  %1 = getelementptr inbounds %struct.MyPayload, ptr %outerPayload, i32 0, i32 1
  %2 = load i32, ptr %1, align 4
  %3 = icmp ne i32 %2, 0
  %4 = getelementptr inbounds %struct.MyPayload, ptr %outerPayload, i32 1
  %5 = load i32, ptr %4, align 4
  %6 = icmp ne i32 %5, 0
  %7 = zext i1 %3 to i32
  %8 = zext i1 %6 to i32
  %9 = insertelement <2 x i32> undef, i32 %7, i64 0
  %10 = insertelement <2 x i32> %9, i32 %8, i64 1
  store <2 x i32> %10, ptr %1, align 4
  ret void
}

Note the out-of-bounds %4 = getelementptr inbounds %struct.MyPayload, ptr %outerPayload, i32 1.

This is caused by vector-combine replacing (GEP, load, extractelement) by (GEP, GEP, load), and instcombine merging the two GEPs.

Even if we cannot agree on this being a DXC bug (I still think it is), it may be reasonable to consider whether DXC can be adapted with reasonable effort to avoid this issue. Potentially all DXIL vectors of overaligned elements (including 16-bit types with the legacy data layout) are affected. These options would avoid this issue:

Use i32 vectors instead of i1 vectors to represent boolean matrix rows.
- This feels most natural and would match what DXC is already doing for bool vectors. It doesn't help though for overaligned 16 bit types in the legacy data layout unless these are also replaced by i32.
Use an i1 array instead of vector (arrays respect alignment in contrast to vectors).
Avoid the bitcast and directly load the <2 x i1> vector.
- This would solve this apparent issue, but feels a bit fragile.

llvm-beanz commented 10 months ago

While modern LLVM is very clear that vectors are always bit packed, this fact at least is not explicitly documented in the LLVM 3.7 LangRef, so one might argue this code pattern to be potentially valid in DXIL.

One thing to be clear about here. DXIL may be a serialized LLVM module, but DXIL actually defines the behavior of many IR constructs differently than LLVM 3.7 did. Just because the 3.7 LangRef says something doesn't mean that's how it is supposed to be interpreted in DXIL.

The issue is not just that drivers have to ensure the layout in memory matches what DXIL expects, but that using invalid LLVM IR can break in transforms.

If you are reading DXIL into modern LLVM IR, you should be doing some sort of legalization pass to fix up the DXIL. As you commented the DataLayout is invalid, also the LLVM bitcode upgrader doesn't handle DXILs fast-math flags correctly, or even the floating point semantics that have changed between LLVM IR versions. We've seen a bunch of bugs as a result of people treating DXIL as if it is valid LLVM IR and trusting the bitcode upgrader.

Even if we cannot agree on this being a DXC bug (I still think it is), it may be reasonable to consider whether DXC can be adapted with reasonable effort to avoid this issue.

The approach we're taking for this is that as we implement DXIL-generation support in LLVM we're going to be implementing a set of transformations to transform DXIL into valid LLVM IR. We'd welcome collaboration on getting that working in the public LLVM tree.

Potentially all DXIL vectors of over-aligned elements (including 16-bit types with the legacy data layout) are affected.

Over-aligned vector elements is actually a whole other issue... We need to extend LLVM's DataLayout to represent a few things that we kinda "defined" in DXIL but didn't really extend LLVM IR for. Those include over-aligned vector elements, DXIL's odd vector alignment rules, and per-address space data layout rules.

jasilvanus commented 10 months ago

While modern LLVM is very clear that vectors are always bit packed, this fact at least is not explicitly documented in the LLVM 3.7 LangRef, so one might argue this code pattern to be potentially valid in DXIL.

One thing to be clear about here. DXIL may be a serialized LLVM module, but DXIL actually defines the behavior of many IR constructs differently than LLVM 3.7 did. Just because the 3.7 LangRef says something doesn't mean that's how it is supposed to be interpreted in DXIL.

Agreed. I was just unaware of vector bit-layouts being specified to behave differently for DXIL, and thus was assuming LLVM semantics. Is bit-layout of vectors in DXIL documented somewhere? I've read the buffer packing docs you've linked above, but it doesn't go into the details of how vectors (of overaligned elements) are laid out. The only hint in this direction I could find are the comments that bools and min-storage types have a "storage of 32 bits". Also, all of this seems to be rather on the HLSL side of things.

Based on what you say, my working understanding for DXIL semantics (not layout of buffers exposed to apps) now is:

vectors are laid out exactly as arrays, respecting elements' alignment
padding bits are treated as zero bits (this is required for the icmp in the example DXIL to make sense)

The approach we're taking for this is that as we implement DXIL-generation support in LLVM we're going to be implementing a set of transformations to transform DXIL into valid LLVM IR. We'd welcome collaboration on getting that working in the public LLVM tree.

Sounds interesting. Has already something been done upstream in this direction? I'm planning to implement a fixup for the data layout issue that may end up open-source, maybe this can at some point be integrated into the upstream work.

Potentially all DXIL vectors of over-aligned elements (including 16-bit types with the legacy data layout) are affected.

Over-aligned vector elements is actually a whole other issue... We need to extend LLVM's DataLayout to represent a few things that we kinda "defined" in DXIL but didn't really extend LLVM IR for. Those include over-aligned vector elements, DXIL's odd vector alignment rules, and per-address space data layout rules.

I'm not sure this is a whole other issue - the difference between LLVM IR and DXIL on this snippet is the layout of <2 x i1>, isn't it?

damyanp commented 5 months ago

We need to do some thinking about this so we can triage it appropriately. @tex3d, I think we need some of your help here (see above in the thread).

microsoft / DirectXShaderCompiler

Incorrect DXIL bitcasts generated for bool matrices in ray payloads #6082