m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net

Fixed alignment when marshalling Cuda kernel parameters. #1176

Closed: MoFtZ closed this 5 months ago

MoFtZ commented 6 months ago

Fixes #1145.

When launching a Cuda kernel, ILGPU will create a runtime struct that represents all the kernel parameters. These kernel parameters are then passed to cuLaunchKernel as a single buffer. It is up to the caller of cuLaunchKernel (i.e. ILGPU) to make sure the alignment/padding of this buffer is correct.

This runtime struct is created using the alignment/padding rules of the .NET runtime. Using the example of: Kernel(Index1D index, ArrayView1D<Int128, Stride1D.Dense> output, Int128 constant)

.NET would create a struct as:

The additional padding at Offset 24 does not match the alignment rules of Cuda - it is expecting to find the Int128 at Offset 24.

Changing the order: Kernel(Index1D index, Int128 constant, ArrayView1D<Int128, Stride1D.Dense> output)

.NET would create a struct as:

Again, the additional padding at Offset 4 does not match the alignment rules of Cuda - it is expecting to find the Int128 at Offset 8.
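The mismatch in the reordered example can be reproduced with plain arithmetic. The following is a minimal sketch, not ILGPU code: `ParameterLayout`, `AlignUp` and `Offsets` are hypothetical names, Index1D is modelled as a 4-byte value, and the device side is assumed to apply an 8-byte alignment to Int128 (which is what the expected Offset 8 implies).

```csharp
using System;

public static class ParameterLayout
{
    // Rounds offset up to the next multiple of alignment (alignment > 0).
    public static int AlignUp(int offset, int alignment) =>
        (offset + alignment - 1) / alignment * alignment;

    // Places fields one after another, padding each to its own alignment.
    public static int[] Offsets(params (int Size, int Alignment)[] fields)
    {
        var offsets = new int[fields.Length];
        int cursor = 0;
        for (int i = 0; i < fields.Length; i++)
        {
            offsets[i] = AlignUp(cursor, fields[i].Alignment);
            cursor = offsets[i] + fields[i].Size;
        }
        return offsets;
    }

    public static void Main()
    {
        // (size, alignment) pairs: Index1D, then Int128.
        var hostSide = Offsets((4, 4), (16, 16));   // Int128 aligned to 16 bytes
        var deviceSide = Offsets((4, 4), (16, 8));  // Int128 aligned to 8 bytes (assumption)
        Console.WriteLine($"16-byte rule: Int128 at offset {hostSide[1]}");
        Console.WriteLine($"8-byte rule:  Int128 at offset {deviceSide[1]}");
    }
}
```

With these inputs the 16-byte rule places the Int128 at offset 16 and the 8-byte rule places it at offset 8, which is the disagreement described above.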


Attempt 1

~~Cuda will handle the alignment/padding for us. ILGPU continues to use the single buffer as the placeholder/container of all the kernel parameters.~~

Attempt 2

Attempt 3

Looks like the issue is specific to Int128. It is not treated as a regular struct: the .NET runtime considers it an intrinsic type and aligns it to 16 bytes. Modified ILGPU to pre-register Int128 and force a 16-byte alignment.
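The idea of pre-registering a type with a forced alignment can be sketched as a small lookup table. This is only an illustration: `AlignmentOverrides`, `Register` and `GetAlignment` are hypothetical names and not ILGPU's API, and the `computedAlignment` argument stands in for whatever a field-by-field walk of the type would produce.

```csharp
using System;
using System.Collections.Generic;

public static class AlignmentOverrides
{
    // Types whose alignment must not be derived from their fields.
    private static readonly Dictionary<Type, int> Overrides = new()
    {
        [typeof(Int128)] = 16, // pre-registered, as described in Attempt 3
    };

    // Registers (or replaces) a forced alignment; must be a power of two.
    public static void Register(Type type, int alignment)
    {
        if (alignment <= 0 || (alignment & (alignment - 1)) != 0)
            throw new ArgumentOutOfRangeException(nameof(alignment));
        Overrides[type] = alignment;
    }

    // Returns the forced alignment if one is registered, else the computed one.
    public static int GetAlignment(Type type, int computedAlignment) =>
        Overrides.TryGetValue(type, out var forced) ? forced : computedAlignment;

    public static void Main()
    {
        // 8 is what treating Int128 as two 64-bit fields would yield.
        Console.WriteLine(GetAlignment(typeof(Int128), 8));
        Console.WriteLine(GetAlignment(typeof(long), 8));
    }
}
```

Keeping the override separate from the computed value means ordinary structs are unaffected and only the registered types change their layout.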

MoFtZ commented 5 months ago

@m4rs-mt just a thought... this works for the top-level structure, but what about nested structures? I don't think it will work there.

MoFtZ commented 5 months ago

@m4rs-mt OK, this is definitely a problem.

Placing an Int128 inside another struct (MyStruct below) and passing that struct as a kernel parameter also causes alignment/padding issues.

public struct MyStruct
{
    public byte X;
    public Int128 Y;
}

Y is aligned at Offset 16 in .NET, but Cuda expects it at Offset 8.
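The .NET side of this can be checked by measuring the field offset at run time. The sketch below is an assumption-laden probe rather than part of the fix: `OffsetProbe` is a hypothetical helper, and the value it reports depends on the runtime version and platform, so no particular output is claimed here.

```csharp
using System;
using System.Runtime.CompilerServices;

public struct MyStruct
{
    public byte X;
    public Int128 Y;
}

public static class OffsetProbe
{
    // Measures how many bytes lie between the start of the struct and field Y.
    public static long OffsetOfY()
    {
        var value = default(MyStruct);
        ref byte start = ref Unsafe.As<MyStruct, byte>(ref value);
        ref byte fieldY = ref Unsafe.As<Int128, byte>(ref value.Y);
        return (long)Unsafe.ByteOffset(ref start, ref fieldY);
    }

    public static void Main()
    {
        Console.WriteLine($"Y starts at byte {OffsetOfY()}");
        Console.WriteLine($"sizeof(MyStruct) = {Unsafe.SizeOf<MyStruct>()}");
    }
}
```

Comparing the reported offset against the offset the generated kernel expects shows whether the managed struct can be handed to the driver unchanged.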

MoFtZ commented 5 months ago

@m4rs-mt found another issue: if the kernel parameter is a structure containing a nested structure, ILGPU will flatten all the fields. Flattening can change the byte offsets of fields, which means we can no longer use the .NET representation of the variable directly when using fixed.
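A small model shows how flattening can move a field. This is a sketch under stated assumptions, not ILGPU's flattening logic: `FieldNode` and `FlattenDemo` are hypothetical, the nested layout follows the usual sequential rule (each struct is padded to a multiple of its own alignment), and the flattened layout simply places every leaf field in order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class FieldNode
{
    public string Name { get; }
    public int Size { get; }        // meaningful for leaf fields only
    public int Alignment { get; }   // meaningful for leaf fields only
    public FieldNode[] Children { get; }
    public bool IsLeaf => Children.Length == 0;

    public FieldNode(string name, int size, int alignment)
    {
        Name = name; Size = size; Alignment = alignment;
        Children = Array.Empty<FieldNode>();
    }

    public FieldNode(string name, params FieldNode[] children)
    {
        Name = name; Children = children;
    }
}

public static class FlattenDemo
{
    static int AlignUp(int offset, int alignment) =>
        (offset + alignment - 1) / alignment * alignment;

    // A struct is as strictly aligned as its most strictly aligned field.
    public static int AlignmentOf(FieldNode n) =>
        n.IsLeaf ? n.Alignment : n.Children.Max(c => AlignmentOf(c));

    // Places n at or after start; returns the first byte past it (with tail padding).
    static int Place(FieldNode n, int start, Dictionary<string, int> offsets)
    {
        int begin = AlignUp(start, AlignmentOf(n));
        if (n.IsLeaf) { offsets[n.Name] = begin; return begin + n.Size; }
        int cursor = begin;
        foreach (var child in n.Children) cursor = Place(child, cursor, offsets);
        return AlignUp(cursor, AlignmentOf(n));
    }

    public static Dictionary<string, int> NestedOffsets(FieldNode root)
    {
        var offsets = new Dictionary<string, int>();
        Place(root, 0, offsets);
        return offsets;
    }

    static void CollectLeaves(FieldNode n, List<FieldNode> leaves)
    {
        if (n.IsLeaf) { leaves.Add(n); return; }
        foreach (var child in n.Children) CollectLeaves(child, leaves);
    }

    // Flattened layout: every leaf placed in order, nested struct padding is lost.
    public static Dictionary<string, int> FlattenedOffsets(FieldNode root)
    {
        var leaves = new List<FieldNode>();
        CollectLeaves(root, leaves);
        var offsets = new Dictionary<string, int>();
        int cursor = 0;
        foreach (var leaf in leaves)
        {
            offsets[leaf.Name] = AlignUp(cursor, leaf.Alignment);
            cursor = offsets[leaf.Name] + leaf.Size;
        }
        return offsets;
    }

    // Outer { Inner { long A; byte B; } ; byte C; }
    public static FieldNode Example() => new FieldNode("Outer",
        new FieldNode("Inner", new FieldNode("A", 8, 8), new FieldNode("B", 1, 1)),
        new FieldNode("C", 1, 1));

    public static void Main()
    {
        var root = Example();
        Console.WriteLine($"nested:    C at {NestedOffsets(root)["C"]}");
        Console.WriteLine($"flattened: C at {FlattenedOffsets(root)["C"]}");
    }
}
```

In this model the nested layout puts C at offset 16, because Inner is padded to 16 bytes, while the flattened layout puts it at offset 9. A buffer pinned in the nested form would therefore not match a kernel that reads the flattened form.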