Fixed alignment when marshalling Cuda kernel parameters.

MoFtZ commented 6 months ago

Fixes #1145.

When launching a Cuda kernel, ILGPU will create a runtime struct that represents all the kernel parameters. These kernel parameters are then passed to cuLaunchKernel as a single buffer. It is up to the caller of cuLaunchKernel (i.e. ILGPU) to make sure the alignment/padding of this buffer is correct.

This runtime struct is created using the alignment/padding rules of the .NET runtime. Take the following kernel as an example: Kernel(Index1D index, ArrayView1D<Int128, Stride1D.Dense> output, Int128 constant)

.NET would create a struct as:

Offset 0 = Index1D (4 bytes)
Offset 4 = Padding (4 bytes)
Offset 8 = ArrayView (16 bytes)
Offset 24 = Padding (24 bytes)
Offset 48 = Int128 (16 bytes)

The additional padding at Offset 24 does not match Cuda's alignment rules: Cuda expects to find the Int128 at Offset 24.

Changing the order: Kernel(Index1D index, Int128 constant, ArrayView1D<Int128, Stride1D.Dense> output)

.NET would create a struct as:

Offset 0 = Index1D (4 bytes)
Offset 4 = Padding (12 bytes)
Offset 16 = Int128 (16 bytes)
Offset 32 = ArrayView (16 bytes)

Again, the additional padding at Offset 4 does not match Cuda's alignment rules: Cuda expects to find the Int128 at Offset 8.
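The mismatch in the second example can be reproduced with plain arithmetic. The following is a minimal sketch, not ILGPU code: `AlignUp` and `ComputeOffsets` are hypothetical helpers, the field sizes are taken from the listing above, and the per-field alignments (in particular 8 bytes for the ArrayView and for Int128 on the Cuda side) are assumptions chosen to match the offsets reported in this thread.

```csharp
using System;

// Round offset up to the next multiple of alignment (alignment must be a power of two).
static int AlignUp(int offset, int alignment) => (offset + alignment - 1) & ~(alignment - 1);

// Lay out fields in declaration order, aligning each one to its own alignment.
static int[] ComputeOffsets((int Size, int Alignment)[] fields)
{
    var offsets = new int[fields.Length];
    int cursor = 0;
    for (int i = 0; i < fields.Length; i++)
    {
        cursor = AlignUp(cursor, fields[i].Alignment);
        offsets[i] = cursor;
        cursor += fields[i].Size;
    }
    return offsets;
}

// Kernel(Index1D index, Int128 constant, ArrayView1D<...> output)
// Assumed (size, alignment) pairs; only the Int128 alignment differs between the two views.
var dotnetLayout = ComputeOffsets(new[] { (4, 4), (16, 16), (16, 8) });
var cudaLayout = ComputeOffsets(new[] { (4, 4), (16, 8), (16, 8) });

Console.WriteLine(string.Join(", ", dotnetLayout)); // 0, 16, 32
Console.WriteLine(string.Join(", ", cudaLayout));   // 0, 8, 24
```

Under these assumptions, the only difference between the two layouts is the alignment applied to Int128, which is enough to shift every field that follows it.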

Attempt 1

~~This PR modifies how ILGPU provides the kernel parameters to Cuda. Instead of a single buffer that needs to be manually aligned, we now use an array of pointers, to the kernel parameters.~~

~~Cuda will handle the alignment/padding for us. ILGPU continues to use the single buffer as the placeholder/container of all the kernel parameters.~~

Attempt 2

~~Changed the PTX Argument Mapper to manually align the kernel parameter fields.~~

Attempt 3

Looks like the issue is specific to Int128. It is not treated as a regular struct. The .NET runtime considers it an intrinsic type and aligns it to 16 bytes. Modified ILGPU to pre-register Int128 and force a 16-byte alignment.

MoFtZ commented 5 months ago

@m4rs-mt just a thought... this works for the top-level structure, but what about nested structures? I don't think it will work there.

MoFtZ commented 5 months ago

@m4rs-mt OK, this is definitely a problem.

Placing Int128 inside another struct, such as MyStruct below, and passing that struct as a kernel parameter will also cause alignment/padding issues.

```csharp
public struct MyStruct
{
    public byte X;
    public Int128 Y;
}
```

Y is aligned at Offset 16 in .NET, but Cuda expects it at Offset 8.
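To check the .NET side of this claim on a given machine, the field offset can be measured at runtime. This is a small sketch that assumes .NET 7 or later (where `System.Int128` exists) and uses `System.Runtime.CompilerServices.Unsafe`; it only measures what the runtime does and says nothing about what Cuda expects. The thread reports 16 here, so treat the printed value as something to confirm on your own runtime and platform.

```csharp
using System;
using System.Runtime.CompilerServices;

var s = default(MyStruct);

// Byte distance from the start of the struct to the start of field Y.
nint offset = Unsafe.ByteOffset(
    ref Unsafe.As<MyStruct, byte>(ref s),
    ref Unsafe.As<Int128, byte>(ref s.Y));

Console.WriteLine($"Offset of Y: {offset}");
Console.WriteLine($"Size of MyStruct: {Unsafe.SizeOf<MyStruct>()}");

// Same shape as the struct in the comment above.
public struct MyStruct
{
    public byte X;
    public Int128 Y;
}
```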

MoFtZ commented 5 months ago

@m4rs-mt found another issue: if the kernel parameter is a structure containing a nested structure, ILGPU will flatten all the fields. Flattening can change the byte offsets of the fields, which means that we can no longer use the .NET representation of the variable as-is when pinning it with fixed.
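Why flattening can move fields is easiest to see with numbers. The sketch below is a generic illustration and does not reproduce ILGPU's actual flattening logic, which is not shown in this thread: `Layout` is a hypothetical helper, and `Outer`/`Inner` are made-up types with assumed natural alignments (1 byte for `byte`, 8 bytes for `long`).

```csharp
using System;

static int AlignUp(int offset, int alignment) => (offset + alignment - 1) / alignment * alignment;

// Lay out fields in declaration order. Returns the per-field offsets plus the
// struct's own alignment (largest field alignment) and its padded size.
static (int[] Offsets, int Alignment, int Size) Layout((int Size, int Alignment)[] fields)
{
    var offsets = new int[fields.Length];
    int cursor = 0, maxAlign = 1;
    for (int i = 0; i < fields.Length; i++)
    {
        cursor = AlignUp(cursor, fields[i].Alignment);
        offsets[i] = cursor;
        cursor += fields[i].Size;
        maxAlign = Math.Max(maxAlign, fields[i].Alignment);
    }
    return (offsets, maxAlign, AlignUp(cursor, maxAlign));
}

// Hypothetical types:
//   struct Inner { byte C; long D; }
//   struct Outer { byte A; Inner B; }
var inner = Layout(new[] { (1, 1), (8, 8) });
var outer = Layout(new[] { (1, 1), (inner.Size, inner.Alignment) });

// Nested view: Inner is aligned as a unit, then its fields sit inside it.
int nestedC = outer.Offsets[1] + inner.Offsets[0];
int nestedD = outer.Offsets[1] + inner.Offsets[1];

// Flattened view: A, C and D are laid out as siblings.
var flat = Layout(new[] { (1, 1), (1, 1), (8, 8) });

Console.WriteLine($"nested:    C={nestedC}, D={nestedD}");                 // nested:    C=8, D=16
Console.WriteLine($"flattened: C={flat.Offsets[1]}, D={flat.Offsets[2]}"); // flattened: C=1, D=8
```

In the nested view the inner struct is first aligned as a whole, so its fields inherit that padding. Once the fields are treated as siblings, that padding disappears and the offsets no longer match the bytes of the original .NET value.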

m4rs-mt / ILGPU

Fixed alignment when marshalling Cuda kernel parameters. #1176