m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net

Feature request: allow lambdas in kernels when they can be evaluated at compile time #463

Open lostmsu opened 3 years ago

lostmsu commented 3 years ago

Rationale

This request is syntactic sugar for writing C# classes that provide some GPGPU capabilities.

Imagine you are trying to implement an ISqlCalc that needs to perform a few operations on arrays using ILGPU.

interface ISqlCalc {
  int[] Neg(int[] a);
  int[] BitwiseComplement(int[] a);
}

class GpuSqlCalc : ISqlCalc {
  static void UnaryOpKernel(Index1 i, ArrayView<int> data, Func<int, int> op)
    => data[i] = op(data[i]);

  static Action<Index1, ArrayView<int>> UnaryOp(Func<int, int> op) {
    return accelerator.LoadAutoGroupedStreamKernel<
                Index1,
                ArrayView<int>
                >((i, d) => UnaryOpKernel(i, d, op));
  }

  // Buffer allocation, kernel launch, and copy-back elided for brevity.
  public int[] Neg(int[] v) => UnaryOp(v => -v);
  public int[] BitwiseComplement(int[] v) => UnaryOp(v => ~v);
}

The point is that it should be possible to inline v => -v. The delegate instance has a MethodInfo pointing to its body, and that method never references this, so it is essentially static.

Workaround

Currently, the best way I have come up with to share something analogous to UnaryOpKernel across all unary ops is generic monomorphization, like this:

interface IUnaryOp<T> { T Apply(T val); }

static void UnaryOpKernel<TOp>(Index1 i, ArrayView<int> data)
  where TOp: struct, // this fails with a class, but really should not in this particular scenario
             IUnaryOp<int>
{
  data[i] = default(TOp).Apply(data[i]);
}

struct Neg: IUnaryOp<int> { public int Apply(int val) => -val; }

accelerator.LoadAutoGroupedStreamKernel<
                Index1,
                ArrayView<int>
                >(UnaryOpKernel<Neg>);

While this works, it is ugly and unnecessarily wordy.

The struct constraint also prevents me from at least doing

class Neg: BaseOp, IUnaryOp<int> {
  ... overrides of BaseOp members that call into UnaryOpKernel<Neg> ...

  public int Apply(int val) => -val;
}

This fails with "Class type 'Neg' is not supported", even though this is never used and Apply is essentially static.

lostmsu commented 3 years ago

Hm, I started working on this, and I am seeing existing pieces of code that look very relevant: MethodExtensions.IsNotCapturingLambda.

@MoFtZ MethodExtensions.GetParameterOffset seems to be returning the wrong value for a simple class with no fields or properties. What was the reasoning for it to return 0 for lambdas? AFAIK lambdas are implemented as instance methods on a hidden class, so it should return 1.

m4rs-mt commented 3 years ago

@lostmsu Thank you for your feature request. We have already discussed the feature in our weekly talk-to-dev sessions. We currently believe that we should add support for lambdas via ILGPU's dynamic specialization features. We can also translate calls to lambda functions into calls to "opaque" functions annotated with specific attributes, which prevents the generated stubs from being inlined or modified.

However, adding support for arbitrary lambdas also requires special care in capturing values and returning lambda closures within kernel functions. Moreover, we can add this feature to the v1.1 feature list 🚀

lostmsu commented 3 years ago

@m4rs-mt thanks for the promising response. Is there anyone already working on that feature?

I started my own attempt at implementing it by replacing the key type in this dictionary: https://github.com/m4rs-mt/ILGPU/blob/93b6551fdc960bede5246dc8ebedc5f2ee773411/Src/ILGPU/Frontend/ILFrontend.cs#L455 with a composite of MethodBase + a Value?[] array of arguments whose values are known at compile time (in this case, a delegate pointing to a known method). This approach does not seem to align with the idea of "dynamic specialization features". Should I pause it?

MoFtZ commented 3 years ago

@lostmsu Thanks for looking into this topic.

Yes, you are correct that lambdas are implemented as instance methods on a hidden class. Originally, ILGPU only supported static methods, which do not have a this pointer. When adding support for non-capturing lambdas, we remove the this pointer from the lambda and treat it like a static method. This means that the arguments are shifted and the parameter offset is 0, the same as for a static method.

If you find that it is easier to make your changes if the parameter offset is 1, then it is fine to change.

MoFtZ commented 3 years ago

> @m4rs-mt thanks for the promising response. Is there anyone already working on that feature?
>
> I started my own take at implementing it by replacing the key type in this dictionary:
>
> https://github.com/m4rs-mt/ILGPU/blob/93b6551fdc960bede5246dc8ebedc5f2ee773411/Src/ILGPU/Frontend/ILFrontend.cs#L455 to a composite of MethodBase + Value?[] array of arguments whose values are known at compile time (in this case a delegate pointing to a known method). This approach does not seem to align with the idea of "dynamic specialization features". Should I pause it?

@lostmsu There is no one currently working on this feature, so if you have the time and passion, we would wholeheartedly welcome your contributions.

We have previously discussed how to support lambda functions to provide the functionality requested. In your example, you have supplied the lambda function as a method parameter to UnaryOp, which then calls LoadAutoGroupedStreamKernel using a lambda function that captures Func<int, int> op. This is related to, but different from, #415, which uses a static member variable as the technique for supplying the lambda function.

Regarding "dynamic specialization features", I believe @m4rs-mt is referring to a technique similar to SpecializedValue in ILGPU: https://github.com/m4rs-mt/ILGPU/wiki/Dynamically-Specialized-Kernels The idea is that calling LoadXxxKernel does an initial compilation of the kernel. Then, when actually launching a kernel that uses SpecializedValue, a further compilation phase is performed that will "dynamically specialize" the kernel. With regards to lambda functions, it could be something like having SpecializedFunc (or more generically, SpecializedDelegate) as a kernel parameter.
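As a rough sketch of that existing mechanism (assuming the setup from the linked wiki page, with an accelerator and a device view already in scope):

```csharp
// Sketch of the existing SpecializedValue mechanism (see the wiki link above).
// The kernel is compiled once up front; each distinct specialized value triggers
// a further specialization at launch, so divisor.Value acts like a compile-time constant.
static void DivideKernel(Index1 index, ArrayView<int> data, SpecializedValue<int> divisor)
    => data[index] = data[index] / divisor.Value;

var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index1, ArrayView<int>, SpecializedValue<int>>(DivideKernel);

// Launching with different values produces separately specialized GPU kernels
// behind the single loaded instance.
kernel(view.Length, view, SpecializedValue.New(2));
kernel(view.Length, view, SpecializedValue.New(3));
```

A SpecializedFunc kernel parameter would extend this same launch-time specialization step from values to delegates.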

Note that this is still an open-ended discussion. For example, should we support lambdas that are static member variables, as in #415? Is dynamic specialization the correct approach for how it will be used? Should capturing lambdas be supported? And if so, to what extent? Also note that it is not necessary to solve all these questions now - we can slowly build some functionality while deferring other, more "problematic" functionality, like capturing lambdas.

lostmsu commented 3 years ago

@MoFtZ the problem I see with LoadXxxKernel followed by a launch with a SpecializedValue is that the original kernel would need to support non-specialized lambda values, and I currently do not see how those could be compiled: their usage involves the IL opcode ldftn and eventually boils down to an indirect function call, which AFAIK (I am no expert on GPGPU) is only available on very recent hardware.

That was my reasoning behind the idea to propagate lambda at the initial compile time.

m4rs-mt commented 3 years ago

@lostmsu I don't think we'll run into any problems with respect to the ldftn opcode when translating it into an IR function call to an opaque function. Consequently, we can resolve the call target at kernel launch time by providing a function to the kernel and leaving the specialization work to the ILGPU compiler. However, this generally does not cover all use cases 😄

@lostmsu Regarding your suggestion and implementation: I have experimented with different ways to implement lambdas in the compiler, as they involve handling class types inside kernels. I still believe that mapping these OpCodes to partial function calls + dynamic specialization of the call sites might be the best way to implement them. Anyway, we are always open to PRs that add new features 🤓👍

I was wondering about changing the mapping

> to a composite of MethodBase + Value?[] array of arguments whose values are known at compile time (in this case a delegate pointing to a known method). This approach does not seem to align with the idea of "dynamic specialization features". Should I pause it?

to a tuple of a MethodBase and a Value array. Is the value array intended to represent captured variables from the environment of the function? And where do these values come from? Are they created by the IRBuilder from .Net values? If yes, how do we compare them "properly" for equality? I ask about equality checking because primitive constants are instantiated multiple times and are not treated as the same value in the compiler, for efficiency. In other words, the integer constant 1 and another constant 1 will not be the same value in memory.

lostmsu commented 3 years ago

Sorry for the delay here, @MoFtZ @m4rs-mt. Have you guys given any more thought to this? Do you have notes?

I checked out the current code that handles SpecializedValue, and as-is it seems tailored to scenarios where the value being specialized is already one of the supported values (which delegate instances are not). It might be possible to rework it a bit to get identical behavior but disallow running generic kernels that have unspecialized parameters of reference types. Or just explicitly add a different GenericValue<T>, which behaves exactly like SpecializedValue<T> but must always be specialized.

@m4rs-mt mentioned dynamic specialization. Can you elaborate on the idea? Is it different from the above?

I have not looked at it, but if ILGPU already has cross-function constant propagation that might be another way to approach the problem.

MoFtZ commented 3 years ago

@lostmsu We have not defined a preferred API, so you are welcome to design it as you see fit.

I believe that "dynamic specialization" is referring to the concept used by SpecializedValue<T>. That is, when the kernel is launched, it will be provided with the delegate as a parameter. This delegate will then be integrated into the final kernel that runs on the GPU.
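Under that approach, the requested lambda support could look roughly like the following (a purely hypothetical sketch; SpecializedFunc is a proposed name from the discussion above, not an existing ILGPU type):

```csharp
// Hypothetical API sketch only -- SpecializedFunc<,> does not exist in ILGPU today.
static void UnaryOpKernel(Index1 i, ArrayView<int> data, SpecializedFunc<int, int> op)
    => data[i] = op.Value(data[i]);

// At launch, the delegate would be resolved to a known method and the kernel
// re-specialized for that call target, mirroring how SpecializedValue<T> works.
kernel(view.Length, view, SpecializedFunc.New<int, int>(v => -v));
```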

lostmsu commented 3 years ago

@MoFtZ @m4rs-mt is there some architectural description of ILGPU? I find it hard to wrap my head around existing translation phases, values, and IR without one.

MoFtZ commented 3 years ago

There is no such documentation at the moment. If you'd like to join us on Discord, we will try to answer any questions you have: https://discord.com/invite/X6RBCff

At a very high level, ILGPU follows a typical compiler design, with a Frontend that decodes MSIL into an Intermediate Representation (IR):
https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/Frontend/DisassemblerDriver.cs
https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/Frontend/ILFrontend.cs#L473
https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/Frontend/CodeGenerator/Driver.cs

Several optimisation phases are performed on this IR: https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/IR/Transformations/Optimizer.cs

And finally, the IR is transformed using the Backends, to target Cuda or OpenCL: https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/Backends/CodeGeneratorBackend.cs#L72

Additional resources:
https://www.tutorialspoint.com/compiler_design/index.htm
https://en.wikipedia.org/wiki/Static_single_assignment_form

lostmsu commented 1 year ago

This now might be easier with new C# static abstract interface members. Relevant IL changes: https://github.com/dotnet/runtime/pull/49558/files
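A hypothetical sketch of how the earlier struct-based workaround could look with static abstract interface members (plain CPU code, no ILGPU; requires C# 11 / .NET 7):

```csharp
using System;

// Hypothetical sketch: the IUnaryOp workaround from above, rewritten with a
// static abstract interface member so the op is invoked as TOp.Apply(...)
// instead of default(TOp).Apply(...), and no operator instance is needed.
interface IUnaryOp<T> { static abstract T Apply(T val); }

struct Neg : IUnaryOp<int> { public static int Apply(int val) => -val; }

static class Demo
{
    // Stand-in for the kernel body: on the GPU this would be data[i] = TOp.Apply(data[i]).
    static void UnaryOp<TOp>(int[] data) where TOp : IUnaryOp<int>
    {
        for (int i = 0; i < data.Length; i++)
            data[i] = TOp.Apply(data[i]);
    }

    static void Main()
    {
        var data = new[] { 1, 2, 3 };
        UnaryOp<Neg>(data);
        Console.WriteLine(string.Join(",", data)); // -1,-2,-3
    }
}
```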

MoFtZ commented 1 year ago

@lostmsu We recently added support for Generic Math, which makes use of Static Abstract Interface members. If you would like to try it out, it is available in a preview release of ILGPU.

Darelbi commented 6 months ago

I need exactly that; assume I have a dynamic composition of different algorithms (NeuraSharp). Something like the following would also be useful.

Declare the interfaces with static methods:

public interface IAlgorithm1 {
  public static abstract void DoAlgorithm(float[] input, float[] output);
}

public interface IFunction1 {
  public static abstract float DoSum(float[] input);
}

And then implement them:

public class MyAlgorithm1<T> : IAlgorithm1 where T : IFunction1 {
  public static void DoAlgorithm(float[] input, float[] output) {
    for (int j = 0; j < output.Length; j++) {
      // call to the static method of the generic type
      output[j] = 2.0f * T.DoSum(input);
    }
  }
}

public class NormalSum1 : IFunction1 {
  public static float DoSum(float[] input) {
    float sum = 0.0f;
    for (int i = 0; i < input.Length; i++)
      sum += input[i];
    return sum;
  }
}

// load this as a kernel
MyAlgorithm1<NormalSum1>.DoAlgorithm;

Actually, I'm looking at how to automatically generate inlined IL code, but it is a daunting task; if the feature is already there, that would be great...

What kind of syntax exactly is supported in the preview, just out of curiosity?

MoFtZ commented 6 months ago

hi @Darelbi.

This is a long-running thread, so some of the information above is outdated.

Currently, using lambdas within a kernel is still not supported.

On the plus side, Generic Math and Static Abstract Interface Member support (for .NET 7.0 onwards) is no longer in preview and is available in the latest version of ILGPU - currently v1.5.1.

There is also some sample code that might meet your requirements for using interfaces: https://github.com/m4rs-mt/ILGPU/blob/master/Samples/StaticAbstractInterfaceMembers/Program.cs
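The pattern in that sample is roughly the following (an untested sketch; the kernel name and body here are illustrative, not copied from the sample):

```csharp
using System.Numerics;
using ILGPU;
using ILGPU.Runtime;

static class GenericMathKernels
{
    // One generic kernel for int, float, double, ... constrained by the
    // .NET 7 INumber<T> interface (Generic Math / static abstract members).
    public static void MultiplyAddKernel<T>(Index1D index, ArrayView<T> view, T scalar)
        where T : unmanaged, INumber<T>
        => view[index] = view[index] * scalar + T.One;
}

// Usage: load a closed instantiation as usual, e.g.
// accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>, float>(
//     GenericMathKernels.MultiplyAddKernel<float>);
```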

En3Tho commented 6 months ago

Generic math works really well! Here is a small snippet in F# if you're interested.

module ILGpu.GenericKernels

open System
open System.Numerics
open ILGPU
open ILGPU.Runtime
open En3Tho.FSharp.Extensions

// define a set of constraints, INumber + ILGpu default ones
type Number<'TNumber
    when 'TNumber: unmanaged
    and 'TNumber: struct
    and 'TNumber: (new: unit -> 'TNumber)
    and 'TNumber :> ValueType
    and 'TNumber :> INumber<'TNumber>> = 'TNumber

module Kernels =

    // use this constraint for generic parameter in the kernel
    let inline executeSomeNumericOperations<'TNumber when Number<'TNumber>> (index: Index1D) (input: ArrayView<'TNumber>) (output: ArrayView<'TNumber>) (scalar: 'TNumber) =
        if index.X < input.Length.i32 then
            output[index] <- (input[index] * scalar + scalar) / scalar - scalar

let runKernel<'T when Number<'T>> (accelerator: Accelerator) scalar (data: 'T[]) =
    use deviceData = accelerator.Allocate1D(data)
    let kernel = accelerator.LoadAutoGroupedStreamKernel(Kernels.executeSomeNumericOperations<'T>)

    kernel.Invoke(Index1D(deviceData.Length.i32), deviceData.View, deviceData.View, scalar)
    deviceData.CopyToCPU(accelerator.DefaultStream, data)

    data |> Array.iteri ^ fun index element -> Console.WriteLine($"{index} = {element}")

let genericMap() =
    use context = Context.CreateDefault()
    let device = context.Devices |> Seq.find ^ fun x -> x.Name.Contains("GTX 1070")
    use accelerator = device.CreateAccelerator(context)

    // run with ints
    runKernel accelerator 10 [| 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; |]
    // and with floats
    runKernel accelerator 10.1f [| 0.1f; 1.1f; 2.1f; 3.1f; 4.1f; 5.1f; 6.1f; 7.1f; 8.1f; 9.1f; |]