Implemented GPU OpenCL runtime

AndreyPavlenko commented 2 months ago

How to use:

  // Create a builder. The 'module' argument is an MLIR module
  // with a single function to be executed.
  OclModuleBuilder builder(module);
  // Build the module with one of the build() methods, which take
  // either a runtime (preferred), an OpenCL device/context pair, or a queue.
  // The module is built for each device/context pair and cached.
  auto mod = gcGetOrReport(builder.build(device, context));
  // Create an execution context. The 'queue' argument is an OpenCL queue.
  OclContext ctx(mod->runtime, queue);
  // Create an executor.
  if (mod->isStatic) {
    // If all the function arguments are memrefs with static shapes,
    // use this one.
    StaticExecutor exec(mod);
    // Add the function arguments - aligned memory buffers.
    exec.arg(buf0);
    exec.arg(buf1);
    exec.arg(buf2);
    // Execute the function.
    exec(ctx);

    // Or in a single line
    exec(ctx, buf0, buf1, buf2);
  } else {
    // Dynamic shapes are not currently supported.
    DynamicExecutor exec(mod);
    exec.arg(buf0, 2, shape, strides);
    exec.arg(buf1, 2, shape, strides);
    exec.arg(buf2, 2, shape, strides);
    exec(ctx);
  }
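
The DynamicExecutor calls above pass a rank, a shape and strides for each buffer. As a rough sketch of how such strides could be derived, assuming contiguous row-major buffers and strides counted in elements (as in MLIR memref descriptors), the helper below computes them from a shape. The name rowMajorStrides and the use of std::vector are illustrative assumptions, not part of this PR's API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): strides, in elements, of a
// contiguous row-major buffer with the given shape.
std::vector<std::int64_t> rowMajorStrides(const std::vector<std::int64_t> &shape) {
  std::vector<std::int64_t> strides(shape.size());
  std::int64_t step = 1;
  // Walk from the innermost dimension outwards: the innermost stride is 1,
  // and each outer stride is the product of all inner dimension sizes.
  for (std::size_t i = shape.size(); i-- > 0;) {
    assert(shape[i] >= 0 && "dimension sizes must be non-negative");
    strides[i] = step;
    step *= shape[i];
  }
  return strides;
}
```

For a 64x64 buffer this yields the strides {64, 1}, which would be the values passed alongside a shape of {64, 64} and a rank of 2.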

See the unit test. Depends on #333 and #329.
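
The note that the module is built for each device/context pair and cached can be illustrated with a minimal sketch. It assumes a cache keyed by the two handle values; BuiltModule, ModuleCache and getOrBuild are hypothetical stand-ins and do not reflect how OclModuleBuilder is actually implemented:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-in for a compiled module.
struct BuiltModule {
  int buildId;
};

// Hypothetical cache: one built module per (device, context) pair.
// Not thread-safe; a real implementation would need synchronization.
class ModuleCache {
public:
  std::shared_ptr<BuiltModule> getOrBuild(const void *device, const void *context) {
    // Key on the integer values of the opaque handles.
    Key key{reinterpret_cast<std::uintptr_t>(device),
            reinterpret_cast<std::uintptr_t>(context)};
    auto it = cache.find(key);
    if (it != cache.end())
      return it->second; // Already built for this pair.
    // The "build" step is simulated by bumping a counter.
    auto mod = std::make_shared<BuiltModule>(BuiltModule{++builds});
    cache.emplace(key, mod);
    return mod;
  }

  int buildCount() const { return builds; }

private:
  using Key = std::pair<std::uintptr_t, std::uintptr_t>;
  std::map<Key, std::shared_ptr<BuiltModule>> cache;
  int builds = 0;
};
```

Calling getOrBuild twice with the same pair returns the same object, while a different device or context triggers a new build.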

dchigarev commented 1 month ago

I think I'm more or less good with the changes (OV integration works with this runtime). The only thing that keeps me from merging this PR is that we have to temporarily disable XeGPU tests in GC until the gpu-runner is merged.

@AndreyPavlenko, is there a way to merge this PR and keep both the gc-cpu-runner and GPURuntime tests working? Maybe some temporary option in gc-gpu-pipeline that disables new passes (something like legacyOCLRuntime=true)?

Also, @kurapov-peter, what do you think about merging this one without a tedious review? I think it's the last puzzle piece that keeps us from claiming that OV integration works on the GC main branch (technically we also need this one, but we already have an approval there).

AndreyPavlenko commented 1 month ago

The runner is already implemented and the tests pass with it. There are not many changes - https://github.com/intel/graph-compiler/pull/362/commits/9adeab8771817a887c287b61d7d48877b8e800cf

AndreyPavlenko commented 1 month ago

> I wonder if we really need another logger along with the llvm's one

To be honest, I don't like LLVM's logger. This one is easier to use:

gcLogD("This is a debug message");

VS

LLVM_DEBUG(llvm::dbgs() << "This is a debug message\n");

Also, in debug builds, it prints the file and line number, which makes in-IDE navigation convenient: a single click on a log message jumps to the corresponding line.

[DEBUG] [/path/to/graph-compiler/lib/gc/ExecutionEngine/GPURuntime/ocl/GpuOclRuntime.cpp:432] Created new OpenCL context: 0x560643946d20
[DEBUG] [/path/to/graph-compiler/lib/gc/ExecutionEngine/GPURuntime/ocl/GpuOclRuntime.cpp:507] Created new OpenCL command queue: 0x560642affff0
[DEBUG] [/path/to/graph-compiler/lib/gc/ExecutionEngine/GPURuntime/ocl/GpuOclRuntime.cpp:523] Allocated 16384 bytes of device USM memory: 0xff00fffffffe0000

But, if required, I could integrate this logger with LLVM's. It's quite simple.
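
A logger of this shape can be sketched in a few lines. This is only an illustration under stated assumptions, not the PR's gcLogD: formatRecord and SKETCH_LOG_D are hypothetical names, and dropping the location when NDEBUG is defined is a guess at how a non-debug build might behave:

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical formatter: builds "[LEVEL] [file:line] message".
// Passing file == nullptr omits the location part.
inline std::string formatRecord(const char *level, const char *file, int line,
                                const std::string &msg) {
  assert(level != nullptr && "a level name is required");
  std::ostringstream os;
  os << '[' << level << "] ";
  if (file)
    os << '[' << file << ':' << line << "] ";
  os << msg;
  return os.str();
}

// Hypothetical macro: debug builds capture the call site through
// __FILE__ and __LINE__; builds with NDEBUG print the message only.
#ifndef NDEBUG
#define SKETCH_LOG_D(msg) \
  (std::cerr << formatRecord("DEBUG", __FILE__, __LINE__, (msg)) << '\n')
#else
#define SKETCH_LOG_D(msg) \
  (std::cerr << formatRecord("DEBUG", nullptr, 0, (msg)) << '\n')
#endif
```

With this sketch, SKETCH_LOG_D("Created new OpenCL context") writes a line to stderr in the same "[DEBUG] [file:line] message" layout as the excerpt above, which is the pattern IDEs typically turn into clickable links.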

AndreyPavlenko commented 1 month ago

> Maybe some temporary option in gc-gpu-pipeline that disables new passes (something like legacyOCLRuntime=true)?

Added the use-gpu-ocl pipeline option.

Implemented GPU OpenCL runtime #343