intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[SYCL-MLIR] SYCL-Bench investigation #8204

Closed whitneywhtsang closed 1 year ago

whitneywhtsang commented 1 year ago

This issue tracks fixing the failures on SYCL-Bench with the SYCL-MLIR compiler and measuring performance.

Repository: https://github.com/bcosenza/sycl-bench

To build the benchmarks:

mkdir build; cd build
cmake -DSYCL_IMPL=LLVM -DCMAKE_CXX_COMPILER=/iusers/waihungt/llvm/build/bin/clang++ ..
cmake --build .

Changes made to CMakeLists.txt to use the SYCL-MLIR compiler:

--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -42,7 +42,7 @@ if(SYCL_IMPL STREQUAL "ComputeCpp")
 elseif(SYCL_IMPL STREQUAL "hipSYCL")
   find_package(hipSYCL CONFIG REQUIRED)
 elseif(SYCL_IMPL STREQUAL "LLVM")
-  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl")
+  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl -fsycl-targets=spir64-unknown-unknown-syclmlir -w")

Below are the benchmarks that have compile failures with the SYCL-MLIR compiler:

-  single-kernel/median.cpp
-  single-kernel/scalar_prod.cpp
-  single-kernel/sobel.cpp
-  single-kernel/sobel5.cpp
-  single-kernel/sobel7.cpp
-  single-kernel/mol_dyn.cpp
-  single-kernel/nbody.cpp
-  pattern/segmentedreduction.cpp
-  pattern/reduction.cpp
-  runtime/dag_task_throughput_sequential.cpp
-  runtime/dag_task_throughput_independent.cpp
-  runtime/blocked_transform.cpp
-  runtime/matmulchain.cpp

To run the successful benchmarks:

for benchmark in 2DConvolution 2mm 3mm atax bicg covariance gemm gesummv gramschmidt matmulchain mvt syr2k syrk; do
  echo "Running $benchmark"
  ./$benchmark --device=gpu --output=output.csv
done

Below are the benchmarks that have runtime failures, either with the SYCL-MLIR compiler alone or with both DPC++ and SYCL-MLIR:

#CLANG FAIL/UNSUPPORTED: 3DConvolution correlation DRAM fdtd2d host_device_bandwidth
#In addition:
#CGEIST FAIL TO RUN: arith kmeans lin_reg_coeff lin_reg_error local_mem pattern_L2 sf vec_add
whitneywhtsang commented 1 year ago

Performance measurements on Intel(R) Iris(R) Xe Graphics (results attached as images).

The code generated by SYCL-MLIR is incorrect for the above benchmarks!

whitneywhtsang commented 1 year ago

There are bugs in the Gramschmidt benchmark:

--- a/polybench/gramschmidt.cpp
+++ b/polybench/gramschmidt.cpp
@@ -16,8 +16,8 @@ class Gramschmidt2;
 class Gramschmidt3;

 void init_array(DATA_TYPE* A, size_t size) {
-       const auto M = 0;
-       const auto N = 0;
+       const auto M = size;
+       const auto N = size;

        for(size_t i = 0; i < M; i++) {
                for(size_t j = 0; j < N; j++) {
@@ -27,8 +27,8 @@ void init_array(DATA_TYPE* A, size_t size) {
 }

 void gramschmidt(DATA_TYPE* A, DATA_TYPE* R, DATA_TYPE* Q, size_t size) {
-       const auto M = 0;
-       const auto N = 0;
+       const auto M = size;
+       const auto N = size;

There are bugs in the Mvt benchmark:

--- a/polybench/mvt.cpp
+++ b/polybench/mvt.cpp
@@ -19,8 +19,8 @@ void init_arrays(DATA_TYPE* a, DATA_TYPE* x1, DATA_TYPE* x2, DATA_TYPE* y_1, DAT
        for(size_t i = 0; i < N; i++) {
                x1[i] = 0.0;
                x2[i] = 0.0;
-               y_1[i] = 0.0;
-               y_2[i] = 0.0;
+               y_1[i] = 1.0;
+               y_2[i] = 1.0;

                for(size_t j = 0; j < N; j++) {
                        a[i * N + j] = (DATA_TYPE)(i + j + 1.0) / N;
@@ -102,6 +102,10 @@ class Polybench_Mvt {
                std::vector<DATA_TYPE> x1_cpu(size);
                std::vector<DATA_TYPE> x2_cpu(size);

+               // Trigger writeback
+               x1_buffer.reset();
+               x2_buffer.reset();
+
                init_arrays(a.data(), x1_cpu.data(), x2_cpu.data(), y1.data(), y2.data(), size);

After fixing the bugs, SYCL-MLIR no longer verifies for these benchmarks. Created https://github.com/bcosenza/sycl-bench/pull/50 to fix the bugs found in gramschmidt and mvt.
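
For context, the reset() calls in the mvt fix work because of SYCL's buffer write-back semantics: a buffer constructed over host memory only writes its results back when it is destroyed (or reset, when held in a smart pointer as in the benchmark). Below is a minimal sketch of that pattern, with illustrative names rather than the benchmark source:

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t size = 1024;
  std::vector<float> x1(size, 0.0f);
  sycl::queue q;
  {
    // Buffer over host memory: results are only guaranteed to be written back
    // to x1 when the buffer is destroyed (or reset).
    sycl::buffer<float, 1> x1_buffer{x1.data(), sycl::range<1>{size}};
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor acc{x1_buffer, cgh, sycl::write_only};
      cgh.parallel_for(sycl::range<1>{size},
                       [=](sycl::id<1> i) { acc[i] = 1.0f; });
    });
  } // buffer destroyed here: device results are written back into x1
  // Only after this point can host-side verification safely read x1.
  return x1[0] == 1.0f ? 0 : 1;
}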

By modifying the initial values of array A in Gesummv:

--- a/polybench/gesummv.cpp
+++ b/polybench/gesummv.cpp
@@ -33,7 +33,7 @@ void init(DATA_TYPE* A, DATA_TYPE* B, DATA_TYPE* x, size_t size) {
                x[i] = 1;

                for(size_t j = 0; j < N; j++) {
-                       A[i * N + j] = 2;
+                       A[i * N + j] = i;
                        B[i * N + j] = 3;

This exposes bugs in SYCL-MLIR, and the benchmark can no longer be verified.

etiotto commented 1 year ago

@whitneywhtsang in this post https://github.com/intel/llvm/issues/8204#issuecomment-1423590224 you mentioned that the bmks have bugs. Can you pls explain the reason the bmk source code is not correct?

I can see that in the gramschmidt bmk the loop nest below did not execute any iteration. Your proposed change would cause the loop nest to execute size*size iterations (unless the inner loop contains early exits). Does the proposed code work (bmk verifies) when using the stock sycl clang compiler? How do you know that M and N should be initialized to size?

--- a/polybench/gramschmidt.cpp
+++ b/polybench/gramschmidt.cpp
@@ -16,8 +16,8 @@ class Gramschmidt2;
 class Gramschmidt3;

 void init_array(DATA_TYPE* A, size_t size) {
-       const auto M = 0;
-       const auto N = 0;
+       const auto M = size;
+       const auto N = size;

        for(size_t i = 0; i < M; i++) {
                for(size_t j = 0; j < N; j++) {
@@ -27,8 +27,8 @@ void init_array(DATA_TYPE* A, size_t size) {
 }

I do not understand why the other 2 bmks (Mvt and Gesummv) have bugs (from your msg). Can you please explain it?

whitneywhtsang commented 1 year ago

@whitneywhtsang in this post https://github.com/intel/llvm/issues/8204#issuecomment-1423590224 you mentioned that the bmks have bugs. Can you pls explain the reason the bmk source code is not correct?

For gramschmidt, without my suggested change, A is actually never initialized, because the loop nest is never entered. As a result, all elements of A happen to be zero. In the kernels, multiplications, reductions, etc. on zeros give zeros. In the CPU reference code for gramschmidt, the loop nests are again never entered, so the elements of A remain zero. We end up always verifying even if we do nothing at all in the kernels.

How do you know that M and N should be initialized to size ?

I know from the kernels, e.g., cgh.parallel_for<Gramschmidt3>(range<2>(size, 1), [=, M_ = size, N_ = size](.

I do not understand the why the other 2 bmks (Mvt, and Gesummv) have bugs (from your msg). Can you please explain it ?

For mvt, in the kernels x1 and x2 are computed as x1[i] += a[{i, j}] * y1[j]; and x2[k] += a[{k, l}] * y2[l];. If the elements of y1 and y2 are always zero, then the elements of x1 and x2 are always zero too. We end up always verifying even if we do nothing at all in the kernels.

For gesummv, it is not a bug, but making the initial values of array A a bit more complex means SYCL-MLIR no longer verifies (only the first element is correctly calculated).
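
To make the failure mode concrete, here is a small illustrative sketch (hypothetical helper, not the benchmark code) of why all-zero data verifies trivially:

#include <cmath>
#include <cstddef>
#include <vector>

// If both the CPU reference result and the device result are all zeros, an
// element-wise comparison like this passes no matter what the kernel did.
bool verify(const std::vector<float> &expected, const std::vector<float> &actual,
            float tol = 1e-5f) {
  for (size_t i = 0; i < expected.size(); ++i)
    if (std::fabs(expected[i] - actual[i]) > tol)
      return false;
  return true;
}

int main() {
  std::vector<float> expected(1024, 0.0f); // reference loops never entered: stays 0
  std::vector<float> actual(1024, 0.0f);   // kernel output for zero inputs: stays 0
  return verify(expected, actual) ? 0 : 1; // "passes" even for a no-op kernel
}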

whitneywhtsang commented 1 year ago

With https://github.com/intel/llvm/pull/8290 and the following changes to SYCLMethodToSYCLCall.cpp, gesummv, gramschmidt, and mvt can all be verified again, even after the benchmark changes described in https://github.com/intel/llvm/issues/8204#issuecomment-1423590224.

--- a/mlir-sycl/lib/Transforms/SYCLMethodToSYCLCall.cpp
+++ b/mlir-sycl/lib/Transforms/SYCLMethodToSYCLCall.cpp
@@ -69,15 +69,15 @@ static mlir::Value adaptArgumentForSYCLCall(OpBuilder &Rewriter,

   auto Alloca = static_cast<Value>(CreateAlloca(Rewriter));

-  // Store the element
-  Rewriter.create<memref::StoreOp>(
-      Loc, Original, Alloca,
-      ValueRange{Rewriter.createOrFold<arith::ConstantIndexOp>(Loc, 0)});
-
-  // Cast the memref value to the expected shape
-  Alloca = Rewriter.createOrFold<memref::CastOp>(
-      Loc, MemRefType::get(TargetShape, ThisType), Alloca);
-
+  if (auto Load = Original.getDefiningOp<memref::LoadOp>())
+    Alloca = Load.getOperand(0);
+  else
+    // Store the element
+    Rewriter.create<memref::StoreOp>(
+        Loc, Original, Alloca,
+        ValueRange{Rewriter.createOrFold<arith::ConstantIndexOp>(Loc, 0)});
+
+  auto CurShape = Alloca.getType().cast<MemRefType>().getShape();
   if (Alloca.getType().cast<MemRefType>().getMemorySpaceAsInt() !=
@@ -86,9 +86,14 @@ static mlir::Value adaptArgumentForSYCLCall(OpBuilder &Rewriter,
     Alloca = Rewriter.create<LLVM::AddrSpaceCastOp>(
         Loc, LLVM::LLVMPointerType::get(ThisType, TargetMemSpace), Alloca);
     Alloca = Rewriter.create<polygeist::Pointer2MemrefOp>(
+        Loc, MemRefType::get(CurShape, ThisType, {}, TargetMemSpace), Alloca);
+  }
+
+  // Cast the memref value to the expected shape
+  if (CurShape != TargetShape)
+    Alloca = Rewriter.createOrFold<memref::CastOp>(
         Loc, MemRefType::get(TargetShape, ThisType, {}, TargetMemSpace),
         Alloca);
-  }

   if (Alloca.getType() == MT) {

Note that the above changes to SYCLMethodToSYCLCall.cpp are error-prone. In the next meeting, I will explain the bug and we can discuss a proper fix for it.

whitneywhtsang commented 1 year ago

With https://github.com/intel/llvm/pull/8290 and the changes described in https://github.com/intel/llvm/issues/8204#issuecomment-1426614138, there are more benchmarks that can be verified. Performance measurements on Intel(R) Iris(R) Xe Graphics (screenshot attached): SYCL-MLIR and DPC++ are pretty much on par for the above benchmarks at the default optimization level.


whitneywhtsang commented 1 year ago

With the changes described in https://github.com/intel/llvm/issues/8204#issuecomment-1426614138, 3 extra test-suite tests pass.

  SYCL :: Basic/buffer/subbuffer.cpp
  SYCL :: Basic/host-task-dependency.cpp
  SYCL :: KernelFusion/ranged_offset_accessor.cpp

With https://github.com/intel/llvm/pull/8290 as well, 9 more test-suite tests pass, 12 in total. OCL x64:

  SYCL :: Basic/access_to_subset.cpp
  SYCL :: Basic/buffer/buffer_full_copy.cpp
  SYCL :: Basic/buffer/reinterpret.cpp
  SYCL :: Basic/buffer/subbuffer.cpp
  SYCL :: Basic/host-task-dependency.cpp
  SYCL :: Basic/linear-sub_group.cpp
  SYCL :: Basic/offset-accessor-get_pointer.cpp
  SYCL :: HostInteropTask/host-task-two-queues.cpp
  SYCL :: KernelFusion/ranged_offset_accessor.cpp
  SYCL :: KernelFusion/three_dimensional.cpp
  SYCL :: KernelFusion/two_dimensional.cpp
  SYCL :: SubGroup/sub_groups_sycl2020.cpp

OCL GEN9:

  SYCL :: Basic/access_to_subset.cpp
  SYCL :: Basic/buffer/buffer_full_copy.cpp
  SYCL :: Basic/host-task-dependency.cpp
  SYCL :: Basic/linear-sub_group.cpp
  SYCL :: Basic/offset-accessor-get_pointer.cpp
  SYCL :: KernelFusion/ranged_offset_accessor.cpp
  SYCL :: KernelFusion/three_dimensional.cpp
  SYCL :: KernelFusion/two_dimensional.cpp

L0 GEN9:

  SYCL :: Basic/access_to_subset.cpp
  SYCL :: Basic/buffer/buffer_full_copy.cpp
  SYCL :: Basic/buffer/subbuffer.cpp
  SYCL :: Basic/host-task-dependency.cpp
  SYCL :: Basic/linear-sub_group.cpp
  SYCL :: Basic/offset-accessor-get_pointer.cpp
  SYCL :: KernelFusion/ranged_offset_accessor.cpp
  SYCL :: KernelFusion/three_dimensional.cpp
  SYCL :: KernelFusion/two_dimensional.cpp
  SYCL :: Plugin/level_zero_track_indirect_access_memory.cpp
etiotto commented 1 year ago

Great progress, team! The results described in https://github.com/intel/llvm/issues/8204#issuecomment-1426845743 are a good indication that the MLIR-based device compiler is on par (performance-wise) with the default clang compiler. The next step is to investigate these bmks to see if there are performance opportunities.

whitneywhtsang commented 1 year ago

Created 3 branches in https://github.com/whitneywhtsang/sycl-bench:

  1. sycl-mlir: used as baseline, contains fixes in https://github.com/intel/llvm/issues/8204#issuecomment-1423590224 and run infrastructure.
  2. scalar_replacement: manually applied scalar replacement on the 13 successful benchmarks.
  3. all_opts: manually applied any other optimizations (e.g., kernel fusion) on top of scalar_replacement on the 13 successful benchmarks.

There are good performance gains on 7 benchmarks with scalar replacement, and additional gains on 2 benchmarks with the other optimizations (the mvt gain is due to kernel fusion); results are attached as a screenshot. Note: it is not always beneficial to apply kernel fusion, e.g., 3mm degraded 22% with kernel fusion.
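
For reference, here is roughly what the manual scalar replacement looks like on a gemm-style reduction loop (a hypothetical kernel for illustration, not the actual SYCL-Bench source): the partial sum is kept in a local scalar and stored to global memory once, instead of reading and writing the accessor element on every iteration.

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 256;
  std::vector<float> a(N * N, 1.0f), b(N * N, 1.0f), c(N * N, 0.0f);
  sycl::queue q;
  {
    sycl::buffer<float, 1> a_buf{a.data(), sycl::range<1>{N * N}};
    sycl::buffer<float, 1> b_buf{b.data(), sycl::range<1>{N * N}};
    sycl::buffer<float, 1> c_buf{c.data(), sycl::range<1>{N * N}};
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor A{a_buf, cgh, sycl::read_only};
      sycl::accessor B{b_buf, cgh, sycl::read_only};
      sycl::accessor C{c_buf, cgh, sycl::write_only};
      cgh.parallel_for(sycl::range<2>{N, N}, [=](sycl::item<2> it) {
        const size_t i = it[0], j = it[1];
        // Before: C[i * N + j] += A[i * N + k] * B[k * N + j]; inside the loop.
        float sum = 0.0f;                   // scalar replacement of C[i * N + j]
        for (size_t k = 0; k < N; ++k)
          sum += A[i * N + k] * B[k * N + j];
        C[i * N + j] = sum;                 // single store after the loop
      });
    });
  }
  return 0;
}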

whitneywhtsang commented 1 year ago

Below are the performance gains from doing "scalar replacement on reduction loops" on a modified version of SYCL-Bench (results attached as an image). Small modifications are made to the SYCL-Bench source code: https://github.com/whitneywhtsang/sycl-bench/tree/workaround (more work is required to perform the optimization on the original SYCL-Bench). Measured on Intel(R) Iris(R) Xe Graphics with SYCL-MLIR (https://github.com/intel/llvm/commit/e8ae475e37987122e8543b92fb49de66769d5ea2).

etiotto commented 1 year ago

To expand on https://github.com/intel/llvm/issues/8204#issuecomment-1462836338 the small source code change required to get the opportunities entails copying the loop upper bound captured in the lambda into a local variable at the top of the kernel.

This is necessary to allow polygeist transformations that change a scf.while loop into a scf.for loop (and then eventually into an affine.for loop). In order to avoid manual source code changes, we will have to develop a transformation that can 'peel' a member from the struct passed to the kernel function ... so the saga continues 😄
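
A hedged sketch of that one-line manual change (illustrative kernel and names, not the actual SYCL-Bench source). The bound N_ is captured by the lambda, so it lives in the closure struct passed to the kernel; copying it into a local at the top of the kernel body gives the loop a plain scalar bound that Polygeist can use to raise the scf.while loop to scf.for and then affine.for:

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t size = 1024;
  std::vector<float> a(size * size, 1.0f), x(size, 0.0f);
  sycl::queue q;
  {
    sycl::buffer<float, 1> a_buf{a.data(), sycl::range<1>{size * size}};
    sycl::buffer<float, 1> x_buf{x.data(), sycl::range<1>{size}};
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor A{a_buf, cgh, sycl::read_only};
      sycl::accessor X{x_buf, cgh, sycl::read_write};
      cgh.parallel_for(sycl::range<1>{size}, [=, N_ = size](sycl::id<1> id) {
        const size_t i = id[0];
        const size_t N = N_;             // workaround: copy the captured bound
        float acc = X[i];
        for (size_t j = 0; j < N; ++j)   // bound is now a kernel-local scalar
          acc += A[i * N + j];
        X[i] = acc;
      });
    });
  }
  return 0;
}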

whitneywhtsang commented 1 year ago

Status update: measured the branch with the workaround again with SYCL-MLIR (https://github.com/intel/llvm/commit/39bd89c6a4705b9b9d3e227ab7d6543286cec5d3), no degradation.

whitneywhtsang commented 1 year ago

Status update: measured the branch with the workaround again with SYCL-MLIR (https://github.com/intel/llvm/commit/57ecdeba9d0ffdf158984911fde4756eeaf96b30); no degradation. Measured the branch without the workaround with SYCL-MLIR (https://github.com/intel/llvm/commit/57ecdeba9d0ffdf158984911fde4756eeaf96b30); we get the gains mentioned in https://github.com/intel/llvm/issues/8204#issuecomment-1462836338 except for covariance and gramschmidt.

With https://github.com/intel/llvm/pull/8923, we are able to get the gains for covariance and gramschmidt as well.

whitneywhtsang commented 1 year ago

Status update: measured with SYCL-MLIR (b5a387d3346f008c9df8533e3c0e83d9eafb832f); no regression from the manual source code modifications. The machine used for performance measurement has been reformatted, and we can no longer claim the 16% performance gain on syr2k even with manual source code modifications; syr2k now fluctuates. Below are the performance gains on the reformatted machine:

| SYCL-Bench | Performance gain |
| --- | --- |
| 2mm | 12% |
| 3mm | 13% |
| covariance | 50% |
| gemm | 12% |
| gramschmidt | 14% |
| syrk | 5% |
whitneywhtsang commented 1 year ago

Status update: checked with https://github.com/intel/llvm/pull/9030, one more benchmark (blocked_transform) is able to compile and run. Performance is similar to DPC++. Checked that all other benchmarks still either fail at compile time or run time, or fail to verify. Here is the updated list of successful benchmarks: 2DConvolution 2mm 3mm atax bicg blocked_transform covariance gemm gesummv gramschmidt matmulchain mvt syr2k syrk

victor-eds commented 1 year ago

@whitneywhtsang should we close this?

whitneywhtsang commented 1 year ago

@whitneywhtsang should we close this?

Sure. Verified that there are no functional regressions from DPC++ on both CPU and GPU.