Closed: whitneywhtsang closed this issue 1 year ago.
Performance measurements on Intel(R) Iris(R) Xe Graphics:
The code generated by SYCL-MLIR is incorrect for the above benchmarks!
There are bugs in the Gramschmidt benchmark:
--- a/polybench/gramschmidt.cpp
+++ b/polybench/gramschmidt.cpp
@@ -16,8 +16,8 @@ class Gramschmidt2;
class Gramschmidt3;
void init_array(DATA_TYPE* A, size_t size) {
- const auto M = 0;
- const auto N = 0;
+ const auto M = size;
+ const auto N = size;
for(size_t i = 0; i < M; i++) {
for(size_t j = 0; j < N; j++) {
@@ -27,8 +27,8 @@ void init_array(DATA_TYPE* A, size_t size) {
}
void gramschmidt(DATA_TYPE* A, DATA_TYPE* R, DATA_TYPE* Q, size_t size) {
- const auto M = 0;
- const auto N = 0;
+ const auto M = size;
+ const auto N = size;
There are bugs in the Mvt benchmark:
--- a/polybench/mvt.cpp
+++ b/polybench/mvt.cpp
@@ -19,8 +19,8 @@ void init_arrays(DATA_TYPE* a, DATA_TYPE* x1, DATA_TYPE* x2, DATA_TYPE* y_1, DAT
for(size_t i = 0; i < N; i++) {
x1[i] = 0.0;
x2[i] = 0.0;
- y_1[i] = 0.0;
- y_2[i] = 0.0;
+ y_1[i] = 1.0;
+ y_2[i] = 1.0;
for(size_t j = 0; j < N; j++) {
a[i * N + j] = (DATA_TYPE)(i + j + 1.0) / N;
@@ -102,6 +102,10 @@ class Polybench_Mvt {
std::vector<DATA_TYPE> x1_cpu(size);
std::vector<DATA_TYPE> x2_cpu(size);
+ // Trigger writeback
+ x1_buffer.reset();
+ x2_buffer.reset();
+
init_arrays(a.data(), x1_cpu.data(), x2_cpu.data(), y1.data(), y2.data(), size);
With the bugs fixed, SYCL-MLIR can no longer verify these benchmarks.
Created https://github.com/bcosenza/sycl-bench/pull/50 to fix the bugs found for gramschmidt and mvt.
By modifying Gesummv initial values of array A:
--- a/polybench/gesummv.cpp
+++ b/polybench/gesummv.cpp
@@ -33,7 +33,7 @@ void init(DATA_TYPE* A, DATA_TYPE* B, DATA_TYPE* x, size_t size) {
x[i] = 1;
for(size_t j = 0; j < N; j++) {
- A[i * N + j] = 2;
+ A[i * N + j] = i;
B[i * N + j] = 3;
This change exposes bugs in SYCL-MLIR, and the benchmark can no longer be verified.
@whitneywhtsang in this post https://github.com/intel/llvm/issues/8204#issuecomment-1423590224 you mentioned that the benchmarks have bugs. Can you please explain why the benchmark source code is not correct?
I can see that in the gramschmidt benchmark the loop nest below did not execute any iteration. Your proposed change would cause the loop nest to execute size*size iterations (unless the inner loop contains early exits). Does the proposed code work (benchmark verifies) when using the stock SYCL clang compiler? How do you know that M and N should be initialized to size?
--- a/polybench/gramschmidt.cpp
+++ b/polybench/gramschmidt.cpp
@@ -16,8 +16,8 @@ class Gramschmidt2;
class Gramschmidt3;
void init_array(DATA_TYPE* A, size_t size) {
- const auto M = 0;
- const auto N = 0;
+ const auto M = size;
+ const auto N = size;
for(size_t i = 0; i < M; i++) {
for(size_t j = 0; j < N; j++) {
@@ -27,8 +27,8 @@ void init_array(DATA_TYPE* A, size_t size) {
}
I do not understand why the other 2 benchmarks (Mvt and Gesummv) have bugs (from your message). Can you please explain?
@whitneywhtsang in this post https://github.com/intel/llvm/issues/8204#issuecomment-1423590224 you mentioned that the benchmarks have bugs. Can you please explain why the benchmark source code is not correct?
For gramschmidt, without my suggested change, A is never actually initialized, as the loop nest is never entered. It happens that the elements of A are all zeros. In the kernels, multiplications, reductions, etc. on zeros give zeros. In the CPU code for gramschmidt, the loop nests are again not entered, and the elements of A remain zeros. We end up always verifying even if the kernels do nothing at all.
How do you know that M and N should be initialized to size?
I know from the kernels, e.g., cgh.parallel_for<Gramschmidt3>(range<2>(size, 1), [=, M_ = size, N_ = size](.
I do not understand why the other 2 benchmarks (Mvt and Gesummv) have bugs (from your message). Can you please explain?
For mvt, in the kernels x1 and x2 are calculated as x1[i] += a[{i, j}] * y1[j]; and x2[k] += a[{k, l}] * y2[l];. If the elements of y1 and y2 are always zeros, then the elements of x1 and x2 are always zeros too. We end up always verifying even if the kernels do nothing at all.
For gesummv, it is not a bug, but by making the initial values of array A a bit more complex, SYCL-MLIR no longer verifies (only the first element is correctly calculated).
With https://github.com/intel/llvm/pull/8290 and the following changes to SYCLMethodToSYCLCall.cpp, gesummv, gramschmidt, and mvt can all be verified again even after the benchmark changes described in https://github.com/intel/llvm/issues/8204#issuecomment-1423590224.
--- a/mlir-sycl/lib/Transforms/SYCLMethodToSYCLCall.cpp
+++ b/mlir-sycl/lib/Transforms/SYCLMethodToSYCLCall.cpp
@@ -69,15 +69,15 @@ static mlir::Value adaptArgumentForSYCLCall(OpBuilder &Rewriter,
auto Alloca = static_cast<Value>(CreateAlloca(Rewriter));
- // Store the element
- Rewriter.create<memref::StoreOp>(
- Loc, Original, Alloca,
- ValueRange{Rewriter.createOrFold<arith::ConstantIndexOp>(Loc, 0)});
-
- // Cast the memref value to the expected shape
- Alloca = Rewriter.createOrFold<memref::CastOp>(
- Loc, MemRefType::get(TargetShape, ThisType), Alloca);
-
+ if (auto Load = Original.getDefiningOp<memref::LoadOp>())
+ Alloca = Load.getOperand(0);
+ else
+ // Store the element
+ Rewriter.create<memref::StoreOp>(
+ Loc, Original, Alloca,
+ ValueRange{Rewriter.createOrFold<arith::ConstantIndexOp>(Loc, 0)});
+
+ auto CurShape = Alloca.getType().cast<MemRefType>().getShape();
if (Alloca.getType().cast<MemRefType>().getMemorySpaceAsInt() !=
@@ -86,9 +86,14 @@ static mlir::Value adaptArgumentForSYCLCall(OpBuilder &Rewriter,
Alloca = Rewriter.create<LLVM::AddrSpaceCastOp>(
Loc, LLVM::LLVMPointerType::get(ThisType, TargetMemSpace), Alloca);
Alloca = Rewriter.create<polygeist::Pointer2MemrefOp>(
+ Loc, MemRefType::get(CurShape, ThisType, {}, TargetMemSpace), Alloca);
+ }
+
+ // Cast the memref value to the expected shape
+ if (CurShape != TargetShape)
+ Alloca = Rewriter.createOrFold<memref::CastOp>(
Loc, MemRefType::get(TargetShape, ThisType, {}, TargetMemSpace),
Alloca);
- }
if (Alloca.getType() == MT) {
Note that the above changes to SYCLMethodToSYCLCall.cpp are error prone.
In the next meeting, I will explain the bug, and we can discuss a proper fix for it.
With https://github.com/intel/llvm/pull/8290 and the changes described in https://github.com/intel/llvm/issues/8204#issuecomment-1426614138, there are more benchmarks that can be verified.
Performance measurements on Intel(R) Iris(R) Xe Graphics:
SYCL-MLIR and DPC++ are pretty much on par for the above benchmarks at the default optimization level.
With the changes described in https://github.com/intel/llvm/issues/8204#issuecomment-1426614138, 3 extra test suite tests pass.
SYCL :: Basic/buffer/subbuffer.cpp
SYCL :: Basic/host-task-dependency.cpp
SYCL :: KernelFusion/ranged_offset_accessor.cpp
With https://github.com/intel/llvm/pull/8290 as well, 9 extra test suite tests pass, 12 in total. OCL x64:
SYCL :: Basic/access_to_subset.cpp
SYCL :: Basic/buffer/buffer_full_copy.cpp
SYCL :: Basic/buffer/reinterpret.cpp
SYCL :: Basic/buffer/subbuffer.cpp
SYCL :: Basic/host-task-dependency.cpp
SYCL :: Basic/linear-sub_group.cpp
SYCL :: Basic/offset-accessor-get_pointer.cpp
SYCL :: HostInteropTask/host-task-two-queues.cpp
SYCL :: KernelFusion/ranged_offset_accessor.cpp
SYCL :: KernelFusion/three_dimensional.cpp
SYCL :: KernelFusion/two_dimensional.cpp
SYCL :: SubGroup/sub_groups_sycl2020.cpp
OCL GEN9:
SYCL :: Basic/access_to_subset.cpp
SYCL :: Basic/buffer/buffer_full_copy.cpp
SYCL :: Basic/host-task-dependency.cpp
SYCL :: Basic/linear-sub_group.cpp
SYCL :: Basic/offset-accessor-get_pointer.cpp
SYCL :: KernelFusion/ranged_offset_accessor.cpp
SYCL :: KernelFusion/three_dimensional.cpp
SYCL :: KernelFusion/two_dimensional.cpp
L0 GEN9:
SYCL :: Basic/access_to_subset.cpp
SYCL :: Basic/buffer/buffer_full_copy.cpp
SYCL :: Basic/buffer/subbuffer.cpp
SYCL :: Basic/host-task-dependency.cpp
SYCL :: Basic/linear-sub_group.cpp
SYCL :: Basic/offset-accessor-get_pointer.cpp
SYCL :: KernelFusion/ranged_offset_accessor.cpp
SYCL :: KernelFusion/three_dimensional.cpp
SYCL :: KernelFusion/two_dimensional.cpp
SYCL :: Plugin/level_zero_track_indirect_access_memory.cpp
Great progress team! The results described in https://github.com/intel/llvm/issues/8204#issuecomment-1426845743 are a good indication that the MLIR based device compiler is on par (performance wise) with the default clang compiler. Next step is to investigate these bmks to see if there are performance opportunities.
Created 3 branches in https://github.com/whitneywhtsang/sycl-bench:
- sycl-mlir: used as the baseline; contains the fixes in https://github.com/intel/llvm/issues/8204#issuecomment-1423590224 and the run infrastructure.
- scalar_replacement: manually applied scalar replacement on the 13 successful benchmarks.
- all_opts: manually applied any other optimizations (e.g., kernel fusion) on top of scalar_replacement on the 13 successful benchmarks.

There are good performance gains on 7 benchmarks with scalar replacement, and additional gains on 2 benchmarks with other optimizations (the mvt gain is due to kernel fusion). Note: it is not always beneficial to apply kernel fusion, e.g., 3mm degraded 22% with kernel fusion.
Below are the performance gains from doing "scalar replacement on reduction loops" on a modified version of SYCL-Bench. Small modifications were made to the SYCL-Bench source code: https://github.com/whitneywhtsang/sycl-bench/tree/workaround (more work is required to perform the optimization on the original SYCL-Bench). Measured on Intel(R) Iris(R) Xe Graphics with SYCL-MLIR (https://github.com/intel/llvm/commit/e8ae475e37987122e8543b92fb49de66769d5ea2).
To expand on https://github.com/intel/llvm/issues/8204#issuecomment-1462836338: the small source code change required to get the opportunities entails copying the loop upper bound captured in the lambda into a local variable at the top of the kernel.
This is necessary to allow Polygeist transformations that change an scf.while loop into an scf.for loop (and then eventually into an affine.for loop). In order to avoid manual source code changes we will have to develop a transformation that can 'peel' a member from the struct passed to the kernel function ... so the saga continues 😄
Status update: measured the branch with the workaround again with SYCL-MLIR (https://github.com/intel/llvm/commit/39bd89c6a4705b9b9d3e227ab7d6543286cec5d3), no degradation.
Status update:
measured the branch with the workaround again with SYCL-MLIR (https://github.com/intel/llvm/commit/57ecdeba9d0ffdf158984911fde4756eeaf96b30), no degradation.
measured the branch without the workaround with SYCL-MLIR (https://github.com/intel/llvm/commit/57ecdeba9d0ffdf158984911fde4756eeaf96b30); we get the gains mentioned in https://github.com/intel/llvm/issues/8204#issuecomment-1462836338 except for covariance and gramschmidt.
With https://github.com/intel/llvm/pull/8923, we are able to get the gains for covariance and gramschmidt as well.
Status update: measured with SYCL-MLIR (b5a387d3346f008c9df8533e3c0e83d9eafb832f), no regression from manual source code modification.
The machine used for performance measurement was reformatted; we can no longer claim a 16% performance gain on syr2k even with manual source code modifications, as syr2k now fluctuates. Below are the performance gains on the reformatted machine:
| SYCL-Bench | Performance gain |
|---|---|
| 2mm | 12% |
| 3mm | 13% |
| covariance | 50% |
| gemm | 12% |
| gramschmidt | 14% |
| syrk | 5% |
Status update: Checked with https://github.com/intel/llvm/pull/9030; one more benchmark, blocked_transform, is able to compile and run. Performance is similar to DPC++. Checked that all other benchmarks still either fail at compile time or runtime, or fail to verify.
Here is the updated list of successful benchmarks:
2DConvolution 2mm 3mm atax bicg blocked_transform covariance gemm gesummv gramschmidt matmulchain mvt syr2k syrk
@whitneywhtsang should we close this?
Sure. Verified that there are no functional regressions from DPC++ on both CPU and GPU.
This issue is to fix the failures on SYCL-Bench using SYCL-MLIR compiler, and measure performance.
Repository: https://github.com/bcosenza/sycl-bench
To build the benchmarks:
Changes made to CMakeLists.txt to use the SYCL-MLIR compiler:
Below are the benchmarks that have compile failures with the SYCL-MLIR compiler:
To run the successful benchmarks:
Below are the benchmarks that have runtime failures with either SYCL-MLIR or both DPC++ and SYCL-MLIR compilers: