unisa-hpc / sycl-bench

SYCL Benchmark Suite
BSD 3-Clause "New" or "Revised" License

Add specialization constant convolution benchmark, other minor updates #59

Closed PeterTh closed 1 year ago

PeterTh commented 1 year ago

This PR adds a benchmark which measures the performance impact of specialization constants. Note that it also includes minor updates to the build process, and to the way results are printed.

Benchmark Principle

The benchmark runs a basic 2D convolution with a generic 3x3 stencil, of which only 5 points are non-zero.
This seems like a good approximation of a useful and representative use case to me.

In order to better interpret the quality of the results, the following ways to specify weights are included:

  1. fully dynamically (AccessVariants::dynamic_value),
  2. as specialization constants (AccessVariants::spec_const_value), or
  3. statically at compile time (AccessVariants::constexpr_value)

Of these, option 1 is expected to give an upper bound on execution time, and option 3 a lower bound.
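To make the difference between the dynamic and the compile-time variant concrete, here is a minimal host-side sketch in plain C++. It is not the kernel from this PR: the function names, the weight values, and the way `IL` simply repeats the accumulation are all assumptions chosen for illustration. The specialization-constant variant needs a SYCL runtime and is therefore not shown.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical weights: a 3x3 stencil whose four corners are zero,
// leaving 5 non-zero points (centre plus its direct neighbours).
template <typename T>
constexpr std::array<T, 9> example_weights() {
    return std::array<T, 9>{{T(0), T(1), T(0),
                             T(1), T(2), T(1),
                             T(0), T(1), T(0)}};
}

// Shared loop nest. IL (assumed meaning: inner loop count) repeats the
// accumulation to scale run time and arithmetic intensity.
// Border cells are left at zero.
template <typename T, int IL, typename Weights>
std::vector<T> convolve(const std::vector<T>& in, int width, int height,
                        const Weights& weights) {
    assert(static_cast<int>(in.size()) == width * height);
    std::vector<T> out(in.size(), T(0));
    for (int y = 1; y + 1 < height; ++y) {
        for (int x = 1; x + 1 < width; ++x) {
            T acc = T(0);
            for (int il = 0; il < IL; ++il)
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        acc += weights[(dy + 1) * 3 + (dx + 1)] *
                               in[(y + dy) * width + (x + dx)];
            out[y * width + x] = acc;
        }
    }
    return out;
}

// Variant 1 (dynamic): weights arrive as runtime data, so the compiler
// cannot assume which of them are zero.
template <typename T, int IL>
std::vector<T> convolve_dynamic(const std::vector<T>& in, int width, int height,
                                const std::array<T, 9>& weights) {
    return convolve<T, IL>(in, width, height, weights);
}

// Variant 3 (constexpr): weights are compile-time constants, so multiplications
// by zero can in principle be removed after unrolling.
template <typename T, int IL>
std::vector<T> convolve_constexpr(const std::vector<T>& in, int width, int height) {
    constexpr std::array<T, 9> weights = example_weights<T>();
    return convolve<T, IL>(in, width, height, weights);
}
```

Calling `convolve_dynamic<double, 4>(in, w, h, weights)` and `convolve_constexpr<double, 4>(in, w, h)` on the same input should give identical outputs; only the information available to the compiler differs. A specialization constant sits between the two: its value is fixed at runtime, but before the kernel is JIT-compiled.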

The benchmark is also templated on the data type of the computation and on an inner loop count IL. The latter is used to vary the overall execution time and arithmetic intensity of the kernel.

Results & Discussion

The results turned out to be somewhat more interesting than initially suspected.
What follows is a subset of the measurements on 2 different architectures (and backends), with 15 runs each and medians reported.
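For reference, a median over a set of per-run times can be computed as in the sketch below. The helper name `median` is an assumption for illustration and is not taken from the sycl-bench code base; with 15 runs the sample count is odd, so the median is simply the 8th-smallest time.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Median of a set of run times (hypothetical helper, not from sycl-bench).
// Takes the vector by value because nth_element reorders its input.
double median(std::vector<double> times) {
    assert(!times.empty());
    const auto mid = times.size() / 2;
    std::nth_element(times.begin(), times.begin() + mid, times.end());
    const double upper = times[mid];
    if (times.size() % 2 == 1) {
        return upper;  // odd count: the middle element is the median
    }
    // Even count: average the two middle elements. After nth_element, the
    // lower middle is the largest element in the first half.
    const double lower = *std::max_element(times.begin(), times.begin() + mid);
    return 0.5 * (lower + upper);
}
```

Reporting the median rather than the mean keeps a single slow run, such as a first run that includes compilation or cache warm-up, from skewing the result.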

RTX 3090, using the DPCPP CUDA backend

[Figures: 3090_int64, 3090_fp64]

Potential conclusions from this data:

Intel ARC 770, using the DPCPP Intel backend

[Figures: 770_int64, 770_fp32]

Potential conclusions from this data:

Per-run Execution Times

To further support the conclusions drawn from these results, consider these relative per-run execution times:

[Figure: per_run]

We observe the following:

bcosenza commented 1 year ago

What do you mean by Intel backend? Is it Level Zero? My understanding is that the SPIR-V backend implements specialization constants, while other backends do ahead-of-time compilation and therefore, I assume, cannot. I know that there is some caching mechanism at the plugin level; this may result in a small difference for runs after the first, though I'm not sure whether that happens here.

PeterTh commented 1 year ago

It's Level Zero:

device-name: Intel(R) Arc(TM) A770 Graphics
platform-name: Intel(R) Level-Zero
sycl-implementation: LLVM (Intel DPC++)

I did assume the non-specialization-constant bump for the first run is related to some form of caching.