Closed PeterTh closed 1 year ago
What do you mean by Intel backend? Is it Level Zero? My understanding is that the SPIR-V backend implements specialization constants, while the other backends do ahead-of-time compilation and therefore, I assume, cannot. I know that there is some caching mechanism happening at the plugin level; this may result in small differences for runs after the first, though I'm not sure whether that is what happens here.
It's Level Zero:
```
device-name: Intel(R) Arc(TM) A770 Graphics
platform-name: Intel(R) Level-Zero
sycl-implementation: LLVM (Intel DPC++)
```
I did assume the non-specialization-constant bump for the first run is related to some form of caching.
This PR adds a benchmark which measures the performance impact of specialization constants. Note that it also includes minor updates to the build process, and to the way results are printed.
Benchmark Principle
The benchmark runs a basic 2D convolution with a generic 3x3 stencil, of which only 5 points are non-zero.
This seems like a good approximation of a useful and representative use case to me.
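As a plain-C++ sketch (illustrative names only; the actual benchmark kernel is SYCL), the stencil described above might look like this, with only the 5 cross-shaped points of the 3x3 stencil being non-zero:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// A generic 3x3 stencil of which only the 5 cross-shaped points are non-zero.
using Stencil = std::array<float, 9>;
constexpr Stencil five_point{0.f, 1.f, 0.f,
                             1.f, 1.f, 1.f,
                             0.f, 1.f, 0.f};

// Basic 2D convolution over the interior of a w x h image.
std::vector<float> convolve2d(const std::vector<float>& in,
                              std::size_t w, std::size_t h,
                              const Stencil& s) {
    std::vector<float> out(in.size(), 0.f);
    for (std::size_t y = 1; y + 1 < h; ++y)
        for (std::size_t x = 1; x + 1 < w; ++x) {
            float acc = 0.f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    acc += s[(dy + 1) * 3 + (dx + 1)] *
                           in[(y + dy) * w + (x + dx)];
            out[y * w + x] = acc;
        }
    return out;
}
```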
In order to better interpret the quality of the results, the following ways to specify weights are included:
1. `AccessVariants::dynamic_value`
2. `AccessVariants::spec_const_value`
3. `AccessVariants::constexpr_value`

Of these, the expectation would be for option 1. to serve as an upper boundary on execution time, and option 3. as the lower boundary.
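The three variants can be sketched as a compile-time dispatch, templated on the data type and the inner-loop count `IL` (only the `AccessVariants` names come from this PR; the rest is hypothetical structure). In real SYCL code the `spec_const_value` variant would read a specialization constant; in this sketch it falls back to the runtime read, i.e. the dynamic upper-bound path:

```cpp
#include <array>

enum class AccessVariants { dynamic_value, spec_const_value, constexpr_value };

// Sketch of the kernel's weight access, templated on data type T, the
// inner-loop count IL, and the access variant. IL scales execution time
// and arithmetic intensity.
template <typename T, int IL, AccessVariants V>
T apply_weights(const std::array<T, 9>& dynamic_w, T input) {
    constexpr std::array<T, 9> constexpr_w{0, 1, 0, 1, 1, 1, 0, 1, 0};
    T acc{};
    for (int i = 0; i < IL; ++i)
        for (int p = 0; p < 9; ++p) {
            T w;
            if constexpr (V == AccessVariants::constexpr_value)
                w = constexpr_w[p];  // known at compile time: lower boundary
            else
                w = dynamic_w[p];    // runtime read: upper boundary (stand-in
                                     // for the spec-constant read as well)
            acc += w * input;
        }
    return acc;
}
```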
The benchmark is also templated across the data type of the computation, and an inner loop count `IL`. The latter serves the purpose of varying the overall execution time and arithmetic intensity of the kernel.

Results & Discussion
The results turned out to be somewhat more interesting than initially suspected.
What follows is a subset of the measurements on 2 different architectures (and backends), 15 runs each, medians reported.
RTX 3090, using the DPCPP CUDA backend
Potential conclusions from this data:
- Specialization constants help for `int64`s, but not for `fp64`s.
- For `IL1` there is a small performance overhead. This indicates that using specialization constants (which appear to be simply a wrapper around dynamic variables) has a small fixed overhead, the impact of which is diminished for the longer-running `IL16`+ versions.

Intel ARC 770, using the DPCPP Intel backend
Potential conclusions from this data:
- For `int64` only `IL1` is affected, while for `fp32` `IL16` is disproportionately optimized by using specialization constants.

Per-run Execution Times
To further support the conclusions drawn from these results, consider these relative per-run execution times:
We observe the following: