Closed PeterTh closed 1 year ago
What do you mean by Intel backend? Is it Level Zero? My understanding is that the SPIR-V backend implements specialization constants, while the other backends do ahead-of-time compilation and therefore, I assume, cannot. I know that there is some caching mechanism happening at the plugin level; this may result in small differences for runs after the first, though I'm not sure whether that is what happens here.
It's Level Zero:
```
device-name: Intel(R) Arc(TM) A770 Graphics
platform-name: Intel(R) Level-Zero
sycl-implementation: LLVM (Intel DPC++)
```
I did assume the non-specialization-constant bump for the first run is related to some form of caching.
This PR adds a benchmark which measures the performance impact of specialization constants. Note that it also includes minor updates to the build process, and to the way results are printed.
Benchmark Principle
The benchmark runs a basic 2D convolution with a generic 3x3 stencil, of which only 5 points are non-zero.
This seems like a good approximation of a useful and representative use case to me.
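As a plain-C++ sketch (illustrative names only; the actual benchmark kernel is SYCL), the stencil described above might look like this, with only the 5 cross-shaped points of the 3x3 stencil being non-zero:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// A generic 3x3 stencil of which only the 5 cross-shaped points are non-zero.
using Stencil = std::array<float, 9>;
constexpr Stencil five_point{0.f, 1.f, 0.f,
                             1.f, 1.f, 1.f,
                             0.f, 1.f, 0.f};

// Basic 2D convolution over the interior of a w x h image.
std::vector<float> convolve2d(const std::vector<float>& in,
                              std::size_t w, std::size_t h,
                              const Stencil& s) {
    std::vector<float> out(in.size(), 0.f);
    for (std::size_t y = 1; y + 1 < h; ++y)
        for (std::size_t x = 1; x + 1 < w; ++x) {
            float acc = 0.f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    acc += s[(dy + 1) * 3 + (dx + 1)] *
                           in[(y + dy) * w + (x + dx)];
            out[y * w + x] = acc;
        }
    return out;
}
```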
In order to better interpret the quality of the results, the following ways to specify weights are included:
1. `AccessVariants::dynamic_value`
2. `AccessVariants::spec_const_value`
3. `AccessVariants::constexpr_value`

Of these, the expectation would be for option 1. to serve as an upper boundary on execution time, and option 3. as the lower boundary.
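The three variants can be sketched as a compile-time dispatch, templated on the data type and the inner-loop count `IL` (only the `AccessVariants` names come from this PR; the rest is hypothetical structure). In real SYCL code the `spec_const_value` variant would read a specialization constant; in this sketch it falls back to the runtime read, i.e. the dynamic upper-bound path:

```cpp
#include <array>

enum class AccessVariants { dynamic_value, spec_const_value, constexpr_value };

// Sketch of the kernel's weight access, templated on data type T, the
// inner-loop count IL, and the access variant. IL scales execution time
// and arithmetic intensity.
template <typename T, int IL, AccessVariants V>
T apply_weights(const std::array<T, 9>& dynamic_w, T input) {
    constexpr std::array<T, 9> constexpr_w{0, 1, 0, 1, 1, 1, 0, 1, 0};
    T acc{};
    for (int i = 0; i < IL; ++i)
        for (int p = 0; p < 9; ++p) {
            T w;
            if constexpr (V == AccessVariants::constexpr_value)
                w = constexpr_w[p];  // known at compile time: lower boundary
            else
                w = dynamic_w[p];    // runtime read: upper boundary (stand-in
                                     // for the spec-constant read as well)
            acc += w * input;
        }
    return acc;
}
```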
The benchmark is also templated across the data type of the computation, and an inner loop count `IL`. The latter serves the purpose of varying the overall execution time and arithmetic intensity of the kernel.

Results & Discussion
The results turned out to be somewhat more interesting than initially suspected.
What follows is a subset of the measurements on 2 different architectures (and backends), 15 runs each, medians reported.
RTX 3090, using the DPCPP CUDA backend
Potential conclusions from this data:
- Specialization constants help for `int64`s, but not for `fp64`s.
- For `IL1` there is a small performance overhead. This indicates that using specialization constants (which appear to be simply a wrapper around dynamic variables) has a small fixed overhead, the impact of which is diminished for the longer-running `IL16`+ versions.

Intel ARC 770, using the DPCPP Intel backend
Potential conclusions from this data:
- For `int64` only `IL1` is affected, while for `fp32` `IL16` is disproportionately optimized by using specialization constants.

Per-run Execution Times
To further support the conclusions drawn from these results, consider these relative per-run execution times:
We observe the following: