[SYCL 2020] USM - Accessors latency benchmark, and minor updates

Luigi-Crisci commented 1 year ago

This PR adds a benchmark to measure the latency of submitting kernels using USM and the Buffer-Accessors API.

Benchmark principle

The benchmark submits a sequence of kernels in a for loop (default: 50,000), each depending linearly on the previous one. In the USM benchmark, this ordering is enforced by an explicit depends_on call. Each benchmark is run with both an out-of-order queue and an in-order queue to check whether the SYCL runtime applies any optimization that reduces scheduling time.

Additional info

- Added an in-order queue to BenchmarkArgs, mapped to the same device as the out-of-order one.
- Added the system-time profiling metric, calculated as the difference between run-time and kernel-time. It covers all SYCL runtime scheduling overhead plus the time spent in the benchmark's run method.
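To make the system-time definition concrete, here is a minimal host-only sketch of the arithmetic. The names `Nanoseconds` and `system_time` are hypothetical and are not taken from this PR. The sketch also assumes that kernel-time is the sum of the individual kernel durations, which is an assumption about how the metric is aggregated rather than something stated above.

```cpp
#include <chrono>
#include <vector>

// Hypothetical alias, for illustration only.
using Nanoseconds = std::chrono::nanoseconds;

// Hypothetical helper: system time = run time - kernel time.
// ASSUMPTION: kernel time is the sum of all per-kernel durations.
// What remains is everything that is not kernel execution:
// runtime scheduling overhead plus the host-side run method.
Nanoseconds system_time(Nanoseconds run_time,
                        const std::vector<Nanoseconds>& kernel_times) {
  Nanoseconds kernel_total{0};
  for (const Nanoseconds t : kernel_times) {
    kernel_total += t;
  }
  return run_time - kernel_total;
}
```

In a SYCL program, the per-kernel durations would typically come from event profiling queries (command end minus command start) on a queue created with profiling enabled, while the run time would be measured on the host around the whole submission loop.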

Results and discussion

NVIDIA Tesla V100S

DPC++ and AdaptiveCpp produce very similar results, with USM showing a smaller system time than Buffer-Accessors. The in-order queue slightly reduces system time for both APIs, with a greater impact on USM than on Accessors. Interestingly, system time is much higher for Accessors even compared with out-of-order USM, which makes an explicit additional call for dependency tracking.

AMD MI100

For AdaptiveCpp, results are similar to the NVIDIA ones, with the in-order queue saving up to 50% of system time compared to the out-of-order one on USM. However, USM out-of-order system time is surprisingly higher than that of Accessors. DPC++, on the other hand, is extremely slow when using an out-of-order queue with USM, with a system time 5x higher than in the other benchmarks. Could there be an issue with the dependency tracking API here? Furthermore, kernel time for kernels submitted through the in-order queue is 2x greater than with the out-of-order one (further investigation is needed).

Conclusions

- USM has lower scheduling overhead compared to the Buffer-Accessors API
  - Always true on NVIDIA
  - True only with in-order queue on AMD
- In-order queue does not seem to give any scheduling advantage with the Buffer-Accessors API
- Explicit dependency tracking on AMD is surprisingly slow

TODO

Testing on Intel

Luigi-Crisci commented 1 year ago

Intel Arc 770

Tested only with DPC++, as AdaptiveCpp does not support SYCL profiling info with the SSCP compilation flow on Level Zero.
Surprisingly, system time with USM is lower than on the Tesla V100S, which is a data center GPU. Accessors still perform worse than USM with both in-order and out-of-order queues, with a slowdown of around 4x.
The in-order queue achieves 1.15x and 1.64x speedups with the Accessor and USM benchmarks, respectively.

illuhad commented 1 year ago

> Tested only with DPC++ as AdaptiveCpp does not support SYCL profiling info with the SSCP compilation flow

Are you sure SSCP is the problem? We have not implemented profiling for the OpenCL or Level Zero backends, but I'm not aware of any limitations that come from SSCP.

unisa-hpc / sycl-bench