unisa-hpc / sycl-bench

SYCL Benchmark Suite
BSD 3-Clause "New" or "Revised" License

[SYCL 2020] USM - Accessors latency benchmark, and minor updates #62

Closed Luigi-Crisci closed 11 months ago

Luigi-Crisci commented 1 year ago

This PR adds a benchmark to measure the latency of submitting kernels using USM and the Buffer-Accessors API.

Benchmark principle

The benchmark submits a sequence of kernels in a for loop (default: 50,000), each depending on the previous one. In the USM benchmark, this dependency is enforced by a depends_on call. Each benchmark is run with both an out-of-order queue and an in-order queue to check whether the SYCL runtime applies any optimization that reduces scheduling time.

Additional info

Results and discussion

NVIDIA Tesla V100S

DPC++ and AdaptiveCpp produce very similar results, with USM showing lower system time than Buffer-Accessors. The in-order queue slightly reduces system time for both APIs, with a greater impact on USM than on Accessors. Interestingly, system time is far higher for Accessors even compared with out-of-order USM, which makes an additional explicit call for dependency tracking.

AMD MI100

For AdaptiveCpp, results are similar to the NVIDIA ones, with the in-order queue saving up to 50% of system time compared to the out-of-order queue on USM. However, USM out-of-order system time is surprisingly higher than that of Accessors. On the other hand, DPC++ is EXTREMELY slow when using an out-of-order queue with USM, with a system time 5x higher than in the other benchmarks. Could there be some issue with the dependency tracking API here? Furthermore, kernel time for kernels submitted through the in-order queue is 2x greater than with the out-of-order one (further investigation is needed).

Conclusions

TODO

Luigi-Crisci commented 1 year ago

Intel Arc 770

Tested only with DPC++, as AdaptiveCpp does not support SYCL profiling info with the SSCP compilation flow on Level Zero.
Surprisingly, system time with USM is lower than on the Tesla V100S, which is a data center GPU. Accessors still perform worse than USM with both in-order and out-of-order queues, showing around a 4x slowdown.
The in-order queue achieves 1.15x and 1.64x speedups in the Accessor and USM benchmarks, respectively.

illuhad commented 1 year ago

Tested only with DPC++ as AdaptiveCPP does not support SYCL profiling info with the SSCP compilation flow

Are you sure SSCP is the problem? We have not implemented profiling for the OpenCL or Level Zero backends, but I'm not aware of any limitations that come from SSCP.