Tested only with DPC++, as AdaptiveCpp does not support SYCL profiling info with the SSCP compilation flow on Level Zero.
Surprisingly, system time with USM is lower than on the Tesla V100S, which is a data center GPU. Accessors still perform worse than USM with both in-order and out-of-order queues, showing around a 4x slowdown.
The in-order queue achieves a 1.15x and 1.64x speedup with the Accessor and USM benchmarks, respectively.
Tested only with DPC++ as AdaptiveCpp does not support SYCL profiling info with the SSCP compilation flow
Are you sure SSCP is the problem? We have not implemented profiling for the OpenCL or Level Zero backends, but I'm not aware of any limitations that come from SSCP.
This PR adds a benchmark to measure the latency of submitting kernels using USM and the Buffer-Accessors API.
Benchmark principle
The benchmark submits several kernels in a for loop (default: 50,000) with linear dependencies between each other. In the USM benchmark, this is enforced by a `depends_on` call (with accessors, the dependency follows implicitly from accessing the same buffer). Each benchmark is repeated both with an out-of-order queue and an in-order queue to check whether the SYCL runtime applies some optimization to reduce the scheduling time. A minimal sketch of both submission patterns is shown below.
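For illustration only, this is roughly what the two submission patterns look like; it is not the PR's actual code, and names such as `num_kernels` and `usm_data` are placeholders:

```cpp
#include <sycl/sycl.hpp>

int main() {
  constexpr int num_kernels = 50000; // default iteration count from the PR description
  // An in-order queue can instead be requested with sycl::property::queue::in_order{}.
  sycl::queue q{sycl::property::queue::enable_profiling{}};

  // USM variant: linear dependencies enforced explicitly via depends_on.
  int* usm_data = sycl::malloc_device<int>(1, q);
  sycl::event prev = q.submit([&](sycl::handler& cgh) {
    cgh.single_task([=]() { *usm_data = 0; });
  });
  for (int i = 0; i < num_kernels; ++i) {
    prev = q.submit([&](sycl::handler& cgh) {
      cgh.depends_on(prev); // wait for the previous kernel in the chain
      cgh.single_task([=]() { *usm_data += 1; });
    });
  }
  prev.wait();
  sycl::free(usm_data, q);

  // Buffer/accessor variant: a read-write accessor on the same buffer makes the
  // runtime serialize the kernels without any explicit dependency call.
  {
    sycl::buffer<int, 1> buf{sycl::range<1>{1}}; // contents are irrelevant for a latency test
    for (int i = 0; i < num_kernels; ++i) {
      q.submit([&](sycl::handler& cgh) {
        sycl::accessor acc{buf, cgh, sycl::read_write};
        cgh.single_task([=]() { acc[0] += 1; });
      });
    }
    q.wait();
  }
}
```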
Additional info
An in-order queue has been added to `BenchmarkArgs`, mapped on the same device as the out-of-order one (see the sketch below).
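As a rough illustration (the helper name is hypothetical, not from the PR), an in-order queue sharing the device and context of an existing out-of-order queue could be built like this:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical helper: create an in-order queue on the same device and context
// as an existing out-of-order queue, with profiling enabled on both so the
// same metrics can be collected.
sycl::queue make_in_order_twin(const sycl::queue& out_of_order_q) {
  return sycl::queue{
      out_of_order_q.get_context(), out_of_order_q.get_device(),
      sycl::property_list{sycl::property::queue::in_order{},
                          sycl::property::queue::enable_profiling{}}};
}
```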
A new `system-time` profiling metric: it is calculated as the difference between the run-time and the kernel-time, so it includes all the SYCL runtime scheduling work plus the time spent in the benchmark `run` method (a sketch of the calculation follows).
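A rough sketch of how such a metric can be computed, assuming kernel-time is summed from SYCL event profiling info and `run_time_ms` is the wall-clock time measured around the benchmark's `run` method; the function name and parameters are illustrative, not the PR's actual code:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

// Illustrative only: system-time = run-time minus the summed kernel execution
// times. Kernel times come from event profiling info (timestamps in
// nanoseconds); run_time_ms is assumed to be measured around run().
double system_time_ms(const std::vector<sycl::event>& events, double run_time_ms) {
  double kernel_time_ms = 0.0;
  for (const auto& e : events) {
    const auto start =
        e.get_profiling_info<sycl::info::event_profiling::command_start>();
    const auto end =
        e.get_profiling_info<sycl::info::event_profiling::command_end>();
    kernel_time_ms += static_cast<double>(end - start) * 1e-6;
  }
  return run_time_ms - kernel_time_ms;
}
```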
Results and discussion
NVIDIA Tesla V100S
DPC++ and AdaptiveCpp output very similar results, with USM having smaller system time than the Buffer-Accessors. The in-order queue slightly reduces system-time on both APIs, with a greater impact on USM than on the Accessors. Interestingly, system time is way higher for Accessors even compared with out-of-order USM, which has an explicit additional call for dependency tracking.
AMD MI100
For AdaptiveCpp, results are similar to the NVIDIA ones, with the in-order queue saving up to 50% system time compared to the out-of-order one on USM. However, USM out-of-order system time is surprisingly higher compared to Accessors. On the other hand, DPC++ is EXTREMELY slow when using an out-of-order queue + USM, with a system time 5x higher compared to the other benchmarks. Could there be some issues with the dependency tracking API here? Furthermore, kernel time for kernels submitted through the in-order queue is 2x greater compared to the out-of-order one (further investigation is needed).
Conclusions
TODO