intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[SYCL] DPC++ reduction library incorrect event profiling timing #2820

Open huanghua1994 opened 3 years ago

huanghua1994 commented 3 years ago

Test file: https://github.com/huanghua1994/HPC_Playground/blob/master/SYCL/reduction_timing.cpp
Compiler version: git commit 140c0d04b777d291956a88d59ee47f17c5f448b3
Compiler configuration: buildbot/configure.py --cuda
Selected device: GTX 1070, CUDA version 11.0, driver version 455.38

Problem description: When using the DPC++ reduction library for a float-type add reduction, info::event_profiling::command_{start/end} returns incorrect timings (too small). For an int-type add reduction, the timings are correct.

Sample output when using T = float:

$ ./reduction_timing.exe 1048576 128
n = 1048576, b = 128
Runtime with reduction    = 64513 ns
Runtime without reduction = 175105 ns

Sample output when using T = int:

$ ./reduction_timing.exe 1048576 128
n = 1048576, b = 128
Runtime with reduction    = 2096161 ns
Runtime without reduction = 175102 ns
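
For reference, here is a minimal sketch of the kind of measurement involved (not the exact linked test file): it assumes a queue created with the enable_profiling property, USM shared allocations, and the ONEAPI::reduction interface that was current at the time of the report.

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  namespace sycl = cl::sycl;
  const size_t n = 1048576, b = 128;

  // Profiling queries are only valid on queues created with this property.
  sycl::queue q{sycl::gpu_selector{},
                sycl::property::queue::enable_profiling{}};

  float *data = sycl::malloc_shared<float>(n, q);
  float *sum  = sycl::malloc_shared<float>(1, q);
  for (size_t i = 0; i < n; i++) data[i] = 1.0f;
  *sum = 0.0f;

  // Add reduction over n elements with work-group size b.
  sycl::event ev = q.submit([&](sycl::handler &h) {
    h.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{b}},
                   sycl::ONEAPI::reduction(sum, sycl::ONEAPI::plus<float>()),
                   [=](sycl::nd_item<1> it, auto &red) {
                     red += data[it.get_global_id(0)];
                   });
  });
  ev.wait();

  // These two queries produce the "Runtime with reduction" numbers above.
  auto t0 = ev.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = ev.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::cout << "Runtime with reduction    = " << (t1 - t0) << " ns" << std::endl;

  sycl::free(data, q);
  sycl::free(sum, q);
  return 0;
}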
zjin-lcf commented 3 years ago

error: use of undeclared identifier 'n_blk'
    std::cout << "n = " << n << ", b = " << b << ", n_blk = " << n_blk << std::endl;

huanghua1994 commented 3 years ago

Oops, the n_blk bug is fixed now

zjin-lcf commented 3 years ago

Following your report, here is what I observe:

Profiling shows there are three kernels for the floating-point reduction and one kernel for the integer reduction. On an Intel iGPU using the OpenCL backend, the profiling APIs measure only one kernel.
On an NVIDIA GPU, the event profiling result is difficult to reconcile with what nvprof reports:

Runtime with    reduction = 14303 ns
==57761== Profiling application: ./test 1048576 128
==57761== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   91.76%  130.02us         1  130.02us  130.02us  130.02us  _ZTSN2cl4sycl6ONEAPI6detail28__sycl_reduction_main_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1ENS0_8accessorIfLi0ELNS0_6access4modeE1026ELNSD_6targetE2014ELNSD_11placeholderE0ENS1_22accessor_property_listIJEEEEEEE
                     5.24%  7.4240us         1  7.4240us  7.4240us  7.4240us  _ZTSN2cl4sycl6ONEAPI6detail27__sycl_reduction_aux_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1ENS0_8accessorIfLi0ELNS0_6access4modeE1026ELNSD_6targetE2014ELNSD_11placeholderE0ENS1_22accessor_property_listIJEEEEEEE
                     3.00%  4.2560us         1  4.2560us  4.2560us  4.2560us  _ZTSN2cl4sycl6ONEAPI6detail27__sycl_reduction_aux_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1EPfEE
      API calls:   89.91%  215.31ms         1  215.31ms  215.31ms  215.31ms  cuCtxCreate
                    8.99%  21.520ms         2  10.760ms  371.10us  21.149ms  cuMemAllocManaged
                    0.39%  941.27us         3  313.76us  13.069us  912.82us  cuLaunchKernel
                    0.27%  657.65us         1  657.65us  657.65us  657.65us  cuModuleLoadDataEx
                    0.27%  647.12us         2  323.56us  89.447us  557.67us  cuMemFree
                    0.08%  196.46us         2  98.227us  7.6220us  188.83us  cuMemAlloc
                    0.03%  69.243us         2  34.621us  22.236us  47.007us  cuDeviceGetName
                    0.02%  50.095us        10  5.0090us  2.6110us  9.6370us  cuEventRecord
                    0.01%  16.144us         3  5.3810us  1.5580us  12.565us  cuEventSynchronize
                    0.01%  13.085us         1  13.085us  13.085us  13.085us  cuStreamCreate
                    0.00%  11.241us        10  1.1240us     367ns  4.6120us  cuEventCreate
                    0.00%  10.388us         2  5.1940us  3.1410us  7.2470us  cuDeviceGetPCIBusId
                    0.00%  7.8110us        21     371ns     163ns     653ns  cuCtxGetCurrent
                    0.00%  7.7100us         6  1.2850us     422ns  3.8320us  cuModuleGetFunction
                    0.00%  6.9590us        19     366ns     174ns  1.6670us  cuDeviceGetAttribute
                    0.00%  5.9870us         2  2.9930us  1.6430us  4.3440us  cuStreamWaitEvent
                    0.00%  5.3330us         2  2.6660us  2.2510us  3.0820us  cuEventElapsedTime
                    0.00%  3.8590us         2  1.9290us     606ns  3.2530us  cuPointerGetAttribute
                    0.00%  1.7440us         4     436ns     156ns     785ns  cuDeviceGet
                    0.00%  1.5870us         3     529ns     283ns     877ns  cuDeviceGetCount
JackAKirk commented 3 years ago

@huanghua1994 @zjin-lcf @bader

Hi, thanks for posting this. I find that the timing results depend a lot on the device used. In any case, if you try your benchmark with the latest DPC++ commit, you may see that the T = float case behaves differently than it did before, whereas the T = int case is probably unchanged. This is because the float reduction now also uses atomic operations for adding to the final reduction variable rather than an auxiliary kernel.

However, as zjin-lcf mentions, the reduction timing you record still does not correspond to the main reduction kernel. Instead it corresponds to a call to reduSaveFinalResultToUserMem that is made for the USM case but not when using buffers. If you switch to using buffers for your 'data' and 'sum' variables, you should see that the reduction event timing gives the same value as the main reduction kernel from nvprof.

The general problem with event timings on the CUDA backend appears to be that only the timing for the latest kernel enqueued during the event, 'ev1', is recorded, i.e. reduSaveFinalResultToUserMem, rather than the kernel which takes the most time, i.e. sycl_reduction_main_kernel. We are currently investigating this issue.
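
For illustration, a sketch of the buffer-based variant described above (buffer and variable names here are hypothetical, not taken from the linked test):

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  namespace sycl = cl::sycl;
  const size_t n = 1048576, b = 128;

  sycl::queue q{sycl::gpu_selector{},
                sycl::property::queue::enable_profiling{}};

  std::vector<float> data_host(n, 1.0f);
  float sum_host = 0.0f;
  {
    // Buffers instead of USM; with buffers the event timing is reported to
    // match the main reduction kernel seen in nvprof.
    sycl::buffer<float, 1> data_buf{data_host.data(), sycl::range<1>{n}};
    sycl::buffer<float, 1> sum_buf{&sum_host, sycl::range<1>{1}};

    sycl::event ev1 = q.submit([&](sycl::handler &h) {
      auto data_acc = data_buf.get_access<sycl::access::mode::read>(h);
      auto sum_acc  = sum_buf.get_access<sycl::access::mode::read_write>(h);
      h.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{b}},
                     sycl::ONEAPI::reduction(sum_acc, sycl::ONEAPI::plus<float>()),
                     [=](sycl::nd_item<1> it, auto &red) {
                       red += data_acc[it.get_global_id(0)];
                     });
    });
    ev1.wait();

    auto t0 = ev1.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto t1 = ev1.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::cout << "Runtime with reduction = " << (t1 - t0) << " ns" << std::endl;
  } // buffer destruction copies the result back into sum_host
  std::cout << "sum = " << sum_host << std::endl;
  return 0;
}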

zjin-lcf commented 3 years ago

Thank you for your feedback. I look forward to your updates.

romanovvlad commented 3 years ago

@v-klochkov FYI

JackAKirk commented 3 years ago

Update: I think this is a reduction-specific issue and I expect it affects all backends, not just CUDA. In short, the submission of the command group for the reduction parallel_for creates more than one event in the USM case, and the event 'ev1' returned by the call to submit is the latest event created during that call, which corresponds to the call to reduSaveFinalResultToUserMem.

I believe that the resolution for this issue will be discussed internally.

bader commented 3 years ago

@JackAKirk, thanks for the analysis. @v-klochkov, please, take a look at this issue.

v-klochkov commented 3 years ago

The timing/profiling using events turned out to be a bad surprise; it was not taken into account during development of the reduction. Currently the reduction uses more than one kernel per parallel_for invocation to get better performance. Fixing this issue may require significant rework of the reduction implementation.

Pennycook commented 3 years ago

I don't think this is as big a problem as you expect -- I would argue that the necessary heavy lifting (if there is any) will be in the event class rather than the reduction class.

Using multiple parallel_for invocations as an implementation detail behind our reductions is fine, as long as we can provide a sycl::event object that represents all of those invocations. Waiting on the event returned from a reduction should wait on all kernels submitted to the device, and querying the profiling information for that event should represent submission of the whole command-group. The SYCL 2020 specification explicitly says that a sycl::event may map to several OpenCL backend events -- there's no guarantee that a single command group maps to a single kernel!
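
As a concrete illustration of that expectation, here is a hypothetical helper (not actual DPC++ runtime code) computing the profiling window a user-visible event should report when one command group launches several backend kernels:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One entry per backend kernel launched for a single command group, e.g.
// __sycl_reduction_main_kernel, aux kernels, reduSaveFinalResultToUserMem.
struct KernelTiming {
  uint64_t start_ns;
  uint64_t end_ns;
};

// The command_start/command_end pair exposed through the sycl::event should
// span the earliest start and the latest end of all those kernels, so that
// (end - start) reflects the whole command group rather than only the last
// kernel that happened to be enqueued.
std::pair<uint64_t, uint64_t>
command_group_window(const std::vector<KernelTiming> &kernels) {
  uint64_t start = UINT64_MAX, end = 0;
  for (const auto &k : kernels) {
    start = std::min(start, k.start_ns);
    end = std::max(end, k.end_ns);
  }
  return {start, end};
}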

KornevNikita commented 4 months ago

Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s).

@v-klochkov, could I ask you to take one of the following actions? :)

Thanks!

v-klochkov commented 4 months ago

@steffenlarsen, @aelovikov-intel - please re-assign/dispatch. The reduction implementation has received many changes since I initially implemented it a few years ago and left the Scalar SYCL squad.

github-actions[bot] commented 2 months ago

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@steffenlarsen, could you please take one of the following actions:

Thanks!

github-actions[bot] commented 2 weeks ago

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@steffenlarsen, could you please take one of the following actions:

Thanks!