huanghua1994 opened this issue 4 years ago
The test file as posted does not compile:
error: use of undeclared identifier 'n_blk'
std::cout << "n = " << n << ", b = " << b << ", n_blk = " << n_blk << std::endl;
Oops, the n_blk bug is fixed now.
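A purely hypothetical sketch of the kind of change that resolves such an error (the real fix is whatever landed in the linked reduction_timing.cpp; n and b are assumed to be the element count and block size):

```cpp
#include <cstddef>
#include <iostream>

void print_config(std::size_t n, std::size_t b) {
  // Hypothetical: n_blk is assumed to be the number of blocks of size b
  // needed to cover n elements; the definition in reduction_timing.cpp may differ.
  const std::size_t n_blk = (n + b - 1) / b;
  std::cout << "n = " << n << ", b = " << b << ", n_blk = " << n_blk << std::endl;
}
```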
Following your report, here is what I observe:
Profiling shows there are three kernels for the floating-point reduction and one kernel for the integer reduction.
The profiling APIs measure only one kernel on an Intel iGPU using the OpenCL backend.
On an Nvidia GPU, the SYCL profiling result is difficult to reconcile with the nvprof results:
Runtime with reduction = 14303 ns
==57761== Profiling application: ./test 1048576 128
==57761== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 91.76% 130.02us 1 130.02us 130.02us 130.02us _ZTSN2cl4sycl6ONEAPI6detail28__sycl_reduction_main_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1ENS0_8accessorIfLi0ELNS0_6access4modeE1026ELNSD_6targetE2014ELNSD_11placeholderE0ENS1_22accessor_property_listIJEEEEEEE
5.24% 7.4240us 1 7.4240us 7.4240us 7.4240us _ZTSN2cl4sycl6ONEAPI6detail27__sycl_reduction_aux_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1ENS0_8accessorIfLi0ELNS0_6access4modeE1026ELNSD_6targetE2014ELNSD_11placeholderE0ENS1_22accessor_property_listIJEEEEEEE
3.00% 4.2560us 1 4.2560us 4.2560us 4.2560us _ZTSN2cl4sycl6ONEAPI6detail27__sycl_reduction_aux_kernelIZZ4mainENKUlRNS0_7handlerEE32_30clES5_EUlNS0_7nd_itemILi1EEERT_E37_10Lb1ELb1EPfEE
API calls: 89.91% 215.31ms 1 215.31ms 215.31ms 215.31ms cuCtxCreate
8.99% 21.520ms 2 10.760ms 371.10us 21.149ms cuMemAllocManaged
0.39% 941.27us 3 313.76us 13.069us 912.82us cuLaunchKernel
0.27% 657.65us 1 657.65us 657.65us 657.65us cuModuleLoadDataEx
0.27% 647.12us 2 323.56us 89.447us 557.67us cuMemFree
0.08% 196.46us 2 98.227us 7.6220us 188.83us cuMemAlloc
0.03% 69.243us 2 34.621us 22.236us 47.007us cuDeviceGetName
0.02% 50.095us 10 5.0090us 2.6110us 9.6370us cuEventRecord
0.01% 16.144us 3 5.3810us 1.5580us 12.565us cuEventSynchronize
0.01% 13.085us 1 13.085us 13.085us 13.085us cuStreamCreate
0.00% 11.241us 10 1.1240us 367ns 4.6120us cuEventCreate
0.00% 10.388us 2 5.1940us 3.1410us 7.2470us cuDeviceGetPCIBusId
0.00% 7.8110us 21 371ns 163ns 653ns cuCtxGetCurrent
0.00% 7.7100us 6 1.2850us 422ns 3.8320us cuModuleGetFunction
0.00% 6.9590us 19 366ns 174ns 1.6670us cuDeviceGetAttribute
0.00% 5.9870us 2 2.9930us 1.6430us 4.3440us cuStreamWaitEvent
0.00% 5.3330us 2 2.6660us 2.2510us 3.0820us cuEventElapsedTime
0.00% 3.8590us 2 1.9290us 606ns 3.2530us cuPointerGetAttribute
0.00% 1.7440us 4 436ns 156ns 785ns cuDeviceGet
0.00% 1.5870us 3 529ns 283ns 877ns cuDeviceGetCount
@huanghua1994 @zjin-lcf @bader
Hi, thanks for posting this. I find that the timing results depend a lot on the device used. In any case, if you try your benchmark with the latest DPC++ commit you may see that the T = float case behaves differently than before, whereas the T = int case is probably unchanged. This is because the float reduction now also uses atomic operations to add to the final reduction variable rather than an auxiliary kernel.

However, as zjin-lcf mentions, the reduction timing that you record still does not correspond to the main reduction kernel. Instead it corresponds to a call to reduSaveFinalResultToUserMem, which is made for the USM case but not when buffers are used. If you switch to buffers for your 'data' and 'sum' variables, you should see that the reduction event timing matches the main reduction kernel time from nvprof.

The general problem with event timings on the CUDA backend appears to be that only the timing of the latest kernel enqueued during the event, 'ev1', is recorded, i.e. reduSaveFinalResultToUserMem, rather than the kernel that takes the most time, i.e. sycl_reduction_main_kernel. We are currently investigating this issue.
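To make the buffer suggestion concrete, here is a rough sketch of what switching 'data' and 'sum' from USM to buffers could look like. This is not the code from the linked test file: the sizes mirror the ./test 1048576 128 run above, and it uses the standard SYCL 2020 sycl::reduction spelling rather than the ONEAPI::reduction interface visible in the mangled kernel names.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr size_t n = 1048576, b = 128;   // matches ./test 1048576 128
  std::vector<float> data(n, 1.0f);
  float sum = 0.0f;

  sycl::queue q{sycl::property::queue::enable_profiling{}};

  // Buffers for 'data' and 'sum' instead of USM allocations.
  sycl::buffer<float> data_buf{data.data(), sycl::range<1>{n}};
  sycl::buffer<float> sum_buf{&sum, sycl::range<1>{1}};

  sycl::event ev1 = q.submit([&](sycl::handler &cgh) {
    sycl::accessor data_acc{data_buf, cgh, sycl::read_only};
    auto red = sycl::reduction(sum_buf, cgh, sycl::plus<float>());
    cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{b}}, red,
                     [=](sycl::nd_item<1> it, auto &acc) {
                       acc += data_acc[it.get_global_id(0)];
                     });
  });
  ev1.wait();

  auto t0 = ev1.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = ev1.get_profiling_info<sycl::info::event_profiling::command_end>();
  sycl::host_accessor result{sum_buf, sycl::read_only};
  std::cout << "sum = " << result[0]
            << ", runtime with reduction = " << (t1 - t0) << " ns\n";
}
```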
Thank you for your feedback. Looking forward to your updates.
@v-klochkov FYI
Update: I think it is a reduction-specific issue, and I expect it affects all backends, not just CUDA. In short, submitting the command group for the reduction parallel_for creates more than one event in the USM case, and the event 'ev1' returned by the call to submit is the latest event created during that call, which corresponds to the call to reduSaveFinalResultToUserMem.
I believe that the resolution for this issue will be discussed internally.
@JackAKirk, thanks for the analysis. @v-klochkov, please, take a look at this issue.
The timing/profiling using events turned out to be a bad surprise; it was not taken into account during development of the reduction. Currently the reduction uses more than one kernel per parallel_for invocation to get better performance. Fixing this issue may require a significant re-work of the reduction implementation.
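For readers wondering why one reduction parallel_for maps to several kernels: a common strategy computes one partial sum per work-group in a first kernel and then combines the partials in a second step. Below is a rough conceptual sketch under assumed names (two_step_sum, wg_size; data is assumed to be a device-accessible USM pointer), not the actual DPC++ implementation, which may use atomics, extra kernels, or device-specific strategies.

```cpp
#include <sycl/sycl.hpp>
#include <numeric>
#include <vector>

// Conceptual two-step reduction: the first kernel produces one partial sum per
// work-group; the second step combines the partial sums (done on the host here,
// played by the "aux" kernel or an atomic add in the real library).
float two_step_sum(sycl::queue &q, const float *data, size_t n, size_t wg_size) {
  const size_t n_groups = (n + wg_size - 1) / wg_size;
  std::vector<float> partial(n_groups, 0.0f);
  {
    sycl::buffer<float> partial_buf{partial.data(), sycl::range<1>{n_groups}};
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor part{partial_buf, cgh, sycl::write_only};
      cgh.parallel_for(
          sycl::nd_range<1>{sycl::range<1>{n_groups * wg_size},
                            sycl::range<1>{wg_size}},
          [=](sycl::nd_item<1> it) {
            size_t gid = it.get_global_id(0);
            float x = (gid < n) ? data[gid] : 0.0f;
            // Kernel 1: reduce within the work-group.
            float group_sum = sycl::reduce_over_group(it.get_group(), x,
                                                      sycl::plus<float>());
            if (it.get_local_id(0) == 0)
              part[it.get_group(0)] = group_sum;
          });
    });
  } // partial_buf is written back to 'partial' here
  // Step 2: combine the per-group partial sums.
  return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```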
I don't think this is as big a problem as you expect -- I would argue that the necessary heavy lifting (if there is any) will be in the event class rather than the reduction class.

Using multiple parallel_for invocations as an implementation detail behind our reductions is fine, as long as we can provide a sycl::event object that represents all of those invocations. Waiting on the event returned from a reduction should wait on all kernels submitted to the device, and querying the profiling information for that event should represent submission of the whole command-group. The SYCL 2020 specification explicitly says that a sycl::event may map to several OpenCL backend events -- there's no guarantee that a single command group maps to a single kernel!
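To make the last point concrete: for the OpenCL backend, SYCL 2020 specifies the native form of a sycl::event as a std::vector<cl_event>, i.e. one SYCL event may cover several backend events. The following sketch (with a hypothetical helper name span_ns and no error handling) computes the span one would expect command_start/command_end to report in that situation.

```cpp
#include <sycl/sycl.hpp>
#include <CL/cl.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// For the OpenCL backend, the native type of sycl::event is std::vector<cl_event>,
// so one SYCL event can wrap several backend events. This helper aggregates the
// per-command profiling by hand, which is roughly what one would expect the SYCL
// event's command_start/command_end to cover.
std::uint64_t span_ns(const sycl::event &ev) {
  std::vector<cl_event> native = sycl::get_native<sycl::backend::opencl>(ev);
  cl_ulong start = ~cl_ulong{0}, end = 0;
  for (cl_event e : native) {
    cl_ulong s = 0, t = 0;
    clGetEventProfilingInfo(e, CL_PROFILING_COMMAND_START, sizeof(s), &s, nullptr);
    clGetEventProfilingInfo(e, CL_PROFILING_COMMAND_END, sizeof(t), &t, nullptr);
    start = std::min(start, s);
    end = std::max(end, t);
  }
  return end - start; // nanoseconds spanning every command behind this event
}
```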
Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s).
@v-klochkov, could I ask you to take one of the following actions? :)
Thanks!
@steffenlarsen, @aelovikov-intel - please re-assign/dispatch. The reduction implementation has accumulated many extra changes since I initially implemented it a few years ago and then left the Scalar SYCL squad.
Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).
@steffenlarsen, could you please take one of the following actions:
Thanks!
Test file: https://github.com/huanghua1994/HPC_Playground/blob/master/SYCL/reduction_timing.cpp
Compiler version: git commit 140c0d04b777d291956a88d59ee47f17c5f448b3
Compiler configuration: buildbot/configure.py --cuda
Selected device: GTX 1070, CUDA version 11.0, driver version 455.38
Problem description: When using the DPC++ reduction library for a float-type add reduction, info::event_profiling::command_{start/end} returned incorrect timings (too small). For an int-type add reduction, the timings are correct.
Sample output when using T = float:
Sample output when using T = int:
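For readers who do not open the linked test file, the measurement presumably looks roughly like the following USM-based sketch. Everything beyond the thread's own details (sizes from ./test 1048576 128, the 'data'/'sum' names, and the profiling queries) is an assumption; in particular it uses the standard SYCL 2020 sycl::reduction spelling, whereas the nvprof kernel names indicate the ONEAPI::reduction interface of that era.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  using T = float;                     // the issue: T = float misreports, T = int is fine
  const size_t n = 1048576, b = 128;   // matches ./test 1048576 128

  sycl::queue q{sycl::property::queue::enable_profiling{}};

  // USM allocations for 'data' and 'sum', as in the original report.
  T *data = sycl::malloc_shared<T>(n, q);
  T *sum  = sycl::malloc_shared<T>(1, q);
  for (size_t i = 0; i < n; ++i) data[i] = T(1);
  *sum = T(0);

  sycl::event ev1 = q.submit([&](sycl::handler &cgh) {
    auto red = sycl::reduction(sum, sycl::plus<T>());
    cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{b}}, red,
                     [=](sycl::nd_item<1> it, auto &acc) {
                       acc += data[it.get_global_id(0)];
                     });
  });
  ev1.wait();

  // The timings reported as too small in the float case come from these queries.
  auto t0 = ev1.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = ev1.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::cout << "sum = " << *sum
            << ", Runtime with reduction = " << (t1 - t0) << " ns" << std::endl;

  sycl::free(data, q);
  sycl::free(sum, q);
}
```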