SYCL runtime: Severe host overhead in sycl::get_kernel_bundle

majing921201 commented 1 month ago

For platform compatible, we didn't use device max work group size to launch kernel, and switch to query specific max work group size for kernel by SYCL API. following is our code example

  auto kid = ::sycl::get_kernel_id<KernelClass>();
  auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(
      ctx, {dev}, {kid});
  ::sycl::kernel k = kbundle.get_kernel(kid);
  int max_work_group_size =  k.get_info<::sycl::info::kernel_device_specific::work_group_size>(dev);

We found this usage takes much host overhead in application. we measured one kernel CPU performance here, each API name in table maps example code: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

API | get_kernel_id | get_kernel_bundle | get_kernel | get_info -- | -- | -- | -- | -- time (us) | 0.434 | 42.481 | 4.241 | 1.125

We also file internal jira to track this issue. Can you help evaluate this slow performance.

fengyuan14 commented 1 month ago

https://github.com/intel/torch-xpu-ops/issues/1016

AlexeySachkov commented 1 month ago

Hi @majing921201,

We also file internal jira to track this issue. Can you help evaluate this slow performance.

Is your complaint that get_kernel_bundle<executable> is slow, or that your overall program has slowed down? Do you use JIT, or AOT (i.e. do you use -fsycl-targets and what do you pass there?)? Do you pass that kernel bundle you get later into handler::use_kernel_bundle to make sure that it is being re-used by SYCL RT?

To add some background here which will likely be enough for a high-level explanation of this (but not enough to say what exactly happens in your case):

get_kernel_bundle<executable> performs necessary actions to bring device image with kernels you specified up to executable state and that may involve invoking JIT compiler if you are not using AOT. But even with AOT, it is likely that we still have to call some low-level APIs like compile/build program to be able to query the information SYCL RT was asked about.

get_kernel_bundle<executable> is implicitly used under the hood of queue::submit. We do store final executable device image into in-memory cache, so I expect that if you queried a kernel bundle explicitly, then queue::submit should be quicker, because it doesn't need to repeat the said operation anymore, but instead could just grab a result from in-memory cache. However, I'm not familiar enough with SYCL RT to say for sure and there is always risk of some bugs. In any case I expect handler::use_kernel_bundle to be the most performant option in this case, because it should avoid both in-memory cache lookup and repeating device image processing/handling.

majing921201 commented 1 month ago

Is your complaint that get_kernel_bundle is slow, or that your overall program has slowed down? Do you use JIT, or AOT (i.e. do you use -fsycl-targets and what do you pass there?)? Do you pass that kernel bundle you get later into handler::use_kernel_bundle to make sure that it is being re-used by SYCL RT?

We used aot with 'pvc' as target. And we didn't pass kernel boudle to handler::use_kernel_bundle, Our current routine usage follows the guide in an internal jira discussion.

intel / llvm

SYCL runtime: Severe host overhead in sycl::get_kernel_bundle #15824