For platform compatibility, we do not launch kernels with the device's maximum work-group size. Instead, we query the kernel-specific maximum work-group size through the SYCL API. Our code is as follows:
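
    // Look up the identifier of the kernel named by KernelClass.
    auto kid = ::sycl::get_kernel_id<KernelClass>();
    // Get an executable bundle containing that kernel for this context and device.
    auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(
        ctx, {dev}, {kid});
    // Extract the kernel object from the bundle.
    ::sycl::kernel k = kbundle.get_kernel(kid);
    // Query the maximum work-group size for this kernel on this device.
    int max_work_group_size = k.get_info<::sycl::info::kernel_device_specific::work_group_size>(dev);
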
We found that this usage adds significant host overhead in the application. We measured the CPU time of each call for one kernel; each API name in the table corresponds to a call in the example code:

| API | get_kernel_id | get_kernel_bundle | get_kernel | get_info |
| -- | -- | -- | -- | -- |
| time (us) | 0.434 | 42.481 | 4.241 | 1.125 |
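
For anyone who wants to reproduce or interpret numbers like these, here is a minimal sketch in standard C++. It is not the harness used for the measurements above. `mean_time_us` and `share_of_total` are hypothetical helper names introduced only for illustration. The sketch assumes each API call can be wrapped in a callable and timed with `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the original report): runs `fn` `iterations`
// times and returns the mean wall-clock time per call in microseconds.
template <typename Fn>
double mean_time_us(Fn&& fn, std::size_t iterations) {
  assert(iterations > 0);
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    fn();
  }
  const auto stop = clock::now();
  const std::chrono::duration<double, std::micro> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}

// Hypothetical helper: fraction of the summed time spent in the entry at
// `index`. Assumes the entries are per-call times of calls made back to back.
double share_of_total(const std::vector<double>& times_us, std::size_t index) {
  assert(index < times_us.size());
  const double total = std::accumulate(times_us.begin(), times_us.end(), 0.0);
  return times_us[index] / total;
}
```

A call such as `mean_time_us([&] { kbundle.get_kernel(kid); }, 1000)` would time one of the steps, with the caveat that a runtime may cache results, so repeated calls can be cheaper than the first one. Applying `share_of_total` to the table values `{0.434, 42.481, 4.241, 1.125}` with index 1 gives about 0.88. If the four calls run in sequence, `get_kernel_bundle` therefore accounts for roughly 88% of the 48.281 us total.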