unisa-hpc / sycl-bench

SYCL Benchmark Suite
BSD 3-Clause "New" or "Revised" License

Issue with --local command line parameter. #49

Open DitiD opened 2 years ago

DitiD commented 2 years ago

While executing the test case blocked_transform, which is located under runtime (https://github.com/bcosenza/sycl-bench/blob/master/runtime/blocked_transform.cpp), we get a core dump.

Command used to execute - ./blocked_transform --device=gpu

Output -

** Results for Runtime_BlockedTransform_iter_64_blocksize_0**
problem-size: 3072
local-size: 1024
device-name: NVIDIA RTX A4000
sycl-implementation: LLVM CUDA (Codeplay)
blocked_transform: /tmp/llvm-sycl-nightly-20220222/sycl/source/detail/scheduler/commands.cpp:1826: void cl::sycl::detail::adjustNDRangePerKernel(cl::sycl::detail::NDRDescT&, cl::sycl::detail::pi::PiKernel, const cl::sycl::detail::device_impl&): Assertion `NDR.NumWorkGroups[0] != 0 && NDR.LocalSize[0] == 0' failed.
Aborted (core dumped)

However, when we explicitly set the --local parameter to 256 (which is the default value) at run time, it executes without any errors.

Command used to execute - ./blocked_transform --device=gpu --local=256

Is there a fix for this issue? If so, where can we get the revised code?

DitiD commented 2 years ago

Hi, we would like to know if there are any solutions for the above-mentioned issue. Thanks.

illuhad commented 2 years ago

I cannot reproduce the issue with hipSYCL. Your output indicates that somehow a block size of 0 enters the benchmark. This value is derived from the local size. I had a quick look at the code paths, and I don't understand how this could happen - it does not for me. There's an assert that checks that the block size is non-zero. Can you check what happens when compiling with debug assertions enabled?
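To illustrate the kind of check meant here, the following is a minimal sketch, not the actual sycl-bench code. The names `is_valid_block_size` and `num_blocks` are hypothetical. It contrasts an `assert`, which only fires when `NDEBUG` is not defined, with an explicit check that also works in release builds.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration; these names do not come from sycl-bench.

// Explicit runtime check that stays active even when NDEBUG is defined.
bool is_valid_block_size(std::size_t problem_size, std::size_t block_size) {
  return block_size != 0 && block_size <= problem_size;
}

// Number of blocks needed to cover problem_size items, rounded up.
// The assert only fires in builds where NDEBUG is not defined; callers
// should use is_valid_block_size() first to reject a block size of 0.
std::size_t num_blocks(std::size_t problem_size, std::size_t block_size) {
  assert(block_size != 0 && "block size must be non-zero");
  return (problem_size + block_size - 1) / block_size;
}
```

With the values from the output above, `num_blocks(3072, 1024)` yields 3 blocks, while `is_valid_block_size(3072, 0)` rejects the block size of 0 that appears in the benchmark name.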

DitiD commented 2 years ago

Hi, we are not working with hipSYCL. The issue we are facing occurs at run time. The test case fails to execute when we do not pass the local parameter (that is, when it takes the default value of 256).

Command being used to execute - ./blocked_transform --device=gpu

However, it works fine when we explicitly set the local parameter to 256 at run time.

Command being used to execute - ./blocked_transform --device=gpu --local=256

We are not sure why this issue is occurring.

Thanks.

illuhad commented 2 years ago

> Hi, we are not working with hipSYCL. The issue we are facing occurs at run time.

I'm aware of this. But I don't have an installation of the DPC++ SYCL implementation with the CUDA backend here. I'm just saying I cannot reproduce this with my setup. And I don't understand why DPC++ and hipSYCL would behave differently here anyway. The error does not seem to be related to SYCL-specific functionality.

> The test case fails to execute when we do not pass the local parameter (that is, when it takes the default value of 256).
>
> Command being used to execute - ./blocked_transform --device=gpu
>
> However, it works fine when we explicitly set the local parameter to 256 at run time.
>
> Command being used to execute - ./blocked_transform --device=gpu --local=256

I understood this. As I've said, I cannot reproduce it here. Command line option handling is the same for DPC++ and hipSYCL. For further investigation into the issue, I asked you the following:

> There's an assert that checks that the block size is non-zero. Can you check what happens when compiling with debug assertions enabled?

i.e. make sure that the NDEBUG macro is not set when building.
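For reference, defining `NDEBUG` before `<cassert>` is included turns `assert` into a no-op, so the macro must be absent for assertions to fire. Release configurations commonly pass `-DNDEBUG` on the compiler command line. A small sketch for checking which state a build is in follows; the function name `assertions_enabled` is made up for this example.

```cpp
#include <cassert>

// Hypothetical helper: reports at compile time whether assert() is active
// in this translation unit. assert() is disabled whenever NDEBUG is defined
// at the point where <cassert> is included.
constexpr bool assertions_enabled() {
#ifdef NDEBUG
  return false;  // NDEBUG defined: assert(expr) expands to nothing
#else
  return true;   // NDEBUG not defined: assert(expr) aborts when expr is false
#endif
}
```

Printing the result of `assertions_enabled()` once at program start is a quick way to confirm that the build in use really has debug assertions switched on.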

DitiD commented 2 years ago

Hi, as suggested, I've added the following to blocked_transform.cpp and rebuilt it.

#include

#define NDEBUG

It seems that, by default, the local size is taken as 1024 (please see the attached screenshot below).

[Attached screenshot: Capture]

However, when I explicitly set '--local' to either 256 (the default value) or 1024, it works fine.

Command being used:

./blocked_transform --device=gpu --local=256
./blocked_transform --device=gpu --local=1024
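One way to narrow down where the 1024 comes from would be to log the effective local size right after argument parsing, with and without the flag. The sketch below is only an assumption about how such parsing could look; `parse_local_size` is a hypothetical function, not the sycl-bench command line handling, and the fallback of 256 is the default value mentioned in this thread.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parser, not the sycl-bench implementation: returns the value
// given as "--local=N" if it is present and numeric, otherwise the fallback.
std::size_t parse_local_size(const std::vector<std::string>& args,
                             std::size_t fallback) {
  const std::string prefix = "--local=";
  for (const std::string& arg : args) {
    // Match arguments that start with the "--local=" prefix.
    if (arg.compare(0, prefix.size(), prefix) == 0) {
      try {
        return static_cast<std::size_t>(std::stoull(arg.substr(prefix.size())));
      } catch (const std::exception&) {
        return fallback;  // empty or non-numeric value: use the default
      }
    }
  }
  return fallback;  // flag not given: use the default
}
```

If a helper like this returned 256 for `{"--device=gpu"}` while the benchmark still reported a local size of 1024, the value would have to be changed somewhere after parsing.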

Could this be a bug in the code? Thanks.

DitiD commented 1 year ago

Hi, is there any update regarding this issue? Thanks.