
[Level Zero] sycl::parallel_for with ranges larger than INT_MAX deadlocks or aborts #4255

masterleinad opened this issue 2 years ago (status: Open)

masterleinad commented 2 years ago

Describe the bug

Running

#include <climits>   // for INT_MAX
#include <iostream>
#include <CL/sycl.hpp>

int main(int, char**) {
   cl::sycl::default_selector device_selector;
   cl::sycl::queue queue(device_selector);
   std::cout << "Running on "
             << queue.get_device().get_info<cl::sycl::info::device::name>()
             << "\n";
   size_t N = INT_MAX; //breaks for CUDA
   // size_t N = 5000000000; // breaks for Intel
   sycl::range<1> range(N+1);
   auto parallel_for_event = queue.submit([&](sycl::handler& cgh) {
     cgh.parallel_for(range, [=](sycl::item<1> /*item*/) {});
   });

   return 0;
}

deadlocks on CUDA devices or gives

C++ exception with description "PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)" thrown in the test body.

on Intel GPUs when compiled and run via

clang++ -fsycl -fsycl-unnamed-lambda -fno-sycl-id-queries-fit-in-int -fsycl-targets=nvptx64-nvidia-cuda-sycldevice dummy.cc && ./a.out

or, respectively, via

clang++ -fsycl -fsycl-unnamed-lambda -fno-sycl-id-queries-fit-in-int dummy.cc && ./a.out

Environment:

zjin-lcf commented 2 years ago

Running on Intel(R) UHD Graphics P630 [0x3e96]
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Provided range is out of integer limits. Pass `-fno-sycl-id-queries-fit-in-int' to disable range check. -30 (CL_INVALID_VALUE)

Is it right that the real issue is that sycl::range should not be limited to the range of an integer? Thanks.

bader commented 2 years ago

Is it right that the real issue is that sycl::range should not be limited to the range of an integer? Thanks.

Yes. It's done for performance reasons and can be relaxed with the -fno-sycl-id-queries-fit-in-int flag if needed.

masterleinad commented 2 years ago

As said in the initial post, I was already using -fno-sycl-id-queries-fit-in-int.

masterleinad commented 2 years ago

With

size_t N = 5000000000lu; // breaks for Intel

the test still fails with

Running on Intel(R) Graphics [0x020a]
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

on Intel GPUs with a nightly build from 10/25.

AerialMantis commented 2 years ago

Now that https://github.com/intel/llvm/pull/5095 is merged, this should address the problem for the CUDA backend, so I will remove the CUDA label.

@bader I believe the remaining issue here is with the OpenCL/Level Zero backend.

bader commented 2 years ago

HIP backend fix is not merged yet.

@bader I believe the remaining issue here is with the OpenCL/Level Zero backend.

I think the exception with the CL_INVALID_WORK_GROUP_SIZE error code might be expected here. Do you think OpenCL/Level Zero should support a work size > 5000000000?

AerialMantis commented 2 years ago

I'm not sure about Level Zero, but AFAICT OpenCL doesn't have any limitation on the global work size. The only thing I see is the CL_KERNEL_GLOBAL_WORK_SIZE query for clGetKernelWorkGroupInfo, though that is only for custom devices and built-in kernel functions. I believe that in OpenCL any global size is expected to work.
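For reference, a small sketch of that query against the standard OpenCL API (the kernel and device handles are assumed to have been created elsewhere; the returned values are size_t):

#include <CL/cl.h>
#include <cstdio>

// Query the global work size limit reported for custom devices / built-in kernels.
void print_kernel_global_work_size(cl_kernel kernel, cl_device_id device) {
   size_t gws[3] = {0, 0, 0};
   clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_GLOBAL_WORK_SIZE,
                            sizeof(gws), gws, nullptr);
   printf("CL_KERNEL_GLOBAL_WORK_SIZE: %zu x %zu x %zu\n", gws[0], gws[1], gws[2]);
}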

Though 5000000000 is larger than the max value of a 32-bit unsigned integer, so I can see why this could fail.

bader commented 2 years ago

@masterleinad, could you check whether the OpenCL back-end has such a limitation by setting SYCL_DEVICE_FILTER=opencl:gpu, please? I see that the Level Zero plug-in sets the work-group size using the zeKernelSuggestGroupSize function, whose global-size parameters are of type uint32_t, i.e. 32-bit integers. So it looks like, although SYCL uses size_t to represent the global work size, the Level Zero plugin can only support global work sizes up to UINT32_MAX. The OpenCL back-end uses clEnqueueNDRangeKernel directly, which accepts size_t global work sizes, so potentially it can support the full range of values allowed for SYCL.
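For reference, the type difference is visible in the two entry points involved (declarations abridged from the Level Zero and OpenCL headers):

// Level Zero: global sizes are 32-bit.
ze_result_t zeKernelSuggestGroupSize(
    ze_kernel_handle_t hKernel,
    uint32_t globalSizeX, uint32_t globalSizeY, uint32_t globalSizeZ,
    uint32_t* groupSizeX, uint32_t* groupSizeY, uint32_t* groupSizeZ);

// OpenCL: global sizes are size_t.
cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue, cl_kernel kernel, cl_uint work_dim,
    const size_t* global_work_offset, const size_t* global_work_size,
    const size_t* local_work_size, cl_uint num_events_in_wait_list,
    const cl_event* event_wait_list, cl_event* event);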

masterleinad commented 2 years ago

@masterleinad, could you check whether the OpenCL back-end has such a limitation by setting SYCL_DEVICE_FILTER=opencl:gpu, please?

It seems to work with the OpenCL back end.

TApplencourt commented 2 years ago

The bug is still present in Compiler 2022.1.0 (2022.x.0.20220503) with the L0 backend (agama 449)

$cat master.cpp
#include <iostream>
#include <CL/sycl.hpp>

int main(int, char**) {
   sycl::queue Q;
   size_t N = 4298000000;
   Q.parallel_for(N, [=](auto i) {}).wait();
}
$dpcpp master.cpp -fno-sycl-id-queries-fit-in-int
$./a.out
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

Also, if we have a WA for when range >= numeric_limits<int>::max(), can we do the conversion at runtime? Some of our users got bitten by this limitation. Just to add, Q.fill cannot be used to fill a buffer bigger than an int. I assure you that people allocate more than 4GB of memory and will try to set it... So having -fno-sycl-id-queries-fit-in-int by default may streamline the user experience.
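For context, a minimal sketch of the Q.fill case mentioned above (the size is illustrative; fill is implemented on top of an implicit parallel_for, so it runs into the same limit):

#include <CL/sycl.hpp>

int main() {
   sycl::queue Q;
   size_t N = 5000000000lu;                      // > INT_MAX elements, > 4 GB
   char* ptr = sycl::malloc_device<char>(N, Q);
   Q.fill(ptr, char(0), N).wait();               // internally launches a parallel_for over N work-items
   sycl::free(ptr, Q);
}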

bader commented 2 years ago

I've discussed that issue with @bashbaug a few months ago and he told me that the Level Zero driver doesn't support work sizes larger than 2^32. The application aborts because it doesn't handle the exception the DPC++ runtime library throws to report the unsupported work size. Is it possible to reduce the work size to meet the low-level runtime requirements (e.g. by enqueuing the kernel multiple times)?
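A minimal application-side sketch of that idea, splitting the range into chunks and carrying the chunk offset as a captured value (the chunk size and the empty kernel body are placeholders):

#include <algorithm>
#include <CL/sycl.hpp>

int main() {
   sycl::queue Q;
   const size_t total = 5000000000lu;         // full problem size, > UINT32_MAX
   const size_t chunk = size_t(1) << 30;      // per-launch size the back-end can handle
   for (size_t offset = 0; offset < total; offset += chunk) {
      const size_t n = std::min(chunk, total - offset);
      Q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
         const size_t global = offset + i[0];  // reconstruct the logical global index
         (void)global;
      });
   }
   Q.wait();
}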

TApplencourt commented 2 years ago

I've discussed that issue with @bashbaug a few months ago and he told me that the Level Zero driver doesn't support work sizes larger than 2^32. The application aborts because it doesn't handle the exception the DPC++ runtime library throws to report the unsupported work size.

Oh, I see. Thanks for the update! Let me gather more info and come back to you. Compiling with -fno-sycl-id-queries-fit-in-int makes the run-time error disappear, but I haven't checked the results for correctness yet.

Is it possible to reduce the work size to meet the low-level runtime requirements (e.g. by enqueuing the kernel multiple times)?

Maybe it would be more manageable to do it at the SYCL runtime level?

Indeed, otherwise each and every application would need to do that for each kernel submission (this can be worked around with some nice abstraction). More painful, the work would also need to be done for each function that implicitly uses parallel_for, for example Q.fill. That one starts being more tedious to implement, as it requires an understanding of the DPC++ runtime.

Edit: After talking to @jandres742, the "real" workaround is to set -ze-opt-greater-than-4GB-buffer-required when creating the module.

Edit2: Maybe also related to an IGC bug where get_global_id() only goes up to UINT_MAX.

bader commented 1 year ago

One more work-around idea: I suppose that if we explicitly set a work-group size such that the number of work-groups is < 2^32, the code from the issue description should work with the Level Zero back-end. This will require using the parallel_for kernel invocation function with an nd_range argument instead of a range.
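A minimal sketch of that work-around (512 is an arbitrary choice; SYCL requires the nd_range global size to be a multiple of the local size, so N must be divisible by it):

#include <CL/sycl.hpp>

int main() {
   sycl::queue Q;
   const size_t N = 5000000000lu;   // divisible by 512
   const size_t wg = 512;           // 5000000000 / 512 = 9765625 work-groups, well below 2^32
   Q.parallel_for(sycl::nd_range<1>(sycl::range<1>(N), sycl::range<1>(wg)),
                  [=](sycl::nd_item<1> it) {
                     size_t i = it.get_global_id(0);
                     (void)i;
                  })
    .wait();
}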

TApplencourt commented 1 year ago
#include <iostream>
#include <CL/sycl.hpp>
#include <level_zero/ze_api.h>

int main(int, char**) {
   sycl::queue Q;
   sycl::device D = Q.get_device();

   auto zD = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(D);
   ze_device_compute_properties_t device_properties = {};
   device_properties.stype = ZE_STRUCTURE_TYPE_DEVICE_COMPUTE_PROPERTIES;
   zeDeviceGetComputeProperties(zD, &device_properties);

   // The L0 spec may need to be changed so this doesn't return a `uint32_t`
   uint32_t maxGroupCountX = device_properties.maxGroupCountX;
   uint32_t maxGroupSizeX = device_properties.maxGroupSizeX;
   size_t  maxWorkItemX = (size_t) maxGroupSizeX * maxGroupCountX;
   std::cout << "maxGroupSizeX " << maxGroupSizeX << std::endl;
   std::cout << "maxGroupCountX " << maxGroupCountX << std::endl;
   std::cout << "maxGroupSizeX*maxGroupCountX " << maxWorkItemX << std::endl;

   std::cout << "Sumiting kernel..." << std::endl;
   std::cout << "Submiting maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxGroupCountX, [=](sycl::id<1> i) {}).wait();

   std::cout<< "Submitting maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxWorkItemX, [=](sycl::id<1> i) {}).wait();
   // SYCL is a high-level language that should run independently of any backend restriction
   std::cout<< "Submitting 2*maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(2*maxWorkItemX, [=](sycl::id<1> i) {}).wait();
}

I wrote a simple set of reproducers. I think all of them should pass. Maybe it can help. Just to be clear, this issue is blocking a lot of applications from running their large problem sizes ;(

My understanding is that SYCL doesn't have any "kernel-wise sync". So we should always be able to split a large work-item range into whatever chunk sizes are available in the backend (assuming the specified work-group size fits, of course).

bashbaug commented 1 year ago

So we should always be able to split a large work-item range into whatever chunk sizes are available in the backend (assuming the specified work-group size fits, of course).

FWIW, this is surprisingly difficult to do in the general case. Note that the "global offset" functionality provided by OpenCL and Level Zero offsets the global ID, not the group ID, so this isn't sufficient by itself to do the splitting in the higher-level runtimes. For CUDA, there is no "global offset" or similar. We could probably figure out a way to make it work, but it'd be complicated (and probably a little icky).

Just to be clear this issue is blocking a lot of applications of running their large problem size ;(

Is there some reasonable upper bound on a "large problem size", or should we plan for a full 64-bit range?

TApplencourt commented 1 year ago

FWIW, this is surprisingly difficult to do in the general case. Note that the "global offset" functionality provided by OpenCL and Level Zero offsets the global ID, not the group ID, so this isn't sufficient by itself to do the splitting in the higher-level runtimes. For CUDA, there is no "global offset" or similar. We could probably figure out a way to make it work, but it'd be complicated (and probably a little icky).

I see, thanks for the explanation! As always, from the outside, everything looks easy :) I guess you would need to add a new kernel argument to handle the offset and the like. Sounds icky indeed. I hope that this workaround is not mandatory and that the L0 backend can fix this issue. I hear that the CUDA and OpenCL backends handle my reproducer fine.

Is there some reasonable upper bound on a "large problem size", or should we plan for a full 64-bit range?

To be honest, I don't know... I guess my hand-wavy answer is "as much as they are used to running on NVIDIA". More than 32-bit, that is for sure. And I think less than or equal to maxGroupSizeX * maxGroupCountX :) I think that our priority should be to get

   std::cout<< "Submitting maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxWorkItemX, [=](sycl::id<1> i) {}).wait();

working. We care less about the 2*maxWorkItemX case.

bashbaug commented 1 year ago

To be honest, I don't know... I guess my hand-wavy answer is "as much as they are used to running on NVIDIA". More than 32-bit, that is for sure. And I think less than or equal to maxGroupSizeX * maxGroupCountX :)

OK thanks, this is helpful.

HW-wise our limit is on the number of work-groups we can launch and the max work-group size (pretty sure other HW is similar). This means that launching a global range equal to max_group_size * max_group_count should work if the group size is equal to max_group_size, but it won't work if the group size is smaller.
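To make that concrete with illustrative numbers: with max_group_size = 1024 and max_group_count = 2^32 - 1, a global range of 1024 * (2^32 - 1) launches exactly 2^32 - 1 work-groups at group size 1024 and fits the limit, but the same range launched at group size 512 would need 2 * (2^32 - 1) work-groups and exceed it.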

TApplencourt commented 1 year ago

HW-wise our limit is on the number of work-groups we can launch and the max work-group size (pretty sure other HW is similar). This means that launching a global range equal to max_group_size * max_group_count should work if the group size is equal to max_group_size, but it won't work if the group size is smaller.

This sounds like a valid limitation to me! If the user specifies an nd_range / group size, they give up some flexibility. And in the case of a range / no group size, the group-size selection algorithm should choose the "correct" group size for the kernel to run.

xtian-github commented 1 year ago

@smaslov-intel @bader do we have an ETA for this issue to be resolved? Thomas/ANL is asking for it. Thanks.

smaslov-intel commented 1 year ago

A workaround is coming in https://github.com/intel/llvm/pull/7321. It will allow some work sizes greater than UINT32_MAX (those that are exactly divisible by some legal WG size).
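For example (illustrative numbers), 5000000000 = 9765625 * 512, so with a legal work-group size of 512 that range can be launched as 9765625 work-groups, which fits in 32 bits; a size just above UINT32_MAX that no legal work-group size divides evenly would not be covered.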

KornevNikita commented 1 month ago

Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s).

@smaslov-intel, could I ask you to take one of the following actions? :)

Thanks!

xtian-github commented 1 month ago

@KornevNikita SergeyM is on leave. I suggest the SYCL team take a look to see what the right fix is to address this issue. Thanks.