parallel_for_work_item does not work inside lambdas

rudenkornk commented 3 years ago

I have a simple test which fills buffer with local ids. If parallel_for_work_item is invoked directly everything is fine. But if it is located inside lambda, only the first element of range is executed, even if range is set explicitly.

The test:

#include <CL/sycl.hpp>

void FillLambda(cl::sycl::queue &queue, std::vector<int> &vec,
                size_t workGroupSize) {
  using namespace cl::sycl;
  auto buf = buffer{vec};
  auto nWorkGroups = vec.size() / workGroupSize;

  queue.submit([&](handler &h) {
    auto global = buf.template get_access<access::mode::read_write>(h);
    auto lambda = [=](group<1> g) {
      g.parallel_for_work_item([=](h_item<1> it) {
        global[it.get_global_id()[0]] = it.get_local_id()[0];
      });
    };

    h.parallel_for_work_group<class LambdaTestKernel>(
        range<1>{nWorkGroups}, range<1>{workGroupSize},
        [=](group<1> g) { lambda(g); });
  });

  queue.wait();
  buf.template get_access<cl::sycl::access::mode::read_write>();
}

void FillWithoutLambda(cl::sycl::queue &queue, std::vector<int> &vec,
                       size_t workGroupSize) {
  using namespace cl::sycl;
  auto buf = buffer{vec};
  auto nWorkGroups = vec.size() / workGroupSize;

  queue.submit([&](handler &h) {
    auto global = buf.template get_access<access::mode::read_write>(h);
    h.parallel_for_work_group<class WithoutLambdaTestKernel>(
        range<1>{nWorkGroups}, range<1>{workGroupSize}, [=](group<1> g) {
          g.parallel_for_work_item([=](h_item<1> it) {
            global[it.get_global_id()[0]] = it.get_local_id()[0];
          });
        });
  });
  queue.wait();
  buf.template get_access<cl::sycl::access::mode::read_write>();
}

int main() {
  auto size = 16;
  auto workGroupSize = 4;
  auto nWorkGroups = size / workGroupSize;
  auto emptyVec = std::vector<int>(size);
  auto lambdaVec = emptyVec;
  auto withoutLambdaVec = emptyVec;

  auto GPUSelector = cl::sycl::gpu_selector{};
  auto queue = cl::sycl::queue{GPUSelector};

  FillLambda(queue, lambdaVec, workGroupSize);
  FillWithoutLambda(queue, withoutLambdaVec, workGroupSize);

  std::sort(lambdaVec.begin(), lambdaVec.end());
  std::sort(withoutLambdaVec.begin(), withoutLambdaVec.end());
  for (auto i = 0; i != size; ++i)
    std::cout << "Expected: " << i / nWorkGroups
              << "; Computed without lambda: " << withoutLambdaVec[i]
              << "; Computed with lambda: " << lambdaVec[i] << std::endl;
}

clang++ -fsycl LambdaTest.cpp

Output:

Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 1; Computed without lambda: 1; Computed with lambda: 0
Expected: 1; Computed without lambda: 1; Computed with lambda: 0
Expected: 1; Computed without lambda: 1; Computed with lambda: 0
Expected: 1; Computed without lambda: 1; Computed with lambda: 0
Expected: 2; Computed without lambda: 2; Computed with lambda: 0
Expected: 2; Computed without lambda: 2; Computed with lambda: 0
Expected: 2; Computed without lambda: 2; Computed with lambda: 0
Expected: 2; Computed without lambda: 2; Computed with lambda: 0
Expected: 3; Computed without lambda: 3; Computed with lambda: 0
Expected: 3; Computed without lambda: 3; Computed with lambda: 0
Expected: 3; Computed without lambda: 3; Computed with lambda: 0
Expected: 3; Computed without lambda: 3; Computed with lambda: 0

kbobrovs commented 3 years ago

This is a known limitation of the hierarchical parallelism implementation in DPCPP - all such hier par constructs (PFWG and PFWI) must be lexically included by the kernel, and can not reside in functions called from the kernel. Hierarchical parallelism is not advised to be used in real apps by SYCL experts, and is likely to be reworked in future SYCL spec versions - that's why it was considered not worthwhile to fix this limitation. But current behavior - silent failure - is definitely not OK, and at least compilation error should be issued.

kbobrovs commented 3 years ago

+ @againull

0x12CC commented 2 months ago

I'm not able to reproduce the incorrect output. It seems to work as expected when compiling without optimizations but crashes otherwise:

$ clang++ -fsycl lambda.cpp -O0
$ ./a.out
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 1; Computed without lambda: 1; Computed with lambda: 1
Expected: 1; Computed without lambda: 1; Computed with lambda: 1
Expected: 1; Computed without lambda: 1; Computed with lambda: 1
Expected: 1; Computed without lambda: 1; Computed with lambda: 1
Expected: 2; Computed without lambda: 2; Computed with lambda: 2
Expected: 2; Computed without lambda: 2; Computed with lambda: 2
Expected: 2; Computed without lambda: 2; Computed with lambda: 2
Expected: 2; Computed without lambda: 2; Computed with lambda: 2
Expected: 3; Computed without lambda: 3; Computed with lambda: 3
Expected: 3; Computed without lambda: 3; Computed with lambda: 3
Expected: 3; Computed without lambda: 3; Computed with lambda: 3
Expected: 3; Computed without lambda: 3; Computed with lambda: 3
$ clang++ -fsycl lambda.cpp -O1
$ ./a.out
Segmentation fault (core dumped)
$ clang++ -fsycl lambda.cpp -O2
$ ./a.out
Segmentation fault (core dumped)
$ clang++ -fsycl lambda.cpp -O3
$ ./a.out
Segmentation fault (core dumped)

intel / llvm

parallel_for_work_item does not work inside lambdas #3006