Open rudenkornk opened 3 years ago
This is a known limitation of the hierarchical parallelism implementation in DPCPP: all hierarchical parallelism constructs (PFWG, parallel_for_work_group, and PFWI, parallel_for_work_item) must be lexically contained in the kernel and cannot reside in functions called from the kernel. SYCL experts advise against using hierarchical parallelism in real apps, and it is likely to be reworked in future SYCL spec versions, which is why fixing this limitation was not considered worthwhile. But the current behavior, a silent failure, is definitely not OK; at the very least, a compilation error should be issued.
+ @againull
I'm not able to reproduce the incorrect output. It seems to work as expected when compiling without optimizations but crashes otherwise:
$ clang++ -fsycl lambda.cpp -O0
$ ./a.out
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 0; Computed without lambda: 0; Computed with lambda: 0
Expected: 1; Computed without lambda: 1; Computed with lambda: 1
Expected: 1; Computed without lambda: 1; Computed with lambda: 1
Expected: 1; Computed without lambda: 1; Computed with lambda: 1
Expected: 1; Computed without lambda: 1; Computed with lambda: 1
Expected: 2; Computed without lambda: 2; Computed with lambda: 2
Expected: 2; Computed without lambda: 2; Computed with lambda: 2
Expected: 2; Computed without lambda: 2; Computed with lambda: 2
Expected: 2; Computed without lambda: 2; Computed with lambda: 2
Expected: 3; Computed without lambda: 3; Computed with lambda: 3
Expected: 3; Computed without lambda: 3; Computed with lambda: 3
Expected: 3; Computed without lambda: 3; Computed with lambda: 3
Expected: 3; Computed without lambda: 3; Computed with lambda: 3
$ clang++ -fsycl lambda.cpp -O1
$ ./a.out
Segmentation fault (core dumped)
$ clang++ -fsycl lambda.cpp -O2
$ ./a.out
Segmentation fault (core dumped)
$ clang++ -fsycl lambda.cpp -O3
$ ./a.out
Segmentation fault (core dumped)
I have a simple test that fills a buffer with local ids. If parallel_for_work_item is invoked directly, everything is fine. But if it is located inside a lambda, only the first element of the range is executed, even if the range is set explicitly.
The test:
clang++ -fsycl LambdaTest.cpp
Output: