I updated the report. The two applications, which were listed here before, can be executed successfully.
I checked quicksort-sycl and reproduced the error. The program is actually allocating buffers with zero size (done_buffer in lqsort() and news_buffer in gqsort()), so the error is expected.
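For illustration, here is a minimal sketch (not the benchmark's code; the done_buffer/lqsort names are just borrowed from above) of guarding against the zero-size case before constructing the buffer:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> done;  // empty here, mirroring the zero-size case hit in lqsort()
  sycl::queue q;

  if (done.empty()) {
    // Nothing to do: skip creating done_buffer entirely, since constructing
    // a buffer over zero elements is what triggers the reported error.
    std::cout << "skipping kernel, no elements\n";
    return 0;
  }

  sycl::buffer<int, 1> done_buffer(done.data(), sycl::range<1>(done.size()));
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc{done_buffer, cgh, sycl::read_write};
    cgh.parallel_for(sycl::range<1>(done.size()),
                     [=](sycl::id<1> i) { acc[i] += 1; });
  }).wait();
  return 0;
}
```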
When running the same program on Intel DevCloud, it does not report any error and finishes successfully. Thank you for your explanation.
I had a look at randomAccess-sycl. I'm not sure what's going on with it; however, it seems that randomAccess-cuda fails with exactly the same output, so I suspect this isn't a SYCL issue:
make run
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 -c main.cu -o main.o
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 main.o -o main
./main
Table size = 67108864
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
Found 65817635 errors in 67108864 locations (failed).
make: *** [Makefile:57: run] Error 1
I will contact Nvidia about the possibility of a bug in the new CUDA SDK. The HIP/OMP versions should pass.
Please ignore the error "Cannot select: intrinsic %llvm.nvvm.implicit.offset" which I fixed last week.
cuda 11.0

~/oneAPI-DirectProgramming/randomAccess-cuda[master !x?*]$ make run
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 -c main.cu -o main.o
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 main.o -o main
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
./main
Table size = 67108864
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
Found 0 errors in 67108864 locations (passed).
cuda 11.4
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 -c main.cu -o main.o
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 main.o -o main
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
./main
Table size = 67108864
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
Found 65817635 errors in 67108864 locations (failed).
That's interesting. It does seem to still fail with DPC++ and CUDA 10.2 though, so maybe it's something in nvcc that the PTX backend also does. I'm curious to see whether Nvidia has ideas about this.
Out of curiosity I did try it on gfx908 with the HIP plugin and that works fine.
It works on the host device too, and UBSan does not report anything suspicious. What's surprising is that we always get the same number of errors: 65817635. Is it code generation? I don't think it's coming from the atomics, or there would presumably be some kind of data race. That's so confusing...
I looked into backprop-sycl. The error is due to exceeding the maximum number of work-groups CUDA allows in dimension Y. I also compared it to backprop-cuda. While the code looks similar, CUDA and SYCL map work-items to processing elements in different ways. Usually this does not matter for the correctness of the code, but here you are launching a large number of work-groups in the first dimension in SYCL, which is mapped to CUDA's Y dimension, and that dimension does not allow such large sizes. Essentially you need to reverse the order of the dimensions in SYCL to make it behave the same as the CUDA benchmark.

You can get more information here (while it talks about ComputeCpp and OpenCL, the same is true for DPC++ and CUDA): https://www.codeplay.com/portal/blogs/2019/11/18/computecpp-v1-1-6-changes-to-work-item-mapping-optimization.html

I will also try to make DPC++ print a clearer error message for this case.
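To illustrate the reversed ordering, here is a minimal sketch (not the benchmark code; the sizes are made up) that puts the large work-group count in the last SYCL dimension, which the DPC++ CUDA backend maps to CUDA's x dimension:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  const size_t block_size = 16;
  const size_t num_groups = 1 << 20;  // far more than the 65535 groups allowed in CUDA's y/z

  // range<2>{slow, fast}: the rightmost dimension is the fastest-varying one
  // and maps to CUDA x, which accepts very large group counts, so the big
  // dimension goes last (the reverse of the original CUDA-style ordering).
  sycl::nd_range<2> launch{sycl::range<2>{block_size, num_groups * block_size},
                           sycl::range<2>{block_size, block_size}};

  q.parallel_for(launch, [=](sycl::nd_item<2> item) {
    // Work-group index along the large dimension.
    size_t group = item.get_group(1);
    (void)group;
  }).wait();
  return 0;
}
```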
Btw, I opened this https://github.com/intel/llvm/pull/4064 pull request to fix that issue (but there are still ABI naming discussions)
As far as I understand it, that pull request adds the option to query for these limits, but it does not fix the error in the sense of making the error message clearer or making the changes to this benchmark unnecessary.
The point would be to give the user the ability to check the sizes and prove correctness, so these errors wouldn't happen in "safe" code. On the error-reporting topic, DPC++ does report an error when you submit more than INT_MAX work-items, but no API call is ever made to check the individual CUDA limits, so right now this error isn't even detected.
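To make that concrete, here is a rough sketch (assuming a SYCL 2020 compiler; the grid-limit query itself is hypothetical until something like that PR lands) of checking device limits before launching:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  sycl::device dev = q.get_device();

  // Standard SYCL query: the per-work-group limit.
  size_t max_wg = dev.get_info<sycl::info::device::max_work_group_size>();
  std::cout << "max work-group size: " << max_wg << "\n";

  // The per-dimension limit on the *number* of work-groups (CUDA's grid
  // limit) is what backprop exceeds. Checking it would need the query
  // proposed in intel/llvm#4064, e.g. something like:
  //   auto max_groups = dev.get_info<max_work_groups_query>();  // hypothetical
  return 0;
}
```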
Sorry, backprop was ported more than a year ago. I will format the original codes and fix the indexing. Thank you for investigating the issue, explaining the cause of the error, and proposing a more informative error message.
I had a look at the reverse-sycl benchmark. The problem seems related to the access mode of d, which is discard_read_write. This mode tells the runtime that the kernel does not need the buffer's initial values, so the contents may be left uninitialised and then discarded. That is what causes the error, since the accessor d then does not provide access to the data within the buffer dd; defining the access mode of d as read_write solves the problem. On some devices/backend implementations discard_read_write may happen to behave like read_write, but this is absolutely not guaranteed and must be avoided if the kernel needs the buffer's initial values and the final ones need to be reported back. Moreover, the final test should take into account whether the total number of reverse operations is even or odd.
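A minimal sketch (not the benchmark itself) of the read_write behaviour described above, where the kernel both reads the buffer's initial values and has its results copied back:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> data{1, 2, 3, 4};
  sycl::queue q;
  {
    sycl::buffer<int, 1> d(data.data(), sycl::range<1>(data.size()));
    q.submit([&](sycl::handler &cgh) {
      // read_write: the kernel sees the existing contents; with a discard_*
      // mode the runtime may hand the kernel uninitialised storage instead.
      sycl::accessor acc{d, cgh, sycl::read_write};
      cgh.parallel_for(sycl::range<1>(data.size()),
                       [=](sycl::id<1> i) { acc[i] *= 2; });
    });
  }  // buffer destruction waits for the kernel and copies results back

  for (int v : data) std::cout << v << " ";  // 2 4 6 8
  std::cout << "\n";
  return 0;
}
```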
After reading your comments/suggestions, I fixed the errors and updated the examples. Thank you for pointing out the errors and for the suggestions!
Thanks for your feedback, zjin-lcf. I also had a look at the convolutionSeparable-sycl benchmark. The illegal memory access is caused by the pointer computations in the convolutionColumns function in conv.cpp at lines 123 and 124. Consider, for example, line 123:

float* dst_new = dst.get_pointer() + baseY * pitch + baseX;

where baseY is an int variable and pitch is an unsigned int function argument. Because pitch is declared as unsigned int, baseY is converted to unsigned int for the purpose of the multiplication. Unfortunately, baseY is negative when gidY is 0, and the conversion turns that negative offset into a huge positive one, which causes the illegal memory error through out-of-bounds accesses.

One possible solution, which maintains type consistency with the CUDA SDK example, is to declare the unsigned int arguments of convolutionColumns as int; this avoids the integer conversion and hence the problem.
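For what it's worth, here is a small standalone illustration (not the benchmark code) of the int-to-unsigned conversion at play:

```cpp
#include <cstdio>

int main() {
  int baseY = -1;             // negative, as when gidY is 0
  unsigned int upitch = 4;    // like the `unsigned int pitch` argument
  int spitch = 4;             // same value, declared signed

  // baseY is converted to unsigned for the multiplication, so instead of a
  // small negative offset we get a huge positive one that walks off the end
  // of the buffer.
  std::printf("unsigned pitch offset: %u\n", baseY * upitch);

  // With a signed pitch the offset is the intended -4.
  std::printf("signed   pitch offset: %d\n", baseY * spitch);
  return 0;
}
```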
After reading your comments, I changed the types of these arguments from uint to int. Thank you for pointing out the error and for the suggestions!
Thanks again for your feedback! As far as I can see, all the errors reported in this issue seem resolved. Shall we close it?
The randomAccess-cuda example runs successfully with CUDA 11.0 on an Nvidia GPU (e.g. a P100), but fails with newer CUDA SDKs.
The randomAccess-sycl example does not run successfully with either CUDA 11.0 or newer CUDA SDKs.
If my observations are not right, or there are errors in the codes, please let us know. Thanks.
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 -c main.cu -o main.o
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 main.o -o main
./main
Table size = 67108864
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
Found 65817635 errors in 67108864 locations (failed).
make: *** [Makefile:57: run] Error 1
I have taken a second look at the randomAccess-sycl benchmark. The issue appears to be in LLVM's NVPTX code generation, where
temp = (temp << 1) ^ (((s64Int) temp < 0) ? POLY : 0);
is incorrectly replaced with a bit-field-extract operation. This substitution is performed because the constant POLY's value is 7, whose binary representation is a contiguous run of 1s in the low bits with 0s above, exactly the mask pattern the bit-field-extract matching looks for. So if POLY is replaced with a value that is not of the form 2^x - 1, this benchmark works.
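For reference, a quick host-side sanity check (not part of the benchmark) of what this line is supposed to compute, i.e. XOR POLY (7) into the shifted value only when the sign bit of temp was set:

```cpp
#include <cstdio>

typedef unsigned long long int u64Int;
typedef long long int s64Int;
#define POLY 7ULL

int main() {
  u64Int temp = 0x8000000000000001ULL;  // sign bit set, so POLY must be XORed in
  u64Int expected = (temp << 1) ^ POLY;
  u64Int got = (temp << 1) ^ (((s64Int)temp < 0) ? POLY : 0);
  std::printf("expected %llx, got %llx (%s)\n", expected, got,
              expected == got ? "ok" : "mismatch");
  return 0;
}
```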
I have proposed a fix to llvm: https://reviews.llvm.org/D117118
A temporary patch for this issue is:
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index dd4290a605a9..774ffb17546a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -3405,7 +3405,7 @@ bool NVPTXDAGToDAGISel::tryBFE(SDNode *N) {
}
// How many bits are in our mask?
- uint64_t NumBits = countTrailingOnes(MaskVal);
+ int64_t NumBits = countTrailingOnes(MaskVal);
Len = CurDAG->getTargetConstant(NumBits, DL, MVT::i32);
if (LHS.getOpcode() == ISD::SRL || LHS.getOpcode() == ISD::SRA) {
@@ -3417,7 +3417,7 @@ bool NVPTXDAGToDAGISel::tryBFE(SDNode *N) {
uint64_t StartVal = StartConst->getZExtValue();
// How many "good" bits do we have left? "good" is defined here as bits
// that exist in the original value, not shifted in.
- uint64_t GoodBits = Start.getValueSizeInBits() - StartVal;
+ int64_t GoodBits = Start.getValueSizeInBits() - StartVal;
if (NumBits > GoodBits) {
// Do not handle the case where bits have been shifted in. In theory
// we could handle this, but the cost is likely higher than just
I also investigated the randomAccess-cuda error. It is caused by a different issue from the SYCL benchmark. There is a bug with at least CUDA 11.2 that causes Table to be incorrectly initialised with all 0s rather than 0, 1, 2, 3, 4, etc., so the problem appears to be in the initTable kernel. I have reported the bug to Nvidia.
Example:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <iostream>

typedef unsigned long long int u64Int;
typedef long long int s64Int;

/* CUDA specific parameters */
#define K1_BLOCKSIZE 256
#define K2_BLOCKSIZE 128
#define K3_BLOCKSIZE 128

__global__ void initTable (u64Int* Table, const u64Int TableSize) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < TableSize) {
    Table[i] = i;
  }
}

int main(int argc, char** argv) {
  u64Int *Table = NULL;
  u64Int TableSize = 10;
  posix_memalign((void**)&Table, 1024, TableSize * sizeof(u64Int));

  u64Int* d_Table;
  cudaMalloc((void**)&d_Table, TableSize * sizeof(u64Int));

  /* initialize the table */
  std::cout << "TableSize:" << TableSize << std::endl;
  initTable<<<(TableSize+K1_BLOCKSIZE-1) / K1_BLOCKSIZE, K1_BLOCKSIZE>>>(d_Table, TableSize);

  cudaMemcpy(Table, d_Table, TableSize * sizeof(u64Int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 10; ++i)
    std::cout << Table[i] << ",";
  std::cout << std::endl;

  free( Table );
  cudaFree(d_Table);
  return 0;
}
Results:
CUDA 10.2
TableSize:10
0,1,2,3,4,5,6,7,8,9,
CUDA 11.2
TableSize:10
0,0,0,0,0,0,0,0,0,0,
Thank you very much for your analysis and reporting.
With the patch for the issue seen in randomAccess-sycl in review in upstream LLVM (https://reviews.llvm.org/D117118), I believe its acceptance will resolve the last of the issues described in this ticket. Could you please confirm, @zjin-lcf?
I will close the issue. Thank you!
https://reviews.llvm.org/D117118 is not complete. Thank you for the review.
I too see a similar error:
abagusetty@polaris-login-02 ~ $ clang++ -std=c++17 -fsycl -fsycl-targets=nvptx64-nvdia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 test_reddit.cpp
warning: linking module '/soft/compilers/oneapi/llvm/build_cuda11.4.4_06082023/lib/clang/17/../../clc/remangled-l64-signed_char.libspirv-nvptx64-nvidia-cuda.bc': Linking two modules of different target triples: '/soft/compilers/oneapi/llvm/build_cuda11.4.4_06082023/lib/clang/17/../../clc/remangled-l64-signed_char.libspirv-nvptx64-nvidia-cuda.bc' is 'nvptx64-nvidia-cuda' whereas 'test_reddit.cpp' is 'nvptx64-nvdia-cuda'
[-Wlinker-warnings]
1 warning generated.
fatal error: error in backend: Cannot select: intrinsic %llvm.nvvm.implicit.offset
llvm-foreach:
clang++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 17.0.0 (https://github.com/intel/llvm.git 1496c57722c7db8db7e582b582317e15e719ceb0)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /soft/compilers/oneapi/llvm/build_cuda11.4.4_06082023/bin
clang++: note: diagnostic msg: Error generating preprocessed source(s).
Using this reproducer:
#include <sycl/sycl.hpp>
#include <iostream>

struct test_struct {
  int* data = nullptr;
};

int test(test_struct **t){
  sycl::queue q;
  *t = sycl::malloc_shared<test_struct>(1, q);

  int* host_res = (int*) malloc(20 * sizeof(int));
  size_t size = 20;

  (*t)->data = sycl::malloc_device<int>(size, q);
  test_struct *t2 = *t;

  q.parallel_for(sycl::range<1>(size), [=](sycl::id<1> i) {
    t2->data[i] = i;
  }).wait();

  q.memcpy(host_res, (*t)->data, size * sizeof(int)).wait();

  for (size_t i = 0; i < size; i++)
  {
    std::cout << host_res[i] << std::endl;
  }

  sycl::free((*t)->data, q);
  return 0;
}

int main() {
  test_struct *t;
  test(&t);
  return 0;
}
@npmiller Can you please help with this?
Hello @abagusetty, this is not really related to this ticket; you just have a typo in the triple, which is missing an i: nvptx64-nvdia-cuda instead of nvptx64-nvidia-cuda.
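For reference, the same invocation with only the triple corrected (same toolchain assumed):

```sh
clang++ -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
        -Xsycl-target-backend --cuda-gpu-arch=sm_80 test_reddit.cpp
```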
We should probably add a proper check for that with a better error message; this is super confusing, and I ran into the same issue in the past.
Also, to give a quick update on the randomAccess-sycl issue: sorry, it got a little stuck in review upstream, but we've been pushing for it again and I'm hoping it gets merged fairly soon.
@npmiller Thanks, that was utterly stupid of me. The above reproducer yields correct results.
Thanks
Hello @zjin-lcf, the patch was finally merged in upstream LLVM; hopefully it'll land in DPC++ in the next pulldown.
@npmiller I suppose that it has landed in DPC++. Could you reproduce the results on an Nvidia GPU? Thanks.
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/quicksort-sycl
The buffer size should not be zero.
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/randomAccess-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/backprop-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/sobol-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/lanczos-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/reverse-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/convolutionSeparable-sycl