I updated the report. The two applications, which were listed here before, can be executed successfully.
I checked quicksort-sycl and reproduced the error. The program is actually allocating buffers with zero size (done_buffer in lqsort() and news_buffer in gqsort()), so the error is expected.
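For illustration, here is a minimal sketch (not the benchmark's code; the done_buffer/lqsort names are just borrowed from above) of guarding against the zero-size case before constructing the buffer:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> done;  // empty here, mirroring the zero-size case hit in lqsort()
  sycl::queue q;

  if (done.empty()) {
    // Nothing to do: skip creating done_buffer entirely, since constructing
    // a buffer over zero elements is what triggers the reported error.
    std::cout << "skipping kernel, no elements\n";
    return 0;
  }

  sycl::buffer<int, 1> done_buffer(done.data(), sycl::range<1>(done.size()));
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc{done_buffer, cgh, sycl::read_write};
    cgh.parallel_for(sycl::range<1>(done.size()),
                     [=](sycl::id<1> i) { acc[i] += 1; });
  }).wait();
  return 0;
}
```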
When running the same program on Intel DevCloud, it does not report any error and finishes successfully. Thank you for your explanation.
I had a look at randomAccess-sycl. I'm not sure what's going on with it; however, it seems that randomAccess-cuda fails with exactly the same output, so I suspect this isn't a SYCL issue:
make run
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 -c main.cu -o main.o
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 main.o -o main
./main
Table size = 67108864
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
Found 65817635 errors in 67108864 locations (failed).
make: *** [Makefile:57: run] Error 1
I will contact Nvidia about the possibility of a bug in the new CUDA SDK. The HIP/OMP versions should pass.
Please ignore the error "Cannot select: intrinsic %llvm.nvvm.implicit.offset" which I fixed last week.
cuda 11.0

~/oneAPI-DirectProgramming/randomAccess-cuda[master !x?*]$ make run
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 -c main.cu -o main.o
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 main.o -o main
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
./main
Table size = 67108864
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
Found 0 errors in 67108864 locations (passed).
cuda 11.4
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 -c main.cu -o main.o
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 main.o -o main
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
./main
Table size = 67108864
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
Found 65817635 errors in 67108864 locations (failed).
That's interesting. It does seem to still fail with DPC++ and CUDA 10.2 though, so maybe it's something in nvcc that the PTX backend also does. I'm curious to see whether Nvidia has ideas about this.
Out of curiosity I did try it on gfx908 with the HIP plugin and that works fine.
It works on the host device too, and UBSan does not report anything suspicious. What's surprising is that we always get the same number of errors: 65817635. Is it code generation? I don't think it's coming from the atomics, or there would presumably be some kind of data race. That's so confusing...
I looked into backprop-sycl. The error is due to exceeding the maximum number of work-groups CUDA allows in dimension Y. I also compared it to backprop-cuda. While the code looks similar, CUDA and SYCL map work-items to processing elements in different ways. Usually this does not matter for the correctness of the code, but here you are launching a large number of work-groups in the first dimension in SYCL, which is mapped to CUDA's Y dimension, and that dimension does not allow such large sizes. Essentially you need to reverse the order of the dimensions in SYCL to make it behave the same as the CUDA benchmark.

You can get more information here (while it talks about ComputeCpp and OpenCL, the same is true for DPC++ and CUDA): https://www.codeplay.com/portal/blogs/2019/11/18/computecpp-v1-1-6-changes-to-work-item-mapping-optimization.html

I will also try to make DPC++ print a clearer error message for this case.
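To illustrate the reversed ordering, here is a minimal sketch (not the benchmark code; the sizes are made up) that puts the large work-group count in the last SYCL dimension, which the DPC++ CUDA backend maps to CUDA's x dimension:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  const size_t block_size = 16;
  const size_t num_groups = 1 << 20;  // far more than the 65535 groups allowed in CUDA's y/z

  // range<2>{slow, fast}: the rightmost dimension is the fastest-varying one
  // and maps to CUDA x, which accepts very large group counts, so the big
  // dimension goes last (the reverse of the original CUDA-style ordering).
  sycl::nd_range<2> launch{sycl::range<2>{block_size, num_groups * block_size},
                           sycl::range<2>{block_size, block_size}};

  q.parallel_for(launch, [=](sycl::nd_item<2> item) {
    // Work-group index along the large dimension.
    size_t group = item.get_group(1);
    (void)group;
  }).wait();
  return 0;
}
```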
Btw, I opened this https://github.com/intel/llvm/pull/4064 pull request to fix that issue (but there are still ABI naming discussions)
As far as I understand it, that pull request adds the option to query for these limits, but it does not fix the error in the sense of making the error message clearer or making the changes to this benchmark unnecessary.
The point would be to give the user the ability to check the sizes and prove correctness, so these errors wouldn't happen in "safe" code. On the error-reporting topic, DPC++ does report an error when you submit more than INT_MAX work-items, but no API call is ever made to check the individual CUDA limits, so right now this error isn't even detected.
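To make that concrete, here is a rough sketch (assuming a SYCL 2020 compiler; the grid-limit query itself is hypothetical until something like that PR lands) of checking device limits before launching:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  sycl::device dev = q.get_device();

  // Standard SYCL query: the per-work-group limit.
  size_t max_wg = dev.get_info<sycl::info::device::max_work_group_size>();
  std::cout << "max work-group size: " << max_wg << "\n";

  // The per-dimension limit on the *number* of work-groups (CUDA's grid
  // limit) is what backprop exceeds. Checking it would need the query
  // proposed in intel/llvm#4064, e.g. something like:
  //   auto max_groups = dev.get_info<max_work_groups_query>();  // hypothetical
  return 0;
}
```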
Sorry, backprop was ported more than a year ago. I will format the original codes and fix the indexing. Thank you for investigating the issue, explaining the cause of the error, and proposing a more informative error message.
I had a look at the reverse-sycl benchmark. The problem seems related to the access mode of d, which is discard_read_write. This mode tells the runtime that the kernel does not need the buffer's initial values, so the contents may be left uninitialised and then discarded. That is what causes the error, since the accessor d then does not provide access to the data within the buffer dd; defining the access mode of d as read_write solves the problem. On some devices/backend implementations discard_read_write may happen to behave like read_write, but this is absolutely not guaranteed and must be avoided if the kernel needs the buffer's initial values and the final ones need to be reported back. Moreover, the final test should take into account whether the total number of reverse operations is even or odd.
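A minimal sketch (not the benchmark itself) of the read_write behaviour described above, where the kernel both reads the buffer's initial values and has its results copied back:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> data{1, 2, 3, 4};
  sycl::queue q;
  {
    sycl::buffer<int, 1> d(data.data(), sycl::range<1>(data.size()));
    q.submit([&](sycl::handler &cgh) {
      // read_write: the kernel sees the existing contents; with a discard_*
      // mode the runtime may hand the kernel uninitialised storage instead.
      sycl::accessor acc{d, cgh, sycl::read_write};
      cgh.parallel_for(sycl::range<1>(data.size()),
                       [=](sycl::id<1> i) { acc[i] *= 2; });
    });
  }  // buffer destruction waits for the kernel and copies results back

  for (int v : data) std::cout << v << " ";  // 2 4 6 8
  std::cout << "\n";
  return 0;
}
```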
After reading your comments/suggestions, I fixed the errors and updated the examples. Thank you for pointing out the errors and for the suggestions!
Thanks for your feedback, zjin-lcf. I also had a look at the convolutionSeparable-sycl benchmark. The illegal memory access is caused by the pointer computations in the convolutionColumns function in conv.cpp at lines 123 and 124. Consider, for example, line 123:

float* dst_new = dst.get_pointer() + baseY * pitch + baseX;

where baseY is an int variable and pitch is an unsigned int function argument. Because pitch is declared as unsigned int, baseY is converted to unsigned int for the purpose of the multiplication. Unfortunately, baseY is negative when gidY is 0, and the conversion turns that negative offset into a huge positive one, which causes the illegal memory error through out-of-bounds accesses.

One possible solution, which maintains type consistency with the CUDA SDK example, is to declare the unsigned int arguments of convolutionColumns as int; this avoids the integer conversion and hence the problem.
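For what it's worth, here is a small standalone illustration (not the benchmark code) of the int-to-unsigned conversion at play:

```cpp
#include <cstdio>

int main() {
  int baseY = -1;             // negative, as when gidY is 0
  unsigned int upitch = 4;    // like the `unsigned int pitch` argument
  int spitch = 4;             // same value, declared signed

  // baseY is converted to unsigned for the multiplication, so instead of a
  // small negative offset we get a huge positive one that walks off the end
  // of the buffer.
  std::printf("unsigned pitch offset: %u\n", baseY * upitch);

  // With a signed pitch the offset is the intended -4.
  std::printf("signed   pitch offset: %d\n", baseY * spitch);
  return 0;
}
```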
After reading your comments, I changed the types of these arguments from uint to int. Thank you for pointing out the error and for the suggestions!
Thanks again for your feedback! As far as I can see, all the errors reported in this issue seem resolved. Shall we close it?
The randomAccess-cuda example runs successfully with CUDA 11.0 on an Nvidia GPU (e.g. a P100), but fails with newer CUDA SDKs.
The randomAccess-sycl example does not run successfully with either CUDA 11.0 or newer CUDA SDKs.
If my observations are not right, or there are errors in the codes, please let us know. Thanks.
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 -c main.cu -o main.o
nvcc -std=c++14 -Xcompiler -Wall -arch=sm_60 -O3 main.o -o main
./main
Table size = 67108864
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
Found 65817635 errors in 67108864 locations (failed).
make: *** [Makefile:57: run] Error 1
I have taken a second look at the randomAccess-sycl benchmark. The issue appears to be in LLVM's NVPTX code generation, where
temp = (temp << 1) ^ (((s64Int) temp < 0) ? POLY : 0);
is incorrectly replaced with a bit-field-extract operation. This substitution is performed because the constant POLY's value is 7, whose binary representation is a contiguous run of 1s in the low bits with 0s above, exactly the mask pattern the bit-field-extract matching looks for. So if POLY is replaced with a value that is not of the form 2^x - 1, this benchmark works.
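For reference, a quick host-side sanity check (not part of the benchmark) of what this line is supposed to compute, i.e. XOR POLY (7) into the shifted value only when the sign bit of temp was set:

```cpp
#include <cstdio>

typedef unsigned long long int u64Int;
typedef long long int s64Int;
#define POLY 7ULL

int main() {
  u64Int temp = 0x8000000000000001ULL;  // sign bit set, so POLY must be XORed in
  u64Int expected = (temp << 1) ^ POLY;
  u64Int got = (temp << 1) ^ (((s64Int)temp < 0) ? POLY : 0);
  std::printf("expected %llx, got %llx (%s)\n", expected, got,
              expected == got ? "ok" : "mismatch");
  return 0;
}
```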
I have proposed a fix to llvm: https://reviews.llvm.org/D117118
A temporary patch for this issue is:
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index dd4290a605a9..774ffb17546a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -3405,7 +3405,7 @@ bool NVPTXDAGToDAGISel::tryBFE(SDNode *N) {
}
// How many bits are in our mask?
- uint64_t NumBits = countTrailingOnes(MaskVal);
+ int64_t NumBits = countTrailingOnes(MaskVal);
Len = CurDAG->getTargetConstant(NumBits, DL, MVT::i32);
if (LHS.getOpcode() == ISD::SRL || LHS.getOpcode() == ISD::SRA) {
@@ -3417,7 +3417,7 @@ bool NVPTXDAGToDAGISel::tryBFE(SDNode *N) {
uint64_t StartVal = StartConst->getZExtValue();
// How many "good" bits do we have left? "good" is defined here as bits
// that exist in the original value, not shifted in.
- uint64_t GoodBits = Start.getValueSizeInBits() - StartVal;
+ int64_t GoodBits = Start.getValueSizeInBits() - StartVal;
if (NumBits > GoodBits) {
// Do not handle the case where bits have been shifted in. In theory
// we could handle this, but the cost is likely higher than just
I also investigated the randomAccess-cuda error. It is caused by a different issue from the SYCL benchmark. There is a bug with at least CUDA 11.2 that causes Table to be incorrectly initialised with all 0s rather than 0, 1, 2, 3, 4, etc., so the problem appears to be in the initTable kernel. I have reported the bug to Nvidia.
Example:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <iostream>

typedef unsigned long long int u64Int;
typedef long long int s64Int;

/* CUDA specific parameters */
#define K1_BLOCKSIZE 256
#define K2_BLOCKSIZE 128
#define K3_BLOCKSIZE 128

__global__ void initTable (u64Int* Table, const u64Int TableSize) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < TableSize) {
    Table[i] = i;
  }
}

int main(int argc, char** argv) {
  u64Int *Table = NULL;
  u64Int TableSize = 10;
  posix_memalign((void**)&Table, 1024, TableSize * sizeof(u64Int));

  u64Int* d_Table;
  cudaMalloc((void**)&d_Table, TableSize * sizeof(u64Int));

  /* initialize the table */
  std::cout << "TableSize:" << TableSize << std::endl;
  initTable<<<(TableSize+K1_BLOCKSIZE-1) / K1_BLOCKSIZE, K1_BLOCKSIZE>>>(d_Table, TableSize);

  cudaMemcpy(Table, d_Table, TableSize * sizeof(u64Int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 10; ++i)
    std::cout << Table[i] << ",";
  std::cout << std::endl;

  free( Table );
  cudaFree(d_Table);
  return 0;
}
Results:
CUDA 10.2
TableSize:10
0,1,2,3,4,5,6,7,8,9,
CUDA 11.2
TableSize:10
0,0,0,0,0,0,0,0,0,0,
Thank you very much for your analysis and reporting.
With the patch for the issue seen in randomAccess-sycl in review in upstream LLVM (https://reviews.llvm.org/D117118), I believe its acceptance will resolve the last of the issues described in this ticket. Could you please confirm, @zjin-lcf?
I will close the issue. Thank you!
https://reviews.llvm.org/D117118 is not complete. Thank you for the review.
I too see a similar error:
abagusetty@polaris-login-02 ~ $ clang++ -std=c++17 -fsycl -fsycl-targets=nvptx64-nvdia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 test_reddit.cpp
warning: linking module '/soft/compilers/oneapi/llvm/build_cuda11.4.4_06082023/lib/clang/17/../../clc/remangled-l64-signed_char.libspirv-nvptx64-nvidia-cuda.bc': Linking two modules of different target triples: '/soft/compilers/oneapi/llvm/build_cuda11.4.4_06082023/lib/clang/17/../../clc/remangled-l64-signed_char.libspirv-nvptx64-nvidia-cuda.bc' is 'nvptx64-nvidia-cuda' whereas 'test_reddit.cpp' is 'nvptx64-nvdia-cuda'
[-Wlinker-warnings]
1 warning generated.
fatal error: error in backend: Cannot select: intrinsic %llvm.nvvm.implicit.offset
llvm-foreach:
clang++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 17.0.0 (https://github.com/intel/llvm.git 1496c57722c7db8db7e582b582317e15e719ceb0)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /soft/compilers/oneapi/llvm/build_cuda11.4.4_06082023/bin
clang++: note: diagnostic msg: Error generating preprocessed source(s).
Using this reproducer:
#include <sycl/sycl.hpp>
#include <iostream>

struct test_struct {
  int* data = nullptr;
};

int test(test_struct **t){
  sycl::queue q;
  *t = sycl::malloc_shared<test_struct>(1, q);

  int* host_res = (int*) malloc(20 * sizeof(int));
  size_t size = 20;

  (*t)->data = sycl::malloc_device<int>(size, q);
  test_struct *t2 = *t;

  q.parallel_for(sycl::range<1>(size), [=](sycl::id<1> i) {
    t2->data[i] = i;
  }).wait();

  q.memcpy(host_res, (*t)->data, size * sizeof(int)).wait();

  for (size_t i = 0; i < size; i++)
  {
    std::cout << host_res[i] << std::endl;
  }

  sycl::free((*t)->data, q);
  return 0;
}

int main() {
  test_struct *t;
  test(&t);
  return 0;
}
@npmiller Can you please help with this?
Hello @abagusetty, this is not really related to this ticket; you just have a typo in the triple, which is missing an i: nvptx64-nvdia-cuda instead of nvptx64-nvidia-cuda.
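For reference, the same invocation with only the triple corrected (same toolchain assumed):

```sh
clang++ -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
        -Xsycl-target-backend --cuda-gpu-arch=sm_80 test_reddit.cpp
```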
We should probably add a proper check for that with a better error message; this is super confusing, and I ran into the same issue in the past.
Also, to give a quick update on the randomAccess-sycl issue: sorry, it got a little stuck in review upstream, but we've been pushing for it again and I'm hoping it gets merged fairly soon.
@npmiller Thanks, that was utterly stupid of me. The above reproducer yields correct results.
Thanks
Hello @zjin-lcf, the patch was finally merged in upstream LLVM; hopefully it'll land in DPC++ in the next pulldown.
@npmiller I suppose that it has landed in DPC++. Could you reproduce the results on an Nvidia GPU? Thanks.
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/quicksort-sycl
The buffer size should not be zero.
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/randomAccess-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/backprop-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/sobol-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/lanczos-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/reverse-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/convolutionSeparable-sycl