@ManuelCostanzo Was this on a V100 GPU?
@jinz2014 I tried on an RTX 2070, RTX 3070, RTX 3090, V100, GTX 980, and GTX 1080. For some reason, the only GPU that gets similar times is the GTX 1080; on all the others, CUDA gets better times (and, of course, GFLOPS).
I couldn't reproduce the significant performance gap on a V100. The number of iterations is 10 for both programs. You observed gaps on many GPUs. I'm curious whether the execution time changes when the 2D shared-memory accesses in the SYCL program are converted to 1D accesses.
```
./main 16384 16384 16384 16384
[Matrix Multiply Using CUDA] - Starting...
Device Name: Tesla V100-DGXS-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using CUDA Kernel... done 3.271226
Performance= 2856.27 GFlop/s, Time= 3079.578 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```

```
./main 16384 16384 16384 16384
[Matrix Multiply Using SYCL] - Starting...
Device Name: Tesla V100-DGXS-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using SYCL Kernel... done 3.248668
Performance= 2712.60 GFlop/s, Time= 3242.683 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```
Hey, sorry, the code I pasted was wrong because in SYCL the nIter variable was 1 instead of 10. I've updated that.
Here are my results for RTX 3090:
```
CUDA_VISIBLE_DEVICES=0 ./mm_cu 16384 16384 16384 16384
[Matrix Multiply Using CUDA] - Starting...
Device Name: NVIDIA GeForce RTX 3090
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using CUDA Kernel... done 3.126466
Performance= 2933.73 GFlop/s, Time= 2998.263 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```

```
CUDA_VISIBLE_DEVICES=0 SYCL_DEVICE_FILTER=gpu ./mm_sycl 16384 16384 16384 16384
[Matrix Multiply Using SYCL] - Starting...
Device Name: NVIDIA GeForce RTX 3090
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using SYCL Kernel... done 4.167054
Performance= 2099.78 GFlop/s, Time= 4189.050 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```
Here are the results with V100:
```
CUDA_VISIBLE_DEVICES=1 ./mm_cu 16384 16384 16384 16384
[Matrix Multiply Using CUDA] - Starting...
Device Name: Tesla V100-PCIE-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using CUDA Kernel... done 3.377448
Performance= 2735.60 GFlop/s, Time= 3215.415 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```

```
CUDA_VISIBLE_DEVICES=1 SYCL_DEVICE_FILTER=gpu ./mm_sycl 16384 16384 16384 16384
[Matrix Multiply Using SYCL] - Starting...
Device Name: Tesla V100-PCIE-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using SYCL Kernel... done 3.417807
Performance= 2580.01 GFlop/s, Time= 3409.328 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```
It's true that the difference is smaller on the V100, but on the RTX it's ~900 GFLOPS, which is a lot in my opinion, just for a matrix multiplication. What do you think?
For 50 iterations on the V100 I'm getting ~3.8 s per iteration in SYCL and ~3.2 s in CUDA; ~600 ms isn't a small difference, I think:
```
[Matrix Multiply Using CUDA] - Starting...
Device Name: Tesla V100-PCIE-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using CUDA Kernel... done 3.382438
Performance= 2681.07 GFlop/s, Time= 3280.812 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```

```
[Matrix Multiply Using SYCL] - Starting...
Device Name: Tesla V100-PCIE-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using SYCL Kernel... done 3.414617
Performance= 2288.59 GFlop/s, Time= 3843.451 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```
"RTX is ~900gflops that is a lot in my opinion/~600msec isn't a little number" Yes, I will try to run programs on 3090.
Excellent, because I'm a bit lost with this. I would expect the performance to be the same, since the code is not complex. I'm suspicious about shared memory, but I don't know how to confirm that through profiling.
Can you try using accessor::get_pointer() in the SYCL code? I've noticed that accessors sometimes lead to slower CUDA code; I suspect the compiler fails to optimize them.
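Roughly what I have in mind (just a sketch; `As_acc`/`Bs_acc` stand in for the local accessors already in the kernel, and `ty`, `tx`, `k`, `Csub`, `BLOCK_SIZE` are the names from the sample):

```cpp
// Take raw float* pointers from the local accessors once...
float *As = As_acc.get_pointer();
float *Bs = Bs_acc.get_pointer();

// ...and index them manually inside the tile loop:
Csub += As[ty * BLOCK_SIZE + k] * Bs[k * BLOCK_SIZE + tx];
```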
Hi, I just tried that, and it was the same. I passed accessor.get_pointer() to the kernel and received the float*, but it didn't change anything :/
I also tried using a 1D accessor, but that didn't help either; the times were the same.
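For reference, the 1D variant I tried looked roughly like this (a sketch; `cgh` is the command-group handler and `BLOCK_SIZE` the tile width):

```cpp
// 1D local accessors instead of the 2D ones, with manual row-major indexing
sycl::local_accessor<float, 1> As(sycl::range<1>(BLOCK_SIZE * BLOCK_SIZE), cgh);
sycl::local_accessor<float, 1> Bs(sycl::range<1>(BLOCK_SIZE * BLOCK_SIZE), cgh);

// ...and inside the kernel:
// Csub += As[ty * BLOCK_SIZE + k] * Bs[k * BLOCK_SIZE + tx];
```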
Then a detailed profile of both codes would be interesting. Run the CUDA and SYCL code through the Nsight Compute profiler (`ncu --set=full`) and compare them. If, for example, the SYCL code loads more data than the CUDA code, that could explain the difference.
I ran on the RTX 3070 because that's where I have root access.
Here are the reports for just 1 iteration (I have nsys installed; I believe that will work, right?).
CUDA

```
SYCL_DEVICE_FILTER=gpu nsys profile --stats=true --trace=cuda,nvtx ./mm_cu 16384 16384 16384 16384

[4/7] Executing 'cudaapisum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
81.2 5317895575 3 1772631858.3 163084415.0 2877 5154808283 2930185475.8 cudaStreamSynchronize
12.1 792685438 3 264228479.3 235289108.0 230984087 326412243 53895720.1 cudaHostAlloc
6.6 431403952 3 143801317.3 142455426.0 129784667 159163859 14735766.0 cudaFreeHost
0.1 5628676 3 1876225.3 2086470.0 763250 2778956 1024167.8 cudaFree
0.0 2470952 3 823650.7 808338.0 807549 855065 27208.5 cudaMalloc
0.0 39071 3 13023.7 13754.0 3226 22091 9453.7 cudaMemcpyAsync
0.0 28239 1 28239.0 28239.0 28239 28239 0.0 cudaLaunchKernel
0.0 18927 1 18927.0 18927.0 18927 18927 0.0 cudaStreamCreateWithFlags
0.0 617 1 617.0 617.0 617 617 0.0 cuModuleGetLoadingMode
[5/7] Executing 'gpukernsum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) GridXYZ BlockXYZ Name
100.0 4821601335 1 4821601335.0 4821601335.0 4821601335 4821601335 0.0 1024 1024 1 16 16 1 void MatrixMulCUDA<(int)16>(float *, float *, float *, int, int)
[6/7] Executing 'gpumemtimesum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
67.1 333230635 2 166615317.5 166615317.5 166328177 166902458 406078.0 [CUDA memcpy HtoD]
32.9 163077603 1 163077603.0 163077603.0 163077603 163077603 0.0 [CUDA memcpy DtoH]
[7/7] Executing 'gpumemsizesum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
2147.484 2 1073.742 1073.742 1073.742 1073.742 0.000 [CUDA memcpy HtoD]
1073.742 1 1073.742 1073.742 1073.742 1073.742 0.000 [CUDA memcpy DtoH]
```
SYCL

```
SYCL_DEVICE_FILTER=gpu nsys profile --stats=true --trace=cuda,nvtx ./mm_sycl 16384 16384 16384 16384

[4/7] Executing 'cudaapisum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
81.3 7308632995 8 913579124.4 810.0 530 7308624297 2583988461.5 cuStreamSynchronize
8.2 740074949 3 246691649.7 246157512.0 241777811 252139626 5201517.1 cuMemAllocHost_v2
5.5 497126323 3 165708774.3 166567995.0 163039627 167518701 2359919.7 cuEventSynchronize
4.8 434799197 3 144933065.7 144337695.0 130999647 159461855 14240441.4 cuMemFreeHost
0.1 5608288 3 1869429.3 2109918.0 744995 2753375 1025560.2 cuMemFree_v2
0.0 2489266 3 829755.3 812702.0 810059 866505 31853.6 cuMemAlloc_v2
0.0 291732 1 291732.0 291732.0 291732 291732 0.0 cuModuleLoadDataEx
0.0 36600 3 12200.0 11363.0 6295 18942 6364.9 cuMemcpyAsync
0.0 26622 4 6655.5 5169.5 3636 12647 4213.0 cuStreamCreate
0.0 18077 6 3012.8 1609.0 605 8521 3139.4 cuEventRecord
0.0 16289 1 16289.0 16289.0 16289 16289 0.0 cuLaunchKernel
0.0 9160 4 2290.0 1484.0 1085 5107 1907.8 cuStreamDestroy_v2
0.0 5768 6 961.3 603.0 368 2126 763.3 cuEventCreate
0.0 2630 11 239.1 122.0 105 776 236.9 cuCtxSetCurrent
0.0 1842 4 460.5 457.0 351 577 101.0 cuEventDestroy_v2
[5/7] Executing 'gpukernsum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) GridXYZ BlockXYZ Name
100.0 7308627231 1 7308627231.0 7308627231.0 7308627231 7308627231 0.0 1024 1024 1 16 16 1 Typeinfo name for MatrixMultiply(int, char **, int, const sycl::_V1::range<(int)3> &, const sycl::_…
[6/7] Executing 'gpumemtimesum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
67.2 334089782 2 167044891.0 167044891.0 166574324 167515458 665482.2 [CUDA memcpy HtoD]
32.8 163037802 1 163037802.0 163037802.0 163037802 163037802 0.0 [CUDA memcpy DtoH]
[7/7] Executing 'gpumemsizesum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
2147.484 2 1073.742 1073.742 1073.742 1073.742 0.000 [CUDA memcpy HtoD]
1073.742 1 1073.742 1073.742 1073.742 1073.742 0.000 [CUDA memcpy DtoH]
```
nsys gives you an application-level overview and tracks CUDA API calls; ncu gives you a kernel profile but doesn't know what's happening outside of a CUDA kernel. What would be interesting is a kernel profile, so that we can see what's happening at the kernel level.
Hey, for now I don't have root access to install NCU; in the meantime, could someone run NCU, please? Thanks
@MarkusBuettner Hi, I was able to run NCU on the RTX 2070. I tried with size 8192.
CUDA
```
void MatrixMulCUDA<(int)16>(float *, float *, float *, int, int) (512, 512, 1)x(16, 16, 1), Context 1, Stream 13, Device 0, CC 7.5

Section: GPU Speed Of Light Throughput
Metric Name Metric Unit Metric Value
----------------------- ------------- -------------
DRAM Frequency cycle/nsecond 6.79
SM Frequency cycle/nsecond 1.41
Elapsed Cycles cycle 1932539592
Memory Throughput % 74.30
DRAM Throughput % 23.94
Duration second 1.36
L1/TEX Cache Throughput % 95.20
L2 Cache Throughput % 14.60
SM Active Cycles cycle 1927275789.33
Compute (SM) Throughput % 74.30
----------------------- ------------- -------------
INF Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced.
Check both the Compute Workload Analysis and Memory Workload Analysis sections.
Section: GPU Speed Of Light Roofline Chart
INF The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved 12%
of this device's fp32 peak performance and 0% of its fp64 peak performance. See the Kernel Profiling Guide
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline) for more details on roofline
analysis.
Section: Compute Workload Analysis
-------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------- ----------- ------------
Executed Ipc Active inst/cycle 0.77
Executed Ipc Elapsed inst/cycle 0.77
Issue Slots Busy % 19.37
Issued Ipc Active inst/cycle 0.77
SM Busy % 26.83
-------------------- ----------- ------------
WRN All compute pipelines are under-utilized. Either this kernel is very small or it doesn't issue enough warps
per scheduler. Check the Launch Statistics and Scheduler Statistics sections for further details.
Section: Memory Workload Analysis
----------------- ------------ ------------
Metric Name Metric Unit Metric Value
----------------- ------------ ------------
Memory Throughput Gbyte/second 103.99
Mem Busy % 47.60
Max Bandwidth % 74.30
L1/TEX Hit Rate % 0.00
L2 Hit Rate % 48.49
Mem Pipes Busy % 74.30
----------------- ------------ ------------
Section: Memory Workload Analysis Tables
WRN The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory
request.
----- --------------------------------------------------------------------------------------------------------------
WRN The memory access pattern for stores from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced stores and try to minimize how many cache lines need to be accessed per memory
request.
Section: Scheduler Statistics
---------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------- ----------- ------------
One or More Eligible % 19.37
Issued Warp Per Scheduler 0.19
No Eligible % 80.63
Active Warps Per Scheduler warp 8.00
Eligible Warps Per Scheduler warp 0.61
---------------------------- ----------- ------------
WRN Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only
issues an instruction every 5.2 cycles. This might leave hardware resources underutilized and may lead to
less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of
8.00 active warps per scheduler, but only an average of 0.61 warps were eligible per cycle. Eligible warps
are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible
warp results in no instruction being issued and the issue slot remains unused. To increase the number of
eligible warps, avoid possible load imbalances due to highly different execution durations per warp.
Reducing stalls indicated on the Warp State Statistics and Source Counters sections can help, too.
Section: Warp State Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Warp Cycles Per Issued Instruction cycle 41.29
Warp Cycles Per Executed Instruction cycle 41.29
Avg. Active Threads Per Warp 32
Avg. Not Predicated Off Threads Per Warp 32.00
---------------------------------------- ----------- ------------
WRN On average, each warp of this kernel spends 19.8 cycles being stalled waiting for an MIO instruction queue to
be not full. This represents about 47.9% of the total average of 41.3 cycles between issuing two
instructions. This stall reason is high in cases of utilization of the MIO pipelines, which include special
math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory
accesses, trying to use fewer but wider loads can reduce pipeline pressure.
----- --------------------------------------------------------------------------------------------------------------
INF Check the Source Counters section for the top stall locations in your source based on sampling data. The
Kernel Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sampling) provides
more details on each stall reason.
Section: Instruction Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Avg. Executed Instructions Per Scheduler inst 373278492.44
Executed Instructions inst 53752102912
Avg. Issued Instructions Per Scheduler inst 373278530.94
Issued Instructions inst 53752108456
---------------------------------------- ----------- ------------
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 262144
Registers Per Thread register/thread 39
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 2.05
Threads thread 67108864
Waves Per SM 1820.44
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 6
Block Limit Shared Mem block 16
Block Limit Warps block 4
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 100
Achieved Occupancy % 99.96
Achieved Active Warps Per SM warp 31.99
------------------------------- ----------- ------------
INF This kernel's theoretical occupancy is not impacted by any block limit.
Section: Source Counters
------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------- ----------- ------------
Branch Instructions Ratio % 0.02
Branch Instructions inst 1077936128
Branch Efficiency % 100
Avg. Divergent Branches 0
------------------------- ----------- ------------
```
SYCL

```
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==PROF== Disconnected from process 1235
[1235] mm_sycl@127.0.0.1
Typeinfo name for MatrixMultiply(int, char **, int, const sycl::_V1::range<(int)3> &, const sycl::_V1::range<(int)3> &)::[lambda(sycl::_V1::handler &) (instance 1)]::operator ()(sycl::_V1::handler &) const::[lambda(sycl::_V1::nd_item<(int)3>) (instance 1)] (512, 512, 1)x(16, 16, 1), Context 1, Stream 15, Device 0, CC 7.5

Section: GPU Speed Of Light Throughput
Metric Name Metric Unit Metric Value
----------------------- ------------- -------------
DRAM Frequency cycle/nsecond 6.79
SM Frequency cycle/nsecond 1.41
Elapsed Cycles cycle 2449660647
Memory Throughput % 87.91
DRAM Throughput % 18.48
Duration second 1.73
L1/TEX Cache Throughput % 94.62
L2 Cache Throughput % 11.52
SM Active Cycles cycle 2443178755.81
Compute (SM) Throughput % 87.91
----------------------- ------------- -------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing workloads in the Compute Workload Analysis section.
Section: GPU Speed Of Light Roofline Chart
INF The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved 10%
of this device's fp32 peak performance and 0% of its fp64 peak performance. See the Kernel Profiling Guide
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline) for more details on roofline
analysis.
Section: Compute Workload Analysis
-------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------- ----------- ------------
Executed Ipc Active inst/cycle 0.76
Executed Ipc Elapsed inst/cycle 0.76
Issue Slots Busy % 18.95
Issued Ipc Active inst/cycle 0.76
SM Busy % 30.93
-------------------- ----------- ------------
WRN All compute pipelines are under-utilized. Either this kernel is very small or it doesn't issue enough warps
per scheduler. Check the Launch Statistics and Scheduler Statistics sections for further details.
Section: Memory Workload Analysis
----------------- ------------ ------------
Metric Name Metric Unit Metric Value
----------------- ------------ ------------
Memory Throughput Gbyte/second 80.36
Mem Busy % 47.31
Max Bandwidth % 87.91
L1/TEX Hit Rate % 0.00
L2 Hit Rate % 49.65
Mem Pipes Busy % 87.91
----------------- ------------ ------------
Section: Memory Workload Analysis Tables
WRN The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory
request.
----- --------------------------------------------------------------------------------------------------------------
WRN The memory access pattern for stores from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced stores and try to minimize how many cache lines need to be accessed per memory
request.
Section: Scheduler Statistics
---------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------- ----------- ------------
One or More Eligible % 18.95
Issued Warp Per Scheduler 0.19
No Eligible % 81.05
Active Warps Per Scheduler warp 8.00
Eligible Warps Per Scheduler warp 0.71
---------------------------- ----------- ------------
WRN Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only
issues an instruction every 5.3 cycles. This might leave hardware resources underutilized and may lead to
less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of
8.00 active warps per scheduler, but only an average of 0.71 warps were eligible per cycle. Eligible warps
are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible
warp results in no instruction being issued and the issue slot remains unused. To increase the number of
eligible warps, avoid possible load imbalances due to highly different execution durations per warp.
Reducing stalls indicated on the Warp State Statistics and Source Counters sections can help, too.
Section: Warp State Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Warp Cycles Per Issued Instruction cycle 42.20
Warp Cycles Per Executed Instruction cycle 42.20
Avg. Active Threads Per Warp 32
Avg. Not Predicated Off Threads Per Warp 32.00
---------------------------------------- ----------- ------------
WRN On average, each warp of this kernel spends 22.7 cycles being stalled waiting for an MIO instruction queue to
be not full. This represents about 53.8% of the total average of 42.2 cycles between issuing two
instructions. This stall reason is high in cases of utilization of the MIO pipelines, which include special
math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory
accesses, trying to use fewer but wider loads can reduce pipeline pressure.
----- --------------------------------------------------------------------------------------------------------------
INF Check the Source Counters section for the top stall locations in your source based on sampling data. The
Kernel Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sampling) provides
more details on each stall reason.
Section: Instruction Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Avg. Executed Instructions Per Scheduler inst 463048248.89
Executed Instructions inst 66678947840
Avg. Issued Instructions Per Scheduler inst 463048322.28
Issued Instructions inst 66678958409
---------------------------------------- ----------- ------------
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 262144
Registers Per Thread register/thread 51
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block Kbyte/block 2.05
Static Shared Memory Per Block byte/block 0
Threads thread 67108864
Waves Per SM 1820.44
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 4
Block Limit Shared Mem block 16
Block Limit Warps block 4
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 100
Achieved Occupancy % 99.97
Achieved Active Warps Per SM warp 31.99
------------------------------- ----------- ------------
INF This kernel's theoretical occupancy is not impacted by any block limit.
Section: Source Counters
------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------- ----------- ------------
Branch Instructions Ratio % 0.02
Branch Instructions inst 1077936128
Branch Efficiency % 100
Avg. Divergent Branches 0
------------------------- ----------- ------------
```
RTX 3070, size 16384:
CUDA

```
Section: GPU Speed Of Light Throughput
Metric Name Metric Unit Metric Value
----------------------- ------------- --------------
DRAM Frequency cycle/nsecond 6.79
SM Frequency cycle/nsecond 1.50
Elapsed Cycles cycle 11281589358
Memory Throughput % 79.46
DRAM Throughput % 38.08
Duration second 7.52
L1/TEX Cache Throughput % 79.47
L2 Cache Throughput % 20.05
SM Active Cycles cycle 11280132003.54
Compute (SM) Throughput % 79.46
----------------------- ------------- --------------
INF Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced.
Check both the Compute Workload Analysis and Memory Workload Analysis sections.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 1048576
Registers Per Thread register/thread 38
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 2.05
Threads thread 268435456
Waves Per SM 3799.19
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 6
Block Limit Shared Mem block 10
Block Limit Warps block 6
Theoretical Active Warps per SM warp 48
Theoretical Occupancy % 100
Achieved Occupancy % 100.01
Achieved Active Warps Per SM warp 48.01
------------------------------- ----------- ------------
```
SYCL

```
Section: GPU Speed Of Light Throughput
Metric Name Metric Unit Metric Value
----------------------- ------------- --------------
DRAM Frequency cycle/nsecond 6.79
SM Frequency cycle/nsecond 1.50
Elapsed Cycles cycle 13643340229
Memory Throughput % 98.56
DRAM Throughput % 29.76
Duration second 9.10
L1/TEX Cache Throughput % 98.60
L2 Cache Throughput % 16.60
SM Active Cycles cycle 13638202932.96
Compute (SM) Throughput % 98.56
----------------------- ------------- --------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing workloads in the Compute Workload Analysis section.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 1048576
Registers Per Thread register/thread 40
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block Kbyte/block 2.05
Static Shared Memory Per Block byte/block 0
Threads thread 268435456
Waves Per SM 3799.19
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 6
Block Limit Shared Mem block 10
Block Limit Warps block 6
Theoretical Active Warps per SM warp 48
Theoretical Occupancy % 100
Achieved Occupancy % 100.04
Achieved Active Warps Per SM warp 48.02
------------------------------- ----------- ------------
```
Could you list the "ncu" command used for profiling the SYCL program?
Yes, I did the same for both SYCL and CUDA:
```
SYCL_DEVICE_FILTER=gpu /opt/nvidia/nsight-compute/2022.4.0/ncu ./mm_sycl 16384 16384 16384 16384
```
It may also be helpful to generate an ncu-rep file:
```
SYCL_DEVICE_FILTER=gpu /opt/nvidia/nsight-compute/2022.4.0/ncu --set=full -o mm_sycl ./mm_sycl 16384 16384 16384 16384
```
You can view the reports in ncu-ui even if you don't have a CUDA-capable device on your machine.
But from the logs you posted, it looks like the SYCL kernel executes about 24% more instructions (66,678,947,840 vs. 53,752,102,912 executed instructions). The interesting question now is why...
Also, is there a reason why you're using dimensions 1/2 instead of 0/1 in the SYCL kernel?
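To make that concrete, here is a sketch of the two indexing variants (assuming the usual mapping where the rightmost dimension varies fastest; `item_ct1` is the 3D nd_item from the ported code and `item2d` a hypothetical 2D one):

```cpp
// 3D nd_range, as in the ported code: dimension 2 plays the role of threadIdx.x
int tx = item_ct1.get_local_id(2);
int ty = item_ct1.get_local_id(1);

// 2D nd_range alternative: dimension 1 is then the fastest-varying one
// int tx = item2d.get_local_id(1);
// int ty = item2d.get_local_id(0);
```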
Hi @MarkusBuettner, there's no reason at all. I tried with the 0/1 dimensions, but it was the same.
I also posted on this forum: https://support.codeplay.com/t/poor-performance-on-matrix-multiplication/575 and got an answer that might solve the mystery. The question now is why.
Ah, that's interesting. I'd like to know why the clang compiler didn't optimize this.
On the other hand, I've noticed that Open SYCL (which also uses clang under the hood) sometimes generates slower code than Intel/Codeplay, probably because they use different flags?
I tried with Open SYCL, but I saw the same performance behavior. Maybe there is something inside clang that all of the implementations share.
Hi. I've been passed this by my colleague, and at first glance I'm inclined to agree with rbielski's initial assessment on the Codeplay forums, but I'll let you know what I find.
As we've said on the Codeplay forum, we suggest not using local accessors and instead using the Intel extension for local memory.
In short, the compiler benefits significantly from knowing the compile-time bounds of the local allocation in this case, as it can statically know the constant offsets for all memory accesses in the following offending loop:
```cpp
for (int k = 0; k < BLOCK_SIZE; ++k) {
  Csub += As[ty][k] * Bs[k][tx];
}
```
Using local accessors does not convey this information across the host/device boundary.
In `MatrixMulCUDA`, remove the `local_accessor` parameters `As` and `Bs`, and instead declare the following:
```cpp
using namespace sycl::ext::oneapi;
auto& As = *group_local_memory_for_overwrite<float[BLOCK_SIZE][BLOCK_SIZE]>(item_ct1.get_group());
auto& Bs = *group_local_memory_for_overwrite<float[BLOCK_SIZE][BLOCK_SIZE]>(item_ct1.get_group());
```
This should be enough to match the performance of CUDA - it does locally.
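For anyone who wants to see it in context, here is a rough sketch of the modified tile loop with those group-local arrays (assuming `BLOCK_SIZE`, `item_ct1`, the kernel arguments `A` and `B`, and the usual `aBegin`/`aEnd`/`aStep`/`bBegin`/`bStep`/`wA`/`wB` bookkeeping from the original sample are in scope):

```cpp
using namespace sycl::ext::oneapi;

// Group-local tiles with compile-time extents, replacing the local_accessor parameters.
auto &As = *group_local_memory_for_overwrite<float[BLOCK_SIZE][BLOCK_SIZE]>(item_ct1.get_group());
auto &Bs = *group_local_memory_for_overwrite<float[BLOCK_SIZE][BLOCK_SIZE]>(item_ct1.get_group());

const int tx = item_ct1.get_local_id(2);
const int ty = item_ct1.get_local_id(1);
float Csub = 0.0f;

for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
  // Each work-item loads one element of each tile into local memory.
  As[ty][tx] = A[a + wA * ty + tx];
  Bs[ty][tx] = B[b + wB * ty + tx];
  item_ct1.barrier(sycl::access::fence_space::local_space);

  // With the static extents, the compiler can resolve these offsets at compile time.
  for (int k = 0; k < BLOCK_SIZE; ++k) {
    Csub += As[ty][k] * Bs[k][tx];
  }
  item_ct1.barrier(sycl::access::fence_space::local_space);
}
```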
Hi! How are you?
I'm posting here because I have a question about a comparison between CUDA and SYCL (using oneAPI): CUDA is getting ~600 GFLOPS more than SYCL. I have the following CUDA code, which implements matrix multiplication.
To compile, you can use:

```
nvcc program.cu -o program_cuda -O3
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda program.cpp -o program_sycl -O3
```

To run, you can use:

```
./program 16384 16384 16384 16384
```
My theory is that CUDA's `__shared__` memory gets better performance than the local accessors in SYCL, but I'm not really sure.
Thank you very much.
CUDA CODE:
EQUIVALENT SYCL CODE: