Infinoid opened this issue 3 years ago (Open)
Ah, I never understood why the CUDA code generator decided to emit break for a Continue operation. Do you think you could post the full generated code here? In my mind, if the for loop being continue'd out of is parallelized over GPU threads, then we would want to return for a continue.
cc @fredrikbk it seems like we should be running CI tests on nodes that have GPUs, rather than finding bugs like these after the fact. Is there a reason we don't (money)?
Sure.
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)
__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C){
  double* __restrict__ A_vals = (double*)(A->vals);
  int* __restrict__ B2_pos = (int*)(B->indices[1][0]);
  int* __restrict__ B2_crd = (int*)(B->indices[1][1]);
  double* __restrict__ B_vals = (double*)(B->vals);
  int C2_dimension = (int)(C->dimensions[1]);
  double* __restrict__ C_vals = (double*)(C->vals);
  int32_t i125 = blockIdx.x;
  int32_t i126 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }
  int32_t i = i125 * 256 + i126;
  int32_t iB = i * 5;
  int32_t iC = i * 5;
  if (i >= 2)
    return;
  int32_t j = 0;
  int32_t jB = taco_binarySearchAfter(B2_crd, (uint64_t) B2_pos[iB], (uint64_t) B2_pos[(iB + 1)], (uint64_t) 0);
  int32_t pB2_end = B2_pos[(iB + 1)];
  while (jB < pB2_end) {
    int32_t jB2_window = B2_crd[jB];
    if (jB2_window % 5 != 0) {
      jB = jB + 1;
      return; // ← THIS IS THE PROBLEM
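      // (returning here exits the kernel for this thread entirely, so its remaining
      // window entries and the cleanup loop below are silently skipped)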
    }
    int32_t jB0 = jB2_window / 5;
    if (jB0 >= 10)
      break;
    int32_t jC = iC * C2_dimension + j * 5;
    if (jB0 == j) {
      int32_t jA = i * 2 + j;
      A_vals[jA] = B_vals[jB] + C_vals[jC];
    }
    else {
      int32_t jA = i * 2 + j;
      A_vals[jA] = C_vals[jC];
    }
    jB = jB + (int32_t)(jB0 == j);
    j = j + 1;
  }
  while (j < 2 && j >= 0) {
    int32_t jC = iC * C2_dimension + j * 5;
    int32_t jA = i * 2 + j;
    A_vals[jA] = C_vals[jC];
    j = j + 1;
  }
}

int compute(taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *C) {
  double* __restrict__ A_vals = (double*)(A->vals);
  for (int32_t pA = 0; pA < 4; pA++) {
    A_vals[pA] = 0.0;
  }
  computeDeviceKernel0<<<1, 256>>>(A, B, C);
  cudaDeviceSynchronize();
  return 0;
}
I generated this by applying your nice #419 patch and running the following:
bin/taco "A(i,j)=B(i(0,10,5),j(0,10,5))+C(i(0,10,5),j(0,10,5))" -d=A:2,2 -f=B:ds -f=C:dd -cuda -print-nocolor
Search for THIS IS THE PROBLEM to find the problematic return statement.
I see. In this case we definitely shouldn't be returning.
> In my mind, if the for loop being continue'd out of is parallelized over GPU threads, then we would want to return for a continue.
I think that this makes sense for the code generator to do. I think the last thing to figure out is why/when break statements get generated by the CUDA code generator. If we know what the use is there then we can decide when a Continue should actually be mapped to continue. @stephenchouca do you know what the deal with this is?
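To make the distinction concrete, here is a minimal hand-written sketch (not taco output; the kernel names, bounds, and bodies are made up). When the loop the Continue targets is the one distributed across GPU threads, each thread owns exactly one of its iterations, so skipping the rest of that iteration really is a return from the kernel. When the Continue targets a loop that runs sequentially inside a single thread, like the while loop in the generated kernel above, it has to stay a continue, otherwise the thread drops the rest of its work.

// Hypothetical examples, not generated by taco.
// Case 1: the i loop is parallelized over threads, one iteration per thread.
// A "continue" of that loop is correctly lowered to a return from the kernel.
__global__ void parallelLoopContinue(double* out, const double* in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (in[i] < 0.0) return;               // continue of the parallel i loop == return
  out[i] = in[i] * 2.0;
}

// Case 2: the j loop runs sequentially inside each thread.
// A "continue" of that loop must stay a continue; returning would drop the
// rest of the thread's j iterations, which is exactly the bug shown above.
__global__ void sequentialLoopContinue(double* out, const double* in, int n, int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  for (int j = 0; j < m; j++) {
    if (in[i * m + j] < 0.0) continue;   // must remain a continue
    out[i * m + j] = in[i * m + j] * 2.0;
  }
}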
@rohany This is the CUDA regression issue I mentioned in the TACO meeting today. From #456 and many of the CUDA builds in the Actions tab, the following tests still fail:
The following tests FAILED:
183 - */stride.windowing/* (Failed)
184 - */indexSetVectors.windowing/* (Failed)
185 - */indexSetMatrices.windowing/* (Failed)
Could you take a look?
Any progress on this fix? I pulled the most recent version of TACO yesterday and these tests are still failing. Is there a PR or branch with the fix implemented that I could try out?
Hi David, thank you for your interest! Unfortunately, I don't think we have any students actively working on TACO at the moment, so I'm unsure when this will get fixed. I have other responsibilities and cannot take the time to fix this right now.
Hi Rohan,
No worries ... is TACO still the state of the art for tensor compilers, or do you have other projects that are meant as successors?
For context, I'm working on some elastodynamics simulation and would love to have a TACO-like backend for compiling to CPU and GPU
In terms of sparse tensor computations, TACO is still the state of the art (measured by the number of supported features). However, the MLIR sparse tensor dialect (https://mlir.llvm.org/docs/Dialects/SparseTensorOps/) supports a subset of TACO's features in a Google-funded library. I've never used it, but I assume it is quite robust.
> For context, I'm working on some elastodynamics simulation and would love to have a TACO-like backend for compiling to CPU and GPU
What sort of kernels are you trying to run?
The big bottlenecks are assembly-type operations over the FE mesh and iterative solvers like preconditioned CG.
So breaking this down a bit more, what kind of assembly are you doing? What are the input and output tensor types? TACO-like tensor compilers can do matrix assembly, but many cases do not parallelize well and would be difficult to target directly to GPUs with TACO.
Depending on how complicated the preconditioner is, my experience with CG solvers is that there aren't too many complicated tensor expressions present (SpMV, axpy, dot). Do you actually need tensor operations, or are matrix operations mostly the target? If so, I have some other libraries I can suggest.
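For concreteness, here is a rough sketch of an unpreconditioned CG loop written only in terms of SpMV, dot, and axpy (plain C++ with a hypothetical CSR type, not any particular library's API); a preconditioner would add one more apply per iteration, but the overall shape stays the same.

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CSR matrix type, just for illustration.
struct CsrMatrix {
  int n;
  std::vector<int> rowPtr, colInd;
  std::vector<double> vals;
};

// y = A * x  (SpMV)
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
  for (int i = 0; i < A.n; i++) {
    double sum = 0.0;
    for (int p = A.rowPtr[i]; p < A.rowPtr[i + 1]; p++)
      sum += A.vals[p] * x[A.colInd[p]];
    y[i] = sum;
  }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
  return s;
}

// y = y + alpha * x  (axpy)
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t i = 0; i < y.size(); i++) y[i] += alpha * x[i];
}

// Unpreconditioned CG, assuming x starts at zero: every step is SpMV, dot, or axpy.
void cg(const CsrMatrix& A, const std::vector<double>& b, std::vector<double>& x,
        int maxIters, double tol) {
  std::vector<double> r = b, p = b, Ap(A.n, 0.0);
  double rsOld = dot(r, r);
  for (int k = 0; k < maxIters && std::sqrt(rsOld) > tol; k++) {
    spmv(A, p, Ap);
    double alpha = rsOld / dot(p, Ap);
    axpy(alpha, p, x);      // x += alpha * p
    axpy(-alpha, Ap, r);    // r -= alpha * Ap
    double rsNew = dot(r, r);
    double beta = rsNew / rsOld;
    for (std::size_t i = 0; i < p.size(); i++) p[i] = r[i] + beta * p[i];  // p = r + beta * p
    rsOld = rsNew;
  }
}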
Right, that's true. Mostly dense and sparse matrix operations would already be extremely helpful.
Have you looked into CuPy at all? It will help you target a single GPU. If you have a larger problem that can utilize multiple GPUs, take a look at https://github.com/nv-legate/legate.sparse (shameless self-plug).
Oh, I haven't seen this, thanks! I'll take a look :) (sorry for hijacking this issue thread)
The windowing stride syntax doesn't work quite right when configured with -DCUDA=ON. These 3 test cases fail: stride.windowing, indexSetVectors.windowing, and indexSetMatrices.windowing.
I dug into this a bit, and the stride syntax causes taco to emit C code that uses a continue in the inner loop. But the CUDA codegen doesn't emit continue for a Continue op. Instead it returns, as in the kernel pasted above. Returning is wrong in this case: it effectively drops the remainder of the thread's work on the floor, producing incorrect output. Changing that return to a continue fixes the problem. Adding a flag to force it to emit continue in just this one case allows the tests to pass.
I cooked up a hacky workaround which does that. I don't think it's quite right: it just forces the CUDA codegen to emit a continue, rather than giving the CUDA codegen enough info to decide for itself. But hopefully it illustrates the problem.
Cc: @rohany