Infinoid opened this issue 3 years ago (Open)
Ah, I never understood why the CUDA code generator decided to emit break for a Continue operation. Do you think you could post the full generated code here? In my mind, if the for loop being continue'd out of is parallelized over GPU threads, then we would want to return for a continue.
cc @fredrikbk it seems like we should be running CI tests on nodes that have GPUs, rather than finding bugs like these after the fact. Is there a reason we don't (money)?
Sure.
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)
__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C){
  double* __restrict__ A_vals = (double*)(A->vals);
  int* __restrict__ B2_pos = (int*)(B->indices[1][0]);
  int* __restrict__ B2_crd = (int*)(B->indices[1][1]);
  double* __restrict__ B_vals = (double*)(B->vals);
  int C2_dimension = (int)(C->dimensions[1]);
  double* __restrict__ C_vals = (double*)(C->vals);
  int32_t i125 = blockIdx.x;
  int32_t i126 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }
  int32_t i = i125 * 256 + i126;
  int32_t iB = i * 5;
  int32_t iC = i * 5;
  if (i >= 2)
    return;
  int32_t j = 0;
  int32_t jB = taco_binarySearchAfter(B2_crd, (uint64_t) B2_pos[iB], (uint64_t) B2_pos[(iB + 1)], (uint64_t) 0);
  int32_t pB2_end = B2_pos[(iB + 1)];
  while (jB < pB2_end) {
    int32_t jB2_window = B2_crd[jB];
    if (jB2_window % 5 != 0) {
      jB = jB + 1;
      return; // ← THIS IS THE PROBLEM
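      // (returning here exits the kernel for this thread entirely, so its remaining
      // window entries and the cleanup loop below are silently skipped)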
    }
    int32_t jB0 = jB2_window / 5;
    if (jB0 >= 10)
      break;
    int32_t jC = iC * C2_dimension + j * 5;
    if (jB0 == j) {
      int32_t jA = i * 2 + j;
      A_vals[jA] = B_vals[jB] + C_vals[jC];
    }
    else {
      int32_t jA = i * 2 + j;
      A_vals[jA] = C_vals[jC];
    }
    jB = jB + (int32_t)(jB0 == j);
    j = j + 1;
  }
  while (j < 2 && j >= 0) {
    int32_t jC = iC * C2_dimension + j * 5;
    int32_t jA = i * 2 + j;
    A_vals[jA] = C_vals[jC];
    j = j + 1;
  }
}

int compute(taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *C) {
  double* __restrict__ A_vals = (double*)(A->vals);
  for (int32_t pA = 0; pA < 4; pA++) {
    A_vals[pA] = 0.0;
  }
  computeDeviceKernel0<<<1, 256>>>(A, B, C);
  cudaDeviceSynchronize();
  return 0;
}
I generated this by applying your nice #419 patch and running the following:
bin/taco "A(i,j)=B(i(0,10,5),j(0,10,5))+C(i(0,10,5),j(0,10,5))" -d=A:2,2 -f=B:ds -f=C:dd -cuda -print-nocolor
Search for THIS IS THE PROBLEM to find the problematic return statement.
I see. In this case we definitely shouldn't be returning.
> In my mind, if the for loop being continue'd out of is parallelized over GPU threads, then we would want to return for a continue.
I think that this makes sense for the code generator to do. I think the last thing to figure out is why/when break statements get generated by the CUDA code generator. If we know what the use is there then we can decide when a Continue should actually be mapped to continue. @stephenchouca do you know what the deal with this is?
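To make the distinction concrete, here is a minimal hand-written sketch (not taco output; the kernel names, bounds, and bodies are made up). When the loop the Continue targets is the one distributed across GPU threads, each thread owns exactly one of its iterations, so skipping the rest of that iteration really is a return from the kernel. When the Continue targets a loop that runs sequentially inside a single thread, like the while loop in the generated kernel above, it has to stay a continue, otherwise the thread drops the rest of its work.

// Hypothetical examples, not generated by taco.
// Case 1: the i loop is parallelized over threads, one iteration per thread.
// A "continue" of that loop is correctly lowered to a return from the kernel.
__global__ void parallelLoopContinue(double* out, const double* in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (in[i] < 0.0) return;               // continue of the parallel i loop == return
  out[i] = in[i] * 2.0;
}

// Case 2: the j loop runs sequentially inside each thread.
// A "continue" of that loop must stay a continue; returning would drop the
// rest of the thread's j iterations, which is exactly the bug shown above.
__global__ void sequentialLoopContinue(double* out, const double* in, int n, int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  for (int j = 0; j < m; j++) {
    if (in[i * m + j] < 0.0) continue;   // must remain a continue
    out[i * m + j] = in[i * m + j] * 2.0;
  }
}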
@rohany This is the CUDA regression issue I mentioned in the TACO meeting today. From #456 and many of the CUDA builds in the Actions tab, the following tests still fail:
The following tests FAILED:
183 - */stride.windowing/* (Failed)
184 - */indexSetVectors.windowing/* (Failed)
185 - */indexSetMatrices.windowing/* (Failed)
Could you take a look?
Any progress on this fix? I pulled the most recent version of TACO yesterday and these tests are still failing. Is there a PR or branch with the fix implemented that I could try out?
Hi David, thank you for your interest! Unfortunately, I don't think we have any students actively working on TACO at the moment, so I'm unsure when this will get fixed. I have other responsibilities and cannot take the time to fix this right now.
Hi Rohan,
No worries ... is TACO still the state of the art for tensor compilers, or do you have other projects that are meant as successors?
For context, I'm working on some elastodynamics simulation and would love to have a TACO-like backend for compiling to CPU and GPU
In terms of sparse tensor computations, TACO is still the state of the art (measured by the number of supported features). However, the MLIR sparse tensor dialect (https://mlir.llvm.org/docs/Dialects/SparseTensorOps/) supports a subset of TACO's features in a Google-funded library. I've never used it, but I assume it is quite robust.
> For context, I'm working on some elastodynamics simulation and would love to have a TACO-like backend for compiling to CPU and GPU
What sort of kernels are you trying to run?
The big bottlenecks are assembly-type operations over the FE mesh and iterative solvers like preconditioned CG.
So breaking this down a bit more, what kind of assembly are you doing? What are the input and output tensor types? TACO-like tensor compilers can do matrix assembly, but many cases do not parallelize well and would be difficult to target directly to GPUs with TACO.
Depending on how complicated the preconditioner is, my experience with CG solvers is that there aren't too many complicated tensor expressions present (SpMV, axpy, dot). Do you actually need tensor operations, or are matrix operations mostly the target? If so, I have some other libraries I can suggest.
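For concreteness, here is a rough sketch of an unpreconditioned CG loop written only in terms of SpMV, dot, and axpy (plain C++ with a hypothetical CSR type, not any particular library's API); a preconditioner would add one more apply per iteration, but the overall shape stays the same.

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CSR matrix type, just for illustration.
struct CsrMatrix {
  int n;
  std::vector<int> rowPtr, colInd;
  std::vector<double> vals;
};

// y = A * x  (SpMV)
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
  for (int i = 0; i < A.n; i++) {
    double sum = 0.0;
    for (int p = A.rowPtr[i]; p < A.rowPtr[i + 1]; p++)
      sum += A.vals[p] * x[A.colInd[p]];
    y[i] = sum;
  }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
  return s;
}

// y = y + alpha * x  (axpy)
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t i = 0; i < y.size(); i++) y[i] += alpha * x[i];
}

// Unpreconditioned CG, assuming x starts at zero: every step is SpMV, dot, or axpy.
void cg(const CsrMatrix& A, const std::vector<double>& b, std::vector<double>& x,
        int maxIters, double tol) {
  std::vector<double> r = b, p = b, Ap(A.n, 0.0);
  double rsOld = dot(r, r);
  for (int k = 0; k < maxIters && std::sqrt(rsOld) > tol; k++) {
    spmv(A, p, Ap);
    double alpha = rsOld / dot(p, Ap);
    axpy(alpha, p, x);      // x += alpha * p
    axpy(-alpha, Ap, r);    // r -= alpha * Ap
    double rsNew = dot(r, r);
    double beta = rsNew / rsOld;
    for (std::size_t i = 0; i < p.size(); i++) p[i] = r[i] + beta * p[i];  // p = r + beta * p
    rsOld = rsNew;
  }
}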
Right, that's true. Mostly dense and sparse matrix operations would already be extremely helpful.
Have you looked into CuPy at all? It will help you target a single GPU. If you have a larger problem that can utilize multiple GPUs, take a look at https://github.com/nv-legate/legate.sparse (shameless self-plug).
Oh, I haven't seen this, thanks! I'll take a look :) (sorry for hijacking this issue thread)
The windowing stride syntax doesn't work quite right when configured with -DCUDA=ON. These 3 test cases fail: stride.windowing, indexSetVectors.windowing, and indexSetMatrices.windowing.
I dug into this a bit, and the stride syntax causes taco to emit C code that uses a continue in the inner loop. But the CUDA codegen doesn't emit continue for a Continue op. Instead it returns, as in the kernel pasted above. Returning is wrong in this case: it effectively drops the remainder of the thread's work on the floor, producing incorrect output. Changing that return to a continue fixes the problem. Adding a flag to force it to emit continue in just this one case allows the tests to pass.
I cooked up a hacky workaround which does that. I don't think it's quite right: it just forces the CUDA codegen to emit a continue, rather than giving the CUDA codegen enough info to decide for itself. But hopefully it illustrates the problem.
Cc: @rohany