tensor-compiler / array-programming-benchmarks

MIT License

taco: bug for user-defined fused function xor(and(A, C), and(B, C)) #29

Open weiya711 opened 3 years ago

weiya711 commented 3 years ago

When generating code for xor(and(A, C), and(B, C)) in taco, certain implementations produce results that do not match numpy.

1) The statements below, which use temporaries, produce correct code that matches dense numpy and pydata/sparse. For data/image/no/image1.jpg they produce 816 nnzs:

   temp1(i,j) = andOp1(A(i, j), C(i, j));
   temp2(i,j) = andOp1(B(i, j), C(i, j));
   result(i, j) = xorOp1(temp1(i,j), temp2(i,j));

2) This (supposedly) equivalent statement produces too many output nonzeros. For data/image/no/image1.jpg it produces 2646 nnzs:

    result(i, j) = xorOp1(andOp1(matrix1(i, j), matrix3(i, j)), andOp1(matrix2(i, j), matrix3(i, j)));

3) This (supposedly) equivalent statement also produces too many output nonzeros, with the same result as 2). For data/image/no/image1.jpg it produces 2646 nnzs:

    result(i, j) = xorAndOp(matrix1(i, j), matrix2(i, j), matrix3(i, j));

For case 3) the generated code from the lowerer is:

int compute(taco_tensor_t *result, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *C) {
  int result1_dimension = (int)(result->dimensions[0]);
  int64_t* restrict result_vals = (int64_t*)(result->vals);
  int A1_dimension = (int)(A->dimensions[0]);
  int* restrict A2_pos = (int*)(A->indices[1][0]);
  int* restrict A2_crd = (int*)(A->indices[1][1]);
  int B1_dimension = (int)(B->dimensions[0]);
  int* restrict B2_pos = (int*)(B->indices[1][0]);
  int* restrict B2_crd = (int*)(B->indices[1][1]);
  int C1_dimension = (int)(C->dimensions[0]);
  int* restrict C2_pos = (int*)(C->indices[1][0]);
  int* restrict C2_crd = (int*)(C->indices[1][1]);
  int32_t jresult = 0;
  for (int32_t i = 0; i < C1_dimension; i++) {
    int32_t jA = A2_pos[i];
    int32_t pA2_end = A2_pos[(i + 1)];
    int32_t jC = C2_pos[i];
    int32_t pC2_end = C2_pos[(i + 1)];
    int32_t jB = B2_pos[i];
    int32_t pB2_end = B2_pos[(i + 1)];
    while ((jA < pA2_end && jC < pC2_end) && jB < pB2_end) {
      int32_t jA0 = A2_crd[jA];
      int32_t jC0 = C2_crd[jC];
      int32_t jB0 = B2_crd[jB];
      int32_t j = TACO_MIN(jA0,TACO_MIN(jC0,jB0));
      if ((jA0 == j && jC0 == j) && jB0 == j) {
        result_vals[jresult] = 1;
        jresult++;
      }
      else if (jB0 == j && jC0 == j) {
        result_vals[jresult] = 1;
        jresult++;
      }
      else if (jA0 == j && jC0 == j) {
        result_vals[jresult] = 1;
        jresult++;
      }
      jA += (int32_t)(jA0 == j);
      jC += (int32_t)(jC0 == j);
      jB += (int32_t)(jB0 == j);
    }
    while (jB < pB2_end && jC < pC2_end) {
      int32_t jB0 = B2_crd[jB];
      int32_t jC0 = C2_crd[jC];
      int32_t j = TACO_MIN(jB0,jC0);
      if (jB0 == j && jC0 == j) {
        result_vals[jresult] = 1;
        jresult++;
      }
      jB += (int32_t)(jB0 == j);
      jC += (int32_t)(jC0 == j);
    }
    while (jA < pA2_end && jC < pC2_end) {
      int32_t jA0 = A2_crd[jA];
      int32_t jC0 = C2_crd[jC];
      int32_t j = TACO_MIN(jA0,jC0);
      if (jA0 == j && jC0 == j) {
        result_vals[jresult] = 1;
        jresult++;
      }
      jA += (int32_t)(jA0 == j);
      jC += (int32_t)(jC0 == j);
    }
  }
  return 0;
}

The first case statement, if ((jA0 == j && jC0 == j) && jB0 == j), does not seem correct: a coordinate in the intersection of all three tensors (A, B, and C) evaluates to xor(1, 1) == 0 and so should not be included in the defined fused iteration space of xor(and(A, C), and(B, C)).

The above ops are defined as:

struct Boolean {
  ir::Expr operator()(const std::vector<ir::Expr> &v) {
    taco_iassert(v.size() >= 1) << "Add operator needs at least one operand";
    return ir::Literal::make(int64_t(1), v[0].type());
  }
};

struct xorAlgebra {
  IterationAlgebra operator()(const std::vector<IndexExpr>& regions) {
    IterationAlgebra noIntersect = Complement(Intersect(regions[0], regions[1]));
    return Intersect(noIntersect, Union(regions[0], regions[1]));
  }
};

struct andAlgebra {
  IterationAlgebra operator()(const std::vector<IndexExpr>& regions) {
    return Intersect(regions[0], regions[1]);
  }
};

struct xorAndAlgebra {
  IterationAlgebra operator()(const std::vector<IndexExpr>& regions) {
    auto m1 = Intersect(regions[0], regions[2]);
    auto m2 = Intersect(regions[1], regions[2]);
    auto noIntersect = Complement(Intersect(m1, m2));
    return Intersect(noIntersect, Union(m1, m2));
  }
};

Func xorOp1("logical_xor", Boolean(), xorAlgebra());
Func andOp1("logical_and", Boolean(), andAlgebra());
Func xorAndOp("fused_xor_and", Boolean(), xorAndAlgebra());
weiya711 commented 3 years ago

There is a simple fix to this which is to define the iteration algebra as:

struct xorAndAlgebra {
  IterationAlgebra operator()(const std::vector<IndexExpr>& regions) {
    auto m1 = Intersect(regions[0], regions[2]);
    auto m2 = Intersect(regions[1], regions[2]);
    auto noIntersect = Complement(Intersect(Intersect(regions[0], regions[1]), regions[2]));
    return Intersect(noIntersect, Union(m1, m2));
  }
};

The bug is fundamentally caused by the iteration lattice construction algorithm being incorrect when it takes the cartesian product of points from the two input iteration lattices. Discussed this with @rohany, @fredrikbk, and @RawnH. The intersection rule in intersectLattices(MergeLattice left, MergeLattice right) and the union rule in unionLattices(MergeLattice left, MergeLattice right) are not correct in taking the cartesian product of all points: if the left and right lattices share an input tensor (either as an iterator or a locator), the cartesian product will attempt to merge some pairs of points that should NEVER intersect.

As an example, merging left: (~A U ~C) with right: (~B U ~C) gives the input lattices:

    left:  (i | A, C | O), (i | C | P), (i | A | P), (i | | P), ( | | )
    right: (i | B, C | O), (i | C | P), (i | B | P), (i | | P), ( | | )

When merging to find the point (i | A, B, C | ?) we get these three point products:

  1. (i | A, C | O) x (i | B, C | O)
  2. (i | A | P) x (i | B, C | O)
  3. (i | A, C | O) x (i | B | P)

But point products 2. and 3. should never occur, since the iteration subspace regions of their left and right points DO NOT overlap (their intersection is the empty set). This was not an issue in Fred's original thesis, since there the iteration lattice construction only checks for the existence of the newly merged point, which does not need to resolve duplicates, and the non-overlapping point products always yield a point that should be produced anyway.

Solution: to check for non-overlapping points when taking the cartesian product, follow the root point of each input lattice to see which tensor inputs are NOT in the product points. In the example above, for product 2. the left root has A, C, but the left point (i | A | P) is missing C while the right point (i | B, C | O) has a C, so the two points do not overlap in the iteration space. The same justification applies to 3. So the fix is to add a check (in pseudocode below):

product_points = {}
for each (left_pt, right_pt) in the cartesian product {
    hasIntersection = true
    if (exists a tensor such that
        (tensor in left_root && !(tensor in left_pt) && tensor in right_pt) ||
        (tensor in right_root && !(tensor in right_pt) && tensor in left_pt)) {
        hasIntersection = false
    }
    if (hasIntersection)
        product_points += (left_pt, right_pt)
}

Therefore, the quick fix works: with the rewritten algebra, the construction never needs to merge two iteration lattices that share an input tensor, since the computation becomes (~A U ~C) merged with (~B) only.

The algorithm fix for this in taco can be found in the branch: https://github.com/tensor-compiler/taco/tree/array_algebra_iter_const