tensor-compiler / taco

The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs
http://tensor-compiler.org

scheduling: inefficient code generated for CPU spmv with pos split #489

Open rohany opened 3 years ago

rohany commented 3 years ago

The code generated by taco when doing a pos split looks like this:

  for (int32_t fposA = 0; fposA < A2_pos[A1_dimension]; fposA++) {
    if (fposA >= A2_pos[A1_dimension])
      continue;

    int32_t f = A2_crd[fposA];
    while (fposA == A2_pos[(i_pos + 1)]) {
      i_pos++;
      i = i_pos;
    }
    y_vals[i] = y_vals[i] + A_vals[fposA] * x_vals[f];
  }

This code is inefficient because it misses an opportunity for register promotion -- values can be accumulated into a register for each i and only written out to memory when i actually changes. The improved code looks something like this:

  T local = 0;
  for (int32_t fposA = 0; fposA < A2_pos[A1_dimension]; fposA++) {
    if (fposA >= A2_pos[A1_dimension])
      continue;

    int32_t f = A2_crd[fposA];
    while (fposA == A2_pos[(i_pos + 1)]) {
      i_pos++;
      y_vals[i] += local;
      local = 0;
      i = i_pos;
    }
    local += A_vals[fposA] * x_vals[f];
  }

This optimization gives me around a 3x performance improvement on a test case. Without it, the pos-split kernel is 3x slower than the standard top-down spmv approach, which is also able to keep most writes in a register. I started looking into this because of the performance discrepancy when applying the pos split.

Some care has to be taken here when parallelizing the outer loop, as the local variable needs to be thread local. I'm not sure if there's a general way to do this.

rohany commented 3 years ago

I don't think this code is quite right yet (still working on it), but the idea seems fine -- we want to buffer updates to the same output location i for as long as possible and only write them out when i changes.
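
Here is a rough sketch of what I think the completed serial loop could look like. The function wrapper, the double value type, the initialization of i and i_pos, and the explicit tail flush are my own additions for illustration, not taco's actual output:

  #include <stdint.h>

  void spmv_pos_split(int32_t A1_dimension, const int32_t* A2_pos,
                      const int32_t* A2_crd, const double* A_vals,
                      const double* x_vals, double* y_vals) {
    int32_t i_pos = 0;
    int32_t i = 0;
    double local = 0.0;
    for (int32_t fposA = 0; fposA < A2_pos[A1_dimension]; fposA++) {
      int32_t f = A2_crd[fposA];
      // Crossing one or more row boundaries: flush the accumulator to the
      // row we just finished before moving on.
      while (fposA == A2_pos[(i_pos + 1)]) {
        y_vals[i] += local;
        local = 0.0;
        i_pos++;
        i = i_pos;
      }
      local += A_vals[fposA] * x_vals[f];
    }
    // Tail flush: the last nonempty row is never flushed inside the loop.
    if (A2_pos[A1_dimension] > 0) {
      y_vals[i] += local;
    }
  }

The tail flush after the loop is the piece the snippet above is missing; without it the last row's partial sum stays in the register and never reaches y_vals.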

rohany commented 3 years ago

I don't see an easy way to do this with multiple threads. It looks like you need some way to check, after each parallel chunk, whether to flush the value of the private variable out to the destination location. Still, such a strategy would definitely help performance by reducing the number of atomic write instructions. Have you looked into something like this @stephenchouca?
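
To make the idea concrete, here is a rough sketch of the kind of strategy I'm imagining with OpenMP. The chunking, the find_row helper, and the function signature are all made up for illustration and are not something taco emits today:

  #include <stdint.h>
  #include <omp.h>

  // Hypothetical helper: binary search for the row i such that
  // A2_pos[i] <= fpos < A2_pos[i + 1].
  static int32_t find_row(const int32_t* A2_pos, int32_t num_rows, int32_t fpos) {
    int32_t lo = 0, hi = num_rows;
    while (lo < hi) {
      int32_t mid = lo + (hi - lo) / 2;
      if (A2_pos[mid + 1] <= fpos) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  void spmv_pos_split_parallel(int32_t A1_dimension, const int32_t* A2_pos,
                               const int32_t* A2_crd, const double* A_vals,
                               const double* x_vals, double* y_vals) {
    int32_t nnz = A2_pos[A1_dimension];
    #pragma omp parallel
    {
      int32_t nthreads = omp_get_num_threads();
      int32_t tid = omp_get_thread_num();
      int32_t chunk = (nnz + nthreads - 1) / nthreads;
      int32_t start = tid * chunk;
      int32_t end = (start + chunk < nnz) ? (start + chunk) : nnz;
      if (start < end) {
        // Each thread owns a contiguous range of nonzeros and finds the row
        // that contains its first one.
        int32_t i = find_row(A2_pos, A1_dimension, start);
        double local = 0.0;  // thread-private accumulator
        for (int32_t fposA = start; fposA < end; fposA++) {
          // Row boundary: flush the accumulator. Rows at the edges of a chunk
          // can be shared with a neighboring thread, hence the atomic.
          while (fposA == A2_pos[i + 1]) {
            #pragma omp atomic
            y_vals[i] += local;
            local = 0.0;
            i++;
          }
          local += A_vals[fposA] * x_vals[A2_crd[fposA]];
        }
        // Flush the last (possibly shared) row touched by this chunk.
        #pragma omp atomic
        y_vals[i] += local;
      }
    }
  }

With this structure the number of atomic writes drops from one per nonzero to one per row segment per thread; only the rows straddling chunk boundaries strictly need the atomics, so a smarter version could use plain stores for the interior rows.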