the goal of this optimization is to order the instructions of the stencil to increase the data locality and potentially reduce the register usage of complex stencils (we need to see how well this works and implement this pass step by step).
initially the pass shall put all stencil accesses at the beginning of the stencil. Additionally, we order the accesses by the j-dimension (and possibly k-dimension). Assuming we have accesses at j-1, j, and , j+1, we put all i-accesses that access j-1 at the beginning of the stencil, followed by all accesses of j, followed by all accesses of j+1.
After sorting the stencil accesses and putting them at the beginning I expect a higher register usage since memory accesses and computation are separated. If this effectively the case we should proceed and optimize the register scheduling.
To optimize, the register scheduling we move all computation forward as much as possible. That way the computation should be closer to the memory accesses which reduces the register pressure.
To further optimize the register pressure, we may write an analysis pass that for every value computes the access with the maximal j-offset. We can use this operation to move computation further upwards using arithmetic properties such as associativity or commutativity. This optimization maybe implemented using patterns. For example:
given the two operations c = a + b; d = c + d where b depends on j+1 while the other values depend only on j then we by rewrite to c = a + d; d = c + b. This rewrite allows us to move the first add operation further up before the j+1 accesses.
We should at some point discuss if something like this makes sense. Halide seems to be able to generate good code using these tricks... However, overall there is not much benefit
the goal of this optimization is to order the instructions of the stencil to increase the data locality and potentially reduce the register usage of complex stencils (we need to see how well this works and implement this pass step by step).
initially the pass shall put all stencil accesses at the beginning of the stencil. Additionally, we order the accesses by the j-dimension (and possibly k-dimension). Assuming we have accesses at j-1, j, and , j+1, we put all i-accesses that access j-1 at the beginning of the stencil, followed by all accesses of j, followed by all accesses of j+1.
After sorting the stencil accesses and putting them at the beginning I expect a higher register usage since memory accesses and computation are separated. If this effectively the case we should proceed and optimize the register scheduling.
To optimize, the register scheduling we move all computation forward as much as possible. That way the computation should be closer to the memory accesses which reduces the register pressure.
To further optimize the register pressure, we may write an analysis pass that for every value computes the access with the maximal j-offset. We can use this operation to move computation further upwards using arithmetic properties such as associativity or commutativity. This optimization maybe implemented using patterns. For example:
given the two operations c = a + b; d = c + d where b depends on j+1 while the other values depend only on j then we by rewrite to c = a + d; d = c + b. This rewrite allows us to move the first add operation further up before the j+1 accesses.