Open mattfel1 opened 6 years ago
(Not important, but this is vertical fusion in the usual loop-fusion terminology.)
By the way, the term I was looking for before was "loop perfection" (https://arxiv.org/pdf/1610.09405.pdf).
I want to start by catching an inner UnrolledReduce followed by a UnitPipe that reads the result and writes it to a banked memory. I think we need to either run this before memory analysis, so that we don't have to start recomputing buffers and deleting registers, or run it right before the flattening transformer and then add an extra pass to reanalyze everything.
In general, we can probably check the reaching writes to all memories in the inner controller (stage 1) relative to the inner unit pipe (stage 2), then merge the two bodies while obeying the dependencies?
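To make the reaching-writes idea concrete, here is a rough sketch in plain Python rather than in the compiler's own IR. Everything in it is hypothetical: `Op`, `reaching_write`, and `fuse` are names made up for illustration, memories are just strings, and "fusing" is modeled as concatenating the two op lists while recording which producer write each consumer read depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for an IR node: a name plus the memories it touches."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def reaching_write(producer_ops, mem):
    """Return the last op in producer_ops that writes mem, or None if there is none."""
    for op in reversed(producer_ops):
        if mem in op.writes:
            return op
    return None


def fuse(producer_ops, consumer_ops):
    """Merge two stages into one body.

    Producer ops stay ahead of consumer ops, so every recorded dependency
    (writer, reader, memory) is respected by the fused order.
    """
    deps = []
    for consumer in consumer_ops:
        for mem in sorted(consumer.reads):
            writer = reaching_write(producer_ops, mem)
            if writer is not None:
                deps.append((writer.name, consumer.name, mem))
    return list(producer_ops) + list(consumer_ops), deps


# Stage 1: the inner reduce. Stage 2: the unit pipe that stores the result.
reduce_stage = [
    Op("load", reads=frozenset({"in"})),
    Op("accum", reads=frozenset({"in"}), writes=frozenset({"acc"})),
]
store_stage = [Op("store", reads=frozenset({"acc"}), writes=frozenset({"sram"}))]

fused_ops, deps = fuse(reduce_stage, store_stage)
print([op.name for op in fused_ops])
print(deps)
```

A real pass would also have to handle write-after-read hazards and multiple writers per memory; this sketch only tracks the read-after-write case described above.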
Doing this fusion before memory analysis seems like a good idea to me. My only concern was losing the information that an operation occurs exactly once during the loop, rather than an arbitrary (data-dependent) number of times, but we can keep that information as metadata if we need it. In that case this seems strictly better.
Maybe we can keep it behind a flag to begin with, to make sure it doesn't increase memory costs in any unexpected ways?
In cases like
We get an outer Foreach with a Reduce stage and a unit pipe for the SRAM store. We can optimize this by putting a predicated SRAM store in the same body as the Reduce stage, so that the banking calculation can be resolved in parallel and we don't have to pay the full latency of the store in its own stage.
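As a stand-in for the pattern described above, here is a small plain-Python model (not Spatial code) of the two schedules. The function names `unfused` and `fused` and the list-of-rows input are assumptions made up for illustration. It only checks that the two forms compute the same result; it says nothing about latency or banking.

```python
def unfused(data):
    """Two stages per outer iteration: a reduce, then a separate store."""
    sram = [0] * len(data)
    for i, row in enumerate(data):      # outer Foreach
        acc = 0
        for x in row:                   # stage 1: Reduce
            acc += x
        sram[i] = acc                   # stage 2: unit pipe doing the SRAM store
    return sram


def fused(data):
    """One stage per outer iteration: the store sits in the reduce body, predicated."""
    sram = [0] * len(data)
    for i, row in enumerate(data):      # outer Foreach
        acc = 0
        last = len(row) - 1
        for j, x in enumerate(row):     # single fused body
            acc += x
            if j == last:               # predicate: only the final iteration stores
                sram[i] = acc
    return sram


data = [[1, 2, 3], [4, 5], [6]]
print(unfused(data))
print(fused(data))
```

Note that the fused form only stores if the reduce runs at least once. With an empty row the two versions agree here only because `sram` starts zeroed, which is the "occurs exactly once" information mentioned earlier in the thread.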