stanford-ppl / spatial

Spatial: "Specify Parameterized Accelerators Through Inordinately Abstract Language"

https://spatial.stanford.edu

MIT License

274 stars 32 forks

Loop Perfecter #99

Open mattfel1 opened 6 years ago

mattfel1 commented 6 years ago

In cases like

Foreach(...){i => 
  mem(i) = Reduce(Reg[Int])(...){...}{...}
}

We get an outer Foreach with a Reduce stage and a unit pipe for the SRAM store. We can optimize this by putting a predicated SRAM store in the same body as the Reduce stage, so that the banking calculation can be resolved in parallel with the reduction and the store doesn't pay its full latency in a stage of its own.
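
To make the proposed rewrite concrete, here is a minimal sketch in plain Scala (not Spatial) that models the two schedules on ordinary arrays. The names `LoopPerfectionSketch`, `twoStage`, and `perfected` are hypothetical, and summing each row of a 2-D array is only an assumed stand-in for the elided Reduce arguments. It shows that a store predicated on the last inner iteration leaves the same memory contents as a separate store stage; it does not model banking or latency.

```scala
object LoopPerfectionSketch {
  // Models the current schedule: per outer iteration, a reduce stage
  // followed by a separate stage that stores the result.
  def twoStage(data: Array[Array[Int]]): Array[Int] = {
    val mem = new Array[Int](data.length)
    for (i <- data.indices) {
      var acc = 0                                   // stands in for Reg[Int]
      for (j <- data(i).indices) acc += data(i)(j)  // stage 1: reduce
      mem(i) = acc                                  // stage 2: store
    }
    mem
  }

  // Models the perfected schedule: the store sits in the reduce body
  // and is predicated on the final inner iteration.
  def perfected(data: Array[Array[Int]]): Array[Int] = {
    val mem = new Array[Int](data.length)
    for (i <- data.indices) {
      var acc = 0
      val n = data(i).length
      for (j <- 0 until n) {
        acc += data(i)(j)
        val isLast = j == n - 1
        if (isLast) mem(i) = acc                    // predicated store
      }
    }
    mem
  }
}
```

Note that the predicated version only stores when the inner loop runs at least once, so the two schedules agree here only because `mem` starts zeroed; a real transform would need to account for empty inner loops.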

dkoeplin commented 6 years ago

(Not important, but this is vertical fusion by the usual loop fusion terminology.)

dkoeplin commented 6 years ago

By the way, the term I was looking for before was "loop perfection" (https://arxiv.org/pdf/1610.09405.pdf).

mattfel1 commented 6 years ago

I want to start by catching an inner UnrolledReduce followed by a UnitPipe that reads the result and writes it to a banked memory. I think we need to either put this before memory analysis, so that we don't have to start recomputing buffers and deleting registers, or put it right before the flattening transformer and add an extra pass to reanalyze everything.

In general, we can probably check the reaching writes to all memories in inner control 1 relative to inner unit pipe 2, then fuse the two bodies while obeying those dependencies?
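
The check described above could look roughly like the following plain-Scala sketch. Everything here is hypothetical: `Stage`, `Read`, `Write`, `reachingWrites`, and `fuse` are illustrative stand-ins, not Spatial's IR or analysis API, and memories are identified by name only. It finds the memories written by the first controller and read by the second, then fuses by keeping program order, which is the simplest way to respect those dependencies.

```scala
object ReachingWriteSketch {
  // Hypothetical stand-ins for IR accesses; not Spatial's actual classes.
  sealed trait Access { def mem: String }
  final case class Write(mem: String) extends Access
  final case class Read(mem: String) extends Access

  final case class Stage(name: String, accesses: List[Access])

  // Memories written in `first` that `second` reads. The fused body must
  // keep each such write ahead of the corresponding read.
  def reachingWrites(first: Stage, second: Stage): Set[String] = {
    val written = first.accesses.collect { case Write(m) => m }.toSet
    val read = second.accesses.collect { case Read(m) => m }.toSet
    written.intersect(read)
  }

  // Fuse by concatenating accesses in program order, which trivially
  // preserves every write-before-read dependency found above.
  def fuse(first: Stage, second: Stage): Stage =
    Stage(s"${first.name}+${second.name}", first.accesses ++ second.accesses)
}
```

For the Reduce-then-store case, the only dependency this would report is the accumulator register that the reduce writes and the unit pipe reads.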

dkoeplin commented 6 years ago

Doing this fusion before memory analysis seems like a good idea to me. My only fear was losing the information that an operation occurs exactly once during the loop, rather than an arbitrary (data-dependent) number of times, but we can keep that information as metadata if we need it. In that case this seems strictly better.

Maybe we can keep it behind a flag to begin with, to make sure it doesn't increase memory costs anywhere in unexpected ways?
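
As a rough illustration of the flag and metadata ideas above, here is a small plain-Scala sketch. The option name `enableLoopPerfection`, the `StoreMeta` record, and the `Schedule` types are all assumptions made up for this example; none of them are existing Spatial options or metadata. The fusion is off by default, and when it is on, the fused store carries a marker saying it fires exactly once per outer iteration.

```scala
object PerfectionFlagSketch {
  // Hypothetical compiler options; the flag name is an assumption.
  final case class Options(enableLoopPerfection: Boolean = false)

  // Hypothetical metadata kept on the fused store so later passes still
  // know it executes exactly once per outer iteration.
  final case class StoreMeta(predicated: Boolean, oncePerIteration: Boolean)

  sealed trait Schedule
  case object TwoStage extends Schedule                    // Reduce stage + unit pipe
  final case class Fused(store: StoreMeta) extends Schedule

  // Only apply the fusion when the flag is set.
  def schedule(opts: Options): Schedule =
    if (opts.enableLoopPerfection)
      Fused(StoreMeta(predicated = true, oncePerIteration = true))
    else
      TwoStage
}
```

Keeping the default off would let memory costs be compared with and without the fusion on the same apps before enabling it everywhere.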