stanford-ppl / spatial

Spatial: "Specify Parameterized Accelerators Through Inordinately Abstract Language"
https://spatial.stanford.edu
MIT License

Pipeline of Parallels vs Parallel of Pipelines for Unrolling #152

Closed mattfel1 closed 5 years ago

mattfel1 commented 6 years ago

Not sure if this is something we ever talked about or thought about already.

The high-level summary is that we currently unroll into a Pipeline of Parallels, and I think that in at least some cases performance would be better if we unrolled into a Parallel of Pipelines.

For example:

Foreach(N by ts par 2){ i =>
  sram load dram
  Foreach(...){ j => /* process sram */ }
}

Since this unrolls into a Parallel containing the two loads, which is a sibling of the Parallel containing the two processing controllers, the two loads contend with each other every time they execute. If we were to unroll this into a Parallel where each child is a pipe with a load stage and a process stage, there would be some initial congestion, but the two unrolled bodies would then dephase and should contend less.
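To make the two shapes concrete, here is a rough hand-written sketch of the two unrolled structures for the example above (not actual compiler output; sram0/sram1 and the address ranges are just illustrative):

// Current unrolling: a metapipe of Parallels. Both loads sit in the same
// stage, so they always fight for DRAM bandwidth at the same time.
Foreach(N by 2*ts){ i =>
  Parallel {
    sram0 load dram(i :: i+ts)
    sram1 load dram(i+ts :: i+2*ts)
  }
  Parallel {
    Foreach(...){ j => /* process sram0 */ }
    Foreach(...){ j => /* process sram1 */ }
  }
}

// Proposed unrolling: a Parallel of metapipes. Each lane is its own
// load/process pipeline, so after the first iteration the lanes' loads
// naturally drift out of phase with each other.
Foreach(N by 2*ts){ i =>
  Parallel {
    Pipe { sram0 load dram(i :: i+ts);      Foreach(...){ j => /* process sram0 */ } }
    Pipe { sram1 load dram(i+ts :: i+2*ts); Foreach(...){ j => /* process sram1 */ } }
  }
}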

I guess this could be fixed implicitly if we go to token-based control, or does that still have the same hierarchy implicitly embedded in it? I haven't looked yet to see how hard it would be to modify the unroller to have the option to unroll the new way.

dkoeplin commented 6 years ago

Some initial thoughts: If I remember correctly, the banking logic currently assumes that a single stage within a metapipeline always completes together, even after outer loop parallelization, hence the syncs in the form of explicit Parallels.

I think the token-based control is the ideal case, as things flow more smoothly in general. I think it does, however, make the order of execution slightly harder to reason about, especially when there are data-dependent conditionals, which could in turn make things like broadcasting across outer loops harder to guarantee as safe.

mattfel1 commented 6 years ago

I expanded the definition and usages of lockstep before unrolling. Maybe I need to think about it more, but it seems like we can still do the pre-unrolling analyses correctly as long as these lockstep rules are updated and are correct for this situation.

If moving to token-based control means treating everything like a stream controller, I can see how it would be very hard to reason about broadcasting and banking. If it's possible to start by making the token passing match the same control-flow rules as the current hierarchical scheme, I think it could be a gentler transition, and potentially it would be easy to maintain both truly token-based control and normal (current) control implemented-with-tokens™. We would be able to at least share all the templates this way and save ourselves from crazy debug messes.

So without committing myself to any massive overhauls just yet, I think we may be able to try moving to token-based control independently of trying to do Pipe of Pars -> Par of Pipes unrolling.

mattfel1 commented 5 years ago

I'm going to mark this one as completed now. There is now a --pom flag that unrolls all outer controllers as POM (parallel of metapipes). The default is still metapipe of parallels, and there is also a --mop flag to request that explicitly. You can also annotate individual controllers with Pipe.MOP.Foreach or Pipe.POM.Foreach, which overrides the --pom/--mop flag for that specific controller.
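For reference, the per-controller annotations look roughly like this, reusing the example from the top of the thread (treat it as a sketch rather than a tested app):

// Force POM unrolling for this controller, regardless of the global flag.
Pipe.POM.Foreach(N by ts par 2){ i =>
  sram load dram
  Foreach(...){ j => /* process sram */ }
}

// Force MOP unrolling (the old behavior) for this controller.
Pipe.MOP.Foreach(N by ts par 2){ i =>
  sram load dram
  Foreach(...){ j => /* process sram */ }
}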

The thing that is missing is that this only works for Foreach controllers. It is harder to unroll Reduce and MemReduce as POM because they need synchronization between all of the unrolled lanes. The easy solution is to just convert them into stream controllers with ~2-4 elements in the communication FIFOs, but it feels morally wrong to rewrite metapipes as stream controllers inside the compiler. For now you can rewrite Reduce and MemReduce as Stream controllers yourself and do the POM unrolling by hand, roughly as sketched below.
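Something along these lines is what I mean by doing it by hand. This is only a sketch for a simple sum reduction with two lanes, and every name in it (accum, partials0/1, sram0/1, dram, N, ts) is illustrative rather than taken from a real app:

// Manual "POM" version of  Reduce(accum)(N by ts par 2){ ... }{_+_} :
// each lane is its own load/reduce pipeline that pushes partial sums into a
// shallow FIFO, and a final stage drains both FIFOs to update the accumulator.
val accum     = Reg[Int](0)
val partials0 = FIFO[Int](4)
val partials1 = FIFO[Int](4)
Stream {
  Foreach(N by 2*ts){ i =>
    val sram0 = SRAM[Int](ts)
    sram0 load dram(i :: i+ts)
    val partial = Reduce(Reg[Int](0))(ts by 1){ j => sram0(j) }{_+_}
    partials0.enq(partial.value)
  }
  Foreach(N by 2*ts){ i =>
    val sram1 = SRAM[Int](ts)
    sram1 load dram(i+ts :: i+2*ts)
    val partial = Reduce(Reg[Int](0))(ts by 1){ j => sram1(j) }{_+_}
    partials1.enq(partial.value)
  }
  Foreach(N by 2*ts){ _ =>
    accum := accum.value + partials0.deq() + partials1.deq()
  }
}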

Incidentally, I noticed that the lockstep check was assuming pom (it was looking at the cchains of all sibling controllers even if the parent is going to be unrolled). This is fixed now and does the correct thing for pom and mop. There were also some other things I fixed with broadcasting (see BroadcastStressTest app).

Overall it seems like every app in our regression suite that has the opportunity to use POM is somewhere between slightly and significantly faster, and I imagine there are some industrial apps outside our set that will benefit a lot. If there are no tile transfers, at least a little bit of control flow overhead is cut out. If there are tile transfers, they will dephase and automatically organize themselves to minimize congestion. I'm not sure how it impacts area, though. I assume that apps with good broadcasting opportunities will use the same area as before under POM, and will use less under MOP now that the issue I mentioned above is fixed.