XLS: Accelerated HW Synthesis
http://google.github.io/xls/
Apache License 2.0

Delay model "2.0" #370

Open cdleary opened 3 years ago

cdleary commented 3 years ago

This is somewhat of a scoping / overarching issue. We want to improve our delay model -- arguably it already does, with mathematical precision (and explanatory power), what an RTL designer does ad hoc in their head when designing pipelines, which gives us some baseline advantage for scheduling across cycle boundaries. However, it still has a fair amount of imprecision, especially because operation delays do not combine linearly within a cycle once Boolean optimizations run on them, and because transformations like path duplication happen on critical paths within the cycle to fit a given delay target (generally done by the RTL synthesis part of the toolchain).

Fundamentally, to improve the delay model we probably want to stand up a data-gathering mechanism that we can throw lots of interesting input patterns at (for a target node) in parallel, characterize the node, and fit the results to some underlying model. We have some of this with the work @meheff has done on the new delay characterization utilities. Even if you did something really fancy like an ML GraphNet, you'd still need this critical data-collection piece of the puzzle.

https://github.com/google/xls/blob/20e6aa68efa9165504f09b8c165031727ed55545/xls/synthesis/synthesis.proto#L21
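To make the idea concrete, here's a minimal sketch (in Python, not existing XLS code) of what that data-gathering loop plus model fit could look like. `synthesize_and_measure` is a hypothetical callback standing in for whatever backend flow actually reports a measured delay, and the log-shaped model is just one plausible choice.

```python
# Hedged sketch of the characterization loop: sample an op at many widths,
# then fit a simple delay ~ a + b*log2(width) model to the measurements.
import numpy as np


def characterize_op(op_name, widths, synthesize_and_measure):
    """Collects (width, delay_ps) samples for one target operation."""
    samples = []
    for width in widths:
        delay_ps = synthesize_and_measure(op_name, width)  # backend-provided
        samples.append((width, delay_ps))
    return samples


def fit_log_model(samples):
    """Least-squares fit of delay_ps ~= a + b * log2(width)."""
    widths = np.array([w for w, _ in samples], dtype=float)
    delays = np.array([d for _, d in samples], dtype=float)
    design = np.stack([np.ones_like(widths), np.log2(widths)], axis=1)
    (a, b), *_ = np.linalg.lstsq(design, delays, rcond=None)
    return a, b


def estimate_delay_ps(a, b, width):
    """Evaluates the fitted model at an arbitrary width."""
    return a + b * np.log2(width)
```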

Between the SkyWater and ASAP7 PDKs and the SymbiFlow toolchain rules, this will be easier to develop quickly and in open source -- once it's stood up and tested with the open flow, the same mechanisms can be pointed at whatever downstream RTL synthesis flows users have. We already have the "synthesis server" abstraction we can talk to over gRPC, where we can swap out "providers" (i.e. how exactly the backend flows work). This will also take us from our current "unit" / Lattice-part delay model examples to a demonstrated ASIC process that could actually be taped out.
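For illustration, a hedged sketch of what a client of that gRPC synthesis server could look like. The generated-module and field names below (`CompileRequest.module_text`, `target_frequency_hz`, `CompileResponse.max_frequency_hz`) are paraphrased from memory of synthesis.proto and may not match the checked-in schema exactly; treat this as illustrative, not as the actual API.

```python
# Illustrative client for a running synthesis-server "provider"; assumes a
# server is listening locally and that the proto field names below exist.
import grpc

from xls.synthesis import synthesis_pb2
from xls.synthesis import synthesis_service_pb2_grpc


def max_frequency_for(verilog_text: str, top: str, port: int = 10000) -> float:
    """Asks the synthesis server to compile a module and report achievable Fmax."""
    channel = grpc.insecure_channel(f"localhost:{port}")
    stub = synthesis_service_pb2_grpc.SynthesisServiceStub(channel)
    request = synthesis_pb2.CompileRequest(
        module_text=verilog_text,
        top_module_name=top,
        target_frequency_hz=1_000_000_000,  # ask for 1 GHz, read back what closes
    )
    response = stub.Compile(request)
    return response.max_frequency_hz
```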

Sub-models: At a technical level, with easier use of the backend flows we can investigate sub-models for operations that take a natural tree-of-cells form at some bit width. I suspect that with sufficient bit width you don't need to worry about logical effort as much, because the variation in drive strength of the cells smooths out over the depth of the tree.
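As a strawman of such a sub-model, here's a toy estimator for an operation that maps onto a balanced tree of cells (e.g. a reduction); the per-stage and overhead constants are placeholders rather than characterized values, and the "constant per-stage delay" assumption encodes the smoothing intuition above.

```python
# Toy sub-model for a tree-of-cells operation: depth grows with log2(width),
# and each level is assumed to contribute a roughly constant delay once the
# tree is deep enough. Constants are illustrative placeholders.
import math


def tree_op_delay_ps(bit_width: int,
                     per_stage_delay_ps: float = 35.0,
                     fixed_overhead_ps: float = 20.0) -> float:
    """Estimates the delay of a binary-tree-shaped op at a given bit width."""
    depth = math.ceil(math.log2(max(bit_width, 2)))
    return fixed_overhead_ps + depth * per_stage_delay_ps
```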

Successive refinement from a first cut: There's also a notion of "gearing" or "successive refinement" here, where we spit out something quickly and then refine the estimates by running things in the background -- e.g. you pop out circuit C_0, and two weeks later you fold in the concurrently collected numbers to make a C_1 that's much more efficient.
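A rough sketch of that gearing loop follows; `schedule`, `collect_measurements`, and `refit` are hypothetical hooks passed in by the caller, not existing XLS entry points.

```python
# Sketch of successive refinement: ship a circuit scheduled with the coarse
# model now, keep characterizing against the real backend in the background,
# then reschedule once better numbers land (C_0, C_1, ...).
def successive_refinement(ir, coarse_model, schedule, collect_measurements,
                          refit, generations=2):
    model = coarse_model
    circuits = []
    for _ in range(generations):
        circuit = schedule(ir, delay_model=model)  # usable immediately
        circuits.append(circuit)
        # Background characterization; in practice this can run for weeks
        # while the already-emitted circuit is in use.
        measurements = collect_measurements(circuit)
        model = refit(model, measurements)
    return circuits
```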

Big picture: There's a fundamental premise in HLS land: create a delay model we can reasonably abstract (above the cell-mapping level) so we get quick feedback and make more progress than an RTL designer would with a mental cut at estimating across cycles, or find a key advantage by leveraging higher-level information as much as possible (e.g. thinking in terms of bit-lines or other parallel patterns). If the designer is following some algorithm or heuristic, you can try to imbue that into the delay model analysis, but you can also give them the tools to massage the outcome if it feels counter-intuitive in ways that are difficult to describe/encode algorithmically. Those sorts of possibilities always remind me of computer-aided chess. Also, with fast backend tools (like those that look potentially obtainable with some of the open PDKs) we can iterate against the full STA flow to refine our candidate circuit under the precondition of "closes timing". It's clearly a rich space of possibilities!
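The "closes timing" iteration mentioned above might look roughly like the loop below; `schedule_for_period` and `run_sta` are hypothetical stand-ins for the scheduler and a backend STA run, and the linear relaxation policy is just one simple choice.

```python
# Sketch of iterating against the full STA flow: reschedule at a candidate
# clock period, check reported slack, and relax the period until timing closes.
def close_timing(ir, schedule_for_period, run_sta,
                 start_period_ps=1000.0, step_ps=50.0, max_iters=20):
    period_ps = start_period_ps
    for _ in range(max_iters):
        circuit = schedule_for_period(ir, period_ps)
        slack_ps = run_sta(circuit, period_ps)
        if slack_ps >= 0:
            return circuit, period_ps  # timing closed at this period
        period_ps += step_ps           # relax the clock and retry
    raise RuntimeError("did not close timing within max_iters")
```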

sedaogrenci commented 3 years ago

Some initial thoughts:

Use patterns (the sub-models mentioned above) - characterize coarse-grained patterns extensively and tolerate some error at the boundaries.

Is it possible to identify affinity groups (operations/patterns that always tend to be blended together during minimization by the logic synthesis tool) versus incompatible types (if they exist?) that tend to maintain "uncrossable" boundaries even when the synthesis tool pushes hard to fix timing?

Operations/pins that are connected to special signals (enable, control, interrupt), which for one reason or another dictate "predictable" timing outcomes, or at least predictable critical paths...

The literature on delay estimation for FPGAs is rich; are there things we can borrow for ASICs, especially from the more heterogeneous devices that deal with variable-width DSPs, etc.?

A recent paper I came across talks about learning from patterns within microbenchmarks and then transferring that knowledge to an arbitrary structure. Would this help us learn some mapping/minimization patterns and their relationship to delay?