integrating SDC-based pipeline scheduling method into XLS?

google / xls

XLS: Accelerated HW Synthesis

http://google.github.io/xls/

Apache License 2.0

1.19k stars 174 forks source link

integrating SDC-based pipeline scheduling method into XLS? #615

Closed zihaoli-cn closed 2 years ago

zihaoli-cn commented 2 years ago

SDC-based scheduling [1] is an HLS scheduling method using Linear Programming(Not the ILP). Although using real-number variables, it is guaranteed to get integer solutions, because the constraint matrix is totally unimodular. It can be solved efficiently.

I read the code in the scheduling part in XLS. The objective is to minimize the pipeline registers.

As for the objective function, it is very similar to [2]'s 4.2 part, "Minimum-Lifetime Schedule over SDC". It introduces the life-time variables to the model. For a value v in XLS IR, bit-count(v) * life-time(v) is just its pipeline registers usage.
As for the cycle period time constraints, section 3.2.2 in [1] uses "critical combinational path" to formally describe it. But I don't know how to implement it efficiently. It needs to compute critical paths between any pair of nodes.

I am interested in integrating the SDC-based scheduling method into XLS and comparing the result and running time with XLS's original algorithm. We can use OR-Tools as the solver.

[1] Cong, Jason, and Zhiru Zhang. "An efficient and versatile scheduling algorithm based on SDC formulation." 2006 43rd ACM/IEEE Design Automation Conference. IEEE, 2006. [2] Zhang, Zhiru, and Bin Liu. "SDC-based modulo scheduling for pipeline synthesis." 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2013.

taktoa commented 2 years ago

We have discussed using SDC in our scheduler, especially because we are interested in implementing exact cycle constraints on send/receive nodes for communication with fixed-latency IPs like SRAMs. The current scheduler optimizes for number of registers, number of pipeline stages, and critical combinational path length.

Optimizing for number of registers in the SDC formulation is easy; if the (minimized) objective is c dot x then for every edge in the dataflow graph betwen node i and node j with bitwidth w, you subtract w from c[i] and add w to c[j], which is equivalent to adding w * (x[j] - x[i]) to the objective function (and the edge itself induces a timing constraint such that x[j] >= x[i]).

Optimizing for number of pipeline stages is also easy. You add an extra variable n as well as constraints that for all i, n - x[i] >= 0. Then if n is part of the objective function, it will be the least upper bound of the stage numbers of all nodes, i.e.: the number of stages.

I haven't been able to figure out how to optimize for critical combinational path length without introducing exponentially many variables/constraints. It may be somewhere in those papers; I haven't looked at them in a while. It feels like it would be difficult to encode in the SDC formalism, but maybe there's a clever way I haven't thought of.

taktoa commented 2 years ago

Inspired by reading citation [2], it seems that a simple way to implement the critical combinational path constraint is by finding, for every pair of nodes i and j, the critical path between i and j. If that number is greater than the clock period, then x[j] - x[i] >= 1. Then you can binary search the clock period (running the LP solver a logarithmic number of times).

@meheff probably has opinions around whether we would accept a pull request implementing this.

zihaoli-cn commented 2 years ago

@taktoa Thank you for replying~

I need to clarify a point. To optimize the total pipeline registers, directly using w * (x[j] - x[i]) will not produce the correct objective value. If node i has multiple users, the node i needs to be saved into the register until the last user consumes it.

So, in the citation [2], it introduces a lifetime variable for each node. It is just the duration between the node finishes until the last user consumes it. If node j uses node i, the constraint will be x[j] - x[i] -delay[i] <= lifetime[i].

meheff commented 2 years ago

The large number and size of the dependencies this pulls in makes me reluctant to to accept this. However, if this approach turns out to offer significant advantages over the existing technique I can certainly change mind. I would definitely be interested in seeing how well this SDC based scheduler works. The existing technique does not guarantee optimality with respect to the number of registers required to implement the pipeline. The underlying algorithm uses a sequence of mincuts (one for each boundary between pipeline stages) to choose where the pipeline registers lie. Each invocation of the mincut algorithm determines the minimum number of registers for that stage boundary given the scheduling decisions made so far, but global optimality (sum of pipeline registers across all boundaries) is not guaranteed. That is, the approach is only piecewise optimal.

zihaoli-cn commented 2 years ago

Hi @meheff . Thanks for replying.

I read the code of the min cut-based scheduling algorithm. By doing a series of min cut, each node's scheduling bound will gradually shrink into a smaller and smaller interval. Finally, the length of each node's scheduling bound will be one. I really appreciate the elegant idea of the algorithm design. And I learned a lot from it.

After receiving Remy's review, I reimplemented the PR. There will be much fewer dependencies.

SDC-scheduling is an influential class of scheduling methods. I noticed that the citation [1] entered the 2022's TCFPGA Hall of Fame. I am quite interested to see whether its actual performance can live up to its fame.

Scheduling problems mainly include timing constraints and resource constraints. The existing scheduling method can handle the timing constraint quite well. By the way, are you considering adding constraints on resources in the future? For example, certain types of computing resources and on-chip area are limited. Does this kind of demand exist in your scenario?

meheff commented 2 years ago

We haven't considered adding resource constraints to the scheduler. Generally, in XLS the operations in the IR (e.g, adds, multiplies) are the resources. So an add in the XLS IR maps to an add in verilog which becomes an adder in hardware. So reuse of resources becomes a transformation of the IR. For example, to have two XLS IR add operations share an adder would require transforming the two add operations into a single one with the different operands muxing in. This approach decouples the resource allocation and scheduling problems. However, you could imagine a more capable scheduler which also performed resource reuse at the same time. You'd have to also model the additional cost of the muxes for switching in different operand sets.