google / xls

XLS: Accelerated HW Synthesis
http://google.github.io/xls/
Apache License 2.0
1.2k stars 174 forks source link

Optimize chain of selects with similar cases #538

Open meheff opened 2 years ago

meheff commented 2 years ago

The targeted pattern looks something like:

s0 = sel(p0, cases={Y, f(x0)})
s1 = sel(p1, cases={s0, f(x1)})
s2 = sel(p2, cases={s1, f(x2)})

Where f is some expression. In this case we want to transform the code from selecting among the various outputs of the simliiar expressions to selecting the inputs to the expression. So transform the above to something like:

tmp0 = sel(p1, cases={x0, x1})
tmp1 = sel(p2, cases={tmp0, x2})
tmp3 = f(tmp1)
s2 = sel(p0|p1|p2, cases={Y, tmp3})

If x0/x1/x2 is narrower than s0/s1/s2 this reduces the size of the muxes and also reduces the delay if the select chain s0->s1->s2 is on the critical path (which select chains often are).

Example from a benchmark:

Y = ...
a0 = concat(Y[0:126], bits[2]:0)
s0 = sel(p0, cases={Y, a0)
a1 = concat(Y[0:126], bits[2]:1)
s1 = sel(p1, cases={s0, a1})
a2 = concat(Y[0:126], bits[2]:2)
s2 = sel(p2, cases={s1, a2})

This could be trasnformed into:

Y = ...
tmp0 = sel(p1, cases={bits[2]:0, bits[2]:1})
tmp1 = sel(p2, cases={tmp0, bits[2]:2})
a = concat(Y[0:126], tmp1)
s2 = sel(p0|p1|p2, cases={Y, a})

In this case we're select among three 2-bit constants rather than a chain of 128-bit selects.

The complexity here is in matching the similar expressions (f in the example). This could be done via a traversal starting with a couple of the the selects at the bottom to construct the pattern to match. The matching would identify three different possibilities in the pattern:

  1. identical operations in the different expressions. The different expressions have the same node in the same location. In the benchmark example this is the bit slice of Y: Y[0:126]

  2. equivalent operations in the different expressions. The different expressions have equivalent nodes in the same location. In the benchmark example this is the concat operation.

  3. Variable operations which are different in each case. In the benchmar example this is: bits[2]:0, bits[2]:1, and bits[2]:2.

We probably can start with something very limited (very small number of nodes in matching expressions). Maybe start the matching with the bottom two selects in the chain to build up the pattern to match, then match as far up the chain as possible. This analysis is non-trivial and would require some thought to minimize complexity and keep things relatively fast.

taktoa commented 2 years ago

I just realized that this can eventually be just a particular type of predicate in the predicated nodes stuff. As you point out though, it can be more efficient to share an entire small region of nodes than each pair individually, so that efficiency calculation should probably factor into the decisions of which nodes to share and how to share them.