llvm / circt

Circuit IR Compilers and Tools
https://circt.org
Other
1.62k stars 283 forks source link

[DC] Connecting up DC during incremental lowering #5696

Open teqdruid opened 1 year ago

teqdruid commented 1 year ago

I'm working on lowering to DC and I'd like to make a set of passes which do so incrementally rather than having to do it monolithically in one shot.

A real example:

// Original code
ibis.class @C {
  ibis.method @math(%a: i32, %b: i32) -> i32 {
    %1 = comb.add bin %a, %b : i32
    ibis.return %1 : i32
  }
}

ibis.class @User {
  ibis.instance @c, @C
  ibis.method @wrapper(%a: i32, %b: i32) -> i32 {
    %x = ibis.call @c::@math(%a, %b): (i32, i32) -> i32
    ibis.return %x : i32
  }
}

The ibis-call-pass as a first step converts the arguments into a single struct:

  ibis.class @C {
    ibis.method @math(%arg: !hw.struct<a: i32, b: i32>) -> i32 {
      %a, %b = hw.struct_explode %arg : !hw.struct<a: i32, b: i32>
      %0 = comb.add bin %a, %b : i32
      ibis.return %0 : i32
    }
  }
  ibis.class @User {
    ibis.instance @c, @C
    ibis.method @wrapper(%arg: !hw.struct<a: i32, b: i32>) -> i32 {
      %a, %b = hw.struct_explode %arg : !hw.struct<a: i32, b: i32>
      %0 = hw.struct_create (%a, %b) : !hw.struct<a: i32, b: i32>
      %1 = ibis.call @c::@math(%0) : (!hw.struct<a: i32, b: i32>) -> i32
      ibis.return %1 : i32
    }
  }

Notice that it only affects the method boundaries -- doesn't touch or analyze the comb.add.

The next step is to DC-ify the calls/methods, meaning wrap the method arguments with dc.values. Ideally, I would like to do this incrementally like I do with the last step -- changing the method signatures and add dc.pack/dc.unpack, so we don't have to monolithically convert all the control flow over to DC. Those two operations, however, consume/produce a token. Wiring the tokens together is not necessarily the right thing to do -- the control flow may dictate something different. Also, where do I get the token for the call's pack/unpack?

To demonstrate:

// class C omitted
ibis.class @User {
  ibis.instance @c, @C
  ibis.method @wrapper(%arg: !dc.value<!hw.struct<a: i32, b: i32>>) -> !dc.value<i32> {
    // Unwrap the args.
    %inputToken, %unwrapped = dc.unpack %arg : !dc.value<!hw.struct<a: i32, b: i32>>
      // What should I do with %inputToken?
    %a, %b = hw.struct_explode %unwrapped : !hw.struct<a: i32, b: i32>

    // Wrap for the method call.
    %0 = hw.struct_create (%a, %b) : !hw.struct<a: i32, b: i32>
    %tokenForCall = ... // Where should I get this?
    %valueForCall = dc.pack %tokenForCall, %0 : !hw.struct<a: i32, b: i32>
    %1 = ibis.call @c::@math(%0) : (!dc.value<!hw.struct<a: i32, b: i32>>) -> !dc.value<i32>
    // Unpack the result.
    %callReturnToken, %retData = dc.unpack %1 : !dc.value<i32>
      // What should I do with %callReturnToken?

    // Pack the result.
    %methodReturnToken = ... // where should I get this Value?
    %ret = dc.pack %methodReturnToken, %retData : i32

    ibis.return %ret : !dc.value<i32>
  }
}

Suggestions? Perhaps the answer is to do all the control flow to DC in one pass. I'd rather not do that since it requires that one pass have knowledge of all the control flow constructs (and we're planning to add some custom control flow to ibis) so it's not just CF.

I'll post my suggested options as comments below so as to not frame the discussion as a selection between my options.

teqdruid commented 1 year ago

Option 1:

Create a set of ops to sink and source tokens. These would have to be lowered away at some point in a later lowering pass. They would semantically be scoped to the function and be defined as producing a token when the function is run and only returning when all of the sink'd tokens are present.

Option 2:

Wire up things independently of control flow. Feed the %inputToken to every dc.pack in the method. The "return pack" would be a dc.join of all the unpacked tokens.

Option 3:

Much like in #5477's Option 2, create an explicit unit-rate actor block and stash all the logic in there. Said op would also have to provide an "execute" token and have some way of sinking/joining "done" tokens so that I have some way of packing/unpacking method call arguments.

Option 4:

Make the ibis.method op a CFGSSA region (which I think matches the language semantics) and treat each operation as a separate unit-rate actor. This wouldn't work on its own -- operations which contain blocks would have to be treated separately.

Overall theme

I think the first 3 of my proposed options would essentially imply unit-rate semantics, which seems a reasonable semantic to me at this point in the lowering process. Note that they are not entirely mutually exclusive.

I'm leaning towards options 3 and 4. Option 3 would solve the pipeline interface problem as well. Or at least push it down. For option 4, something has to know about the control flow semantics of ibis.method and I'm not opposed to introducing that logic here. I'm also imagining that further lowerings which understand the control flow ops could break up the unit rate block, allowing incremental lowering amongst them as well. Perhaps the block which would be introduced in option 3 would have SSACFG semantics so a different pass could use that control flow data to break up the unit op.

mortbopet commented 1 year ago

I'm progresively disliking option 3 more an more - I'd like to see DC to be strictly a control language, where other dialects can interface with DC by using DC types. Having a unit-rate actor op with an inner block puts a lot of load on that DC op to contain data-related stuff within it. Especially given that a unit-rate actor can be constructed from other DC ops + e.g. ESI valid/ready wrap-unwrapping.

option 4: (i suppose you mean ibis.method here). This very much smells like Handshake, and the approach taken in standard-to-handshake (i.e. dhls). The issue here is identical to DHLS in that there may be operations within the function which will affect the control flow, as well as having to be performed in-order (i.e. two function calls after one another has a strict ordering, which implies that validity and backpressure must be wired through them according to this order).

Wiring the tokens together is not necessarily the right thing to do -- the control flow may dictate something different

Why not? This is the view i have of DC/handshake/dhls. In your example above, there is no branching control flow, so the IR could be converted verbatim to

  ibis.method @wrapper(%arg: !dc.value<!hw.struct<a: i32, b: i32>>) -> !dc.value<i32> {
    // Unwrap the args.
    %inputToken, %unwrapped = dc.unpack %arg : !dc.value<!hw.struct<a: i32, b: i32>>
      // What should I do with %inputToken?
    %a, %b = hw.struct_explode %unwrapped : !hw.struct<a: i32, b: i32>

    // Wrap for the method call.
    %0 = hw.struct_create (%a, %b) : !hw.struct<a: i32, b: i32>
    %valueForCall = dc.pack %inputToken, %0 : !hw.struct<a: i32, b: i32>
    %1 = ibis.call @c::@math(%0) : (!dc.value<!hw.struct<a: i32, b: i32>>) -> !dc.value<i32>
    // Unpack the result.
    %callReturnToken, %retData = dc.unpack %1 : !dc.value<i32>

    // Pack the result.
    %ret = dc.pack %callReturnToken, %retData : i32
    ibis.return %ret : !dc.value<i32>
  }

I think a big issue is that e.g. in

  ibis.method @math(%arg: !hw.struct<a: i32, b: i32>) -> i32 {
    %a, %b = hw.struct_explode %arg : !hw.struct<a: i32, b: i32>
    %0 = comb.add bin %a, %b : i32
    ibis.return %0 : i32
  }

we're mixing in implicit control flow semantics to the values. i.e. %arg contains implicit control flow, as does the return value, but we know that %a and %b does not. The right thing here seems to either have everything be a unit rate actor, or nothing - implying that we have explicit control flow. Cannot mix them without gifting ourselves a massive headache and confusion.

So yes, as you say, the endgame here is most likely just that we have to treat everything as a unit rate actor. This makes sense to me since this is also the conclusion that has been reached by DHLS. Is there a way around a large, monolithic conversion, then? ... well... in theory, if we do the above, then the correct intermediate abstraction that you're looking for is Handshake and not DC. And then use Handshake-to-DC lowering afterwards. That is, the ibis.method @math above has exactly the same semantics as:

  handshake.func @math(%arg: !hw.struct<a: i32, b: i32>) -> i32 {
    %a, %b = hw.struct_explode %arg : !hw.struct<a: i32, b: i32>
    %0 = comb.add bin %a, %b : i32
    ibis.return %0 : i32
  }
teqdruid commented 1 year ago

I'm progresively disliking option 3 more an more -

I'm liking it more and more.

I'd like to see DC to be strictly a control language, where other dialects can interface with DC by using DC types. Having a unit-rate actor op with an inner block puts a lot of load on that DC op to contain data-related stuff within it

I don't understand how this introduces data-related stuff into DC any more than pack/unpack. The block I'm proposing (and you were) just says "unpack this dc.value, hand off control flow to this block, then pack the result". It's the same as handing a value off to a pipeline -- "here's the unpacked value, 'go', then pack the result".

I don't see a DC op which wraps a block (which has both control flow and data flow) as involving DC in data flow since the DC ops and lowerings don't have to reason about the data flow. SCF is a control dialect but its ops have blocks which have computation.

Especially given that a unit-rate actor can be constructed from other DC ops + e.g. ESI valid/ready wrap-unwrapping.

How would I do that incrementally? I'd have the same problem as with DC.

Wiring the tokens together is not necessarily the right thing to do -- the control flow may dictate something different

Why not? This is the view i have of DC/handshake/dhls. In your example above, there is no branching control flow, so the IR could be converted verbatim to

I agree in that example that it's entirely valid. But what about in the presence of loops? A wait_for?

(i suppose you mean ibis.method here)

Yes I did. Updated.

in theory, if we do the above, then the correct intermediate abstraction that you're looking for is Handshake and not DC. And then use Handshake-to-DC lowering afterwards.

I'll have to give that some thought. I'd rather not use Handshake, but I may just be biased against fine-grained dynamic data flow. It's not easy to recover the control flow especially in the presence of side-effects and the idea of "recovering" semantics is exactly what MLIR is intended to avoid. It also implies that DC isn't useful for much without handshake.

we're mixing in implicit control flow semantics to the values.

I'm pretty sure that an Ibis method is standard SSACFG control flow, not dataflow as the op is currently marked. So the control flow isn't implicit -- it's linear. Other ops (e.g. CF) will add their own well-defined control flow semantics. Later on, it will contain multiple blocks and use the CF dialect.


Thoughts:

1) It probably makes sense to have the proposed container op have explicit control semantics -- SSACFG. I agree that if the control semantics aren't well defined, it probably doesn't make any sense to have that op in DC since a lowering wouldn't have the right knowledge to lower it correctly. If it's really dynamic dataflow, Handshake might make sense. (Again, I have some scarring from a previous life which biases me against fine-grained dynamic dataflow.)

2) If you remain uncomfortable with this op being in the DC dialect, it probably makes sense to have in the Ibis dialect. We're going to need a lowering at some point to convert the control flow constructs to DC, some of which will be ops in the Ibis dialect. I'd hope there are lowerings -- perhaps not passes as much as a public library of OpConversionPatterns (an idea I'm liking more and more) -- to lower ops in the CF dialect (perhaps other upstream control dialects as much as makes sense). An op like this makes sense to me to allow the "ibis-to-dc flow" to be somewhat more modular/incremental -- to make it valid IR before and after a pass to enable breaking it up into an arbitrary number of micro-passes. This feels more general than Ibis, though I do understand a reluctance to add more shit into a relatively simple, general dialect like DC (lest it become handshake).

3) Having this op would probably allow @blakep-msft to "dip his toe" into the DC waters -- we could use it for method calls but the inside of methods could be arbitrary code.

blakep-msft commented 1 year ago

I think the dip-my-toe strategy is important. It would be great if the code inside of methods didn't have to use DC from day 1.

mortbopet commented 1 year ago

In the original issue description, you mention:

Perhaps the answer is to do all the control flow to DC in one pass. I'd rather not do that since it requires that one pass have knowledge of all the control flow constructs (and we're planning to add some custom control flow to ibis) so it's not just CF.

I agree, it would be a shame to have to have everything in one big pass. However, I think it is very hard to do incremental control-flow lowering. Based on handshake experience, I don't see much of a way around doing it in one big pass, unless we want to keep a lot of state-IR around.

One thing we haven't written out in this discussion is what if we do have some of those other control flow constructs, e.g. an scf.for?

ibis.class @User {
  ibis.instance @c, @C
  ibis.var @var : memref<i32>
  ibis.method @wrapper(%arg: !hw.struct<a: i32, b: i32>) -> i32 {
    %a, %b = hw.struct_explode %arg : !hw.struct<a: i32, b: i32>
    %0 = hw.struct_create (%a, %b) : !hw.struct<a: i32, b: i32>
    %x = ibis.call @c::@math(%0): (i32, i32) -> i32
    %cmp = arith.cmp slt %x, %a : i32

    %this = ibis.path [
      #ibis.step<parent : !ibis.scoperef<@User>>
    ]
    %var = ibis.get_var %this, @var : !ibis.scoperef<@User> -> memref<i32>

    // If construct. Most likely mux-converted but could in theory also be converted to DC.
    scf.if %cm {
      scf.for %i = %x to %a {
        %var = memref.load %var : memref<i32>
        %var = arith.addi %var, %i : i32
        memref.store %var, %var : memref<i32>
      }
    } else {
      memref.store %x, %var : memref<i32>
    }
    ibis.return %x : i32
  }
}

Would we then have to place that scf.for inside the unit-rate actor - but then at what point do we lower the scf.for, and what if it has nested loops as well?

Although it would be a big and complex pass, I can picture how I'd go about writing a pass for converting the above to DC. it'd go something like:

  1. Iterate over the ops in order
  2. Maintain a list of operations which are not control-flow operations,
  3. Whenever we encounter a control flow operation, move the prior list into a ibis.unit operation
  4. Then, perform control flow lowering of the control flow operation, and continue.
    • the above may be recursive (in case of control flow operations with nested regions).

Doing a manual lowering, that'd mean that the above would result in something like:

ibis.class @User {
  ibis.instance @c, @C
  ibis.var @var : memref<i32>
  ibis.method @wrapper(%arg: !dc.value<!hw.struct<a: i32, b: i32>>) -> !dc.value<i32> {
    // dc.unit.li doesn't have input valid/output done. This would imply that dc.unit.li
    // is an abstraction which sees its internals as having latency-insensitive execution
    // This would then imply that there is a separate pass which converts dc.unit.li to a dc.unit.ls op
    // by taking the body of the dc.unit.li, showing it into a pipeline and connecting the pipeline
    // to the explicit valid/done signals of the dc.unit.ls.
    %1, %2 = ibis.unit.li (%arg : !dc.value<!hw.struct<a: i32, b: i32>>) -> (!dc.value<!hw.struct<a: i32, b: i32>>, !dc.value<i32>) {
      ^bb(%arg: !hw.struct<a: i32, b: i32>)
      %a, %b = hw.struct_explode %arg : !hw.struct<a: i32, b: i32>
      %0 = hw.struct_create (%a, %b) : !hw.struct<a: i32, b: i32>
      return %0, %a : !hw.struct<a: i32, b: i32>, i32
    }
    %x = ibis.call @c::@math(%1): (!dc.value<!hw.struct<a: i32, b: i32>>) -> !dc.value<i32>
    %3 = ibis.unit.li (%x : !dc.value<i32>, %2 : !dc.value<i32>) -> (!dc.value<i1>) {
      ^bb(%x: i32, %2: i32)
      %cmp = arith.cmp slt %x, %a : i32
      return %cmp : i32
    }    

    // scf.if -> dc.branch
    %taken, %not_taken = dc.branch %3

    // === scf.if true body
    // Lowering will analyse the body, figure out which values are live in (%x, %a (=> %2)) and
    // join them with the incoming control token (taken)
    %x_taken = dc.join %taken, %x // short hand for unpacking %x from its token, join, and repacking %x
    %2_taken = dc.join %taken, %2

    // ... here will be the whole loop construct (loop priming register, loop
    // counter, ...). That will then generate a control token for the loop body/
    // The loop body will then be lowered like above, with non-control ops being
    // emitted in a ibis.unit.li, and if a control op is present (i.e. nested loops),
    // we will convert the nested loop body using the incoming control token).

    %loop_done = ... : !dc.token

    // === scf.if false body -
    // lowering will analyse the body, figure out which values are live in (%x) and
    // join them with the incoming control token (not_taken)
    %x_not_taken = dc.join %taken, %x
    %x2 = ibis.unit.li (%x_not_taken : !dc.value<i32>) {
      ^bb(%x: i32)
      %this = ibis.path [#ibis.step<parent : !ibis.scoperef<@User>>]
      memref.store %x, %var : memref<i32>
      return %x : i32
    }

    // End of scf.if lowering - each branch will have an outgoing control token
    // registered 
    %if_done = dc.merge %loop_done, %x2
    %x_ret = dc.join %if_done, %x
    ibis.return %x_ret : i32
  }
}

Now obviously, there is a lot of that stuff that is essentially a no-op (like the first ibis.unit) but that would get optimized away at a later stage - the important thing is the procedure of switching between stuffing things into ibis.unit ops, and lowering control flow constructs.
Secondly, we'd for the above be using the "public" SCF-to-DC lowering pattern library alongside patterns for our custom Ibis ops. And lastly, as a nit, you mention CF a couple of times in the prior discussion - AFAIK, we're not going to encounter unstructured control flow, hence there should be no need to consider CF here, since all structured control flow can be raised to SCF now (https://reviews.llvm.org/D156889).