Open Dinistro opened 2 years ago
Thank you for filing the issue - this is one point which has been completely untouched so far, so any research into how best to do this would be interesting. The same goes for passing a memref block argument to other function calls.
As you mention, a correct lowering here would indeed be that a `handshake.memory` op is constructed in the `@normal_mem` function. But then, how does that work with our current scheme where we allow passing `memref`s to arguments of `handshake.func`s? Because there's obviously a mismatch there.
I've never been a fan of `handshake.memory` and `handshake.extmemory` having essentially the same interface and doing almost the same thing: `handshake.memory` both allocates and creates an interface to a handshaked memory, while `handshake.extmemory` only creates an interface. So we have some redundancy in the IR, which creates this funky situation.
Thinking about this, I'm wondering whether memories in the handshake dialect should be modified such that:

- `handshake.memory` is removed, and we allow `memref.alloc` operations in handshake IR
- `handshake.extmemory` is modified to a `handshake.meminterface`, but is kept identical to its current implementation.

By doing this we no longer have this mix of handshake memory ports only sometimes being materialized.
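As a rough sketch of what the proposed IR could look like (hypothetical: `handshake.meminterface` does not exist today, and the operand/result lists are elided):

```mlir
handshake.func @normal_mem(%ctrl: none, ...) -> (none, ...) {
  // Allocation stays a plain memref.alloc, now legal inside handshake IR.
  %mem = memref.alloc() : memref<4xi32>
  // A single interface op materializes the load/store ports for that
  // memory; this is today's handshake.extmemory behavior under a new name.
  %ldData, %done = handshake.meminterface(%mem, ...) : ...
  ...
}
```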
I'll try taking a look at the `memref.alloc`/`handshake.meminterface` approach to see if it makes sense.
Have you gained any insights so far?
One problem I'm seeing with the approach you are mentioning is when lowering these operations. While allowing `memref` types on the handshake level resolves certain things, a `meminterface` operation will influence the interface of the circuit: the circuit interface will depend on the number of loads and stores interacting with the memory. This can be dealt with when the `memref`-accepting function and its caller(s) are in the same compilation unit. In the case of different compilation units, with external functions that accept memory, this will break down, as one doesn't know the exact interface from the external `handshake::FuncOp` definition.
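To illustrate the point (port lists and syntax are purely schematic, not the actual lowering): the same `memref` argument type can correspond to different circuit interfaces depending on how often the callee accesses it, so an external declaration alone does not determine the interface:

```mlir
// Before lowering: the type carries no access information.
func.func private @external_mem(memref<4xi32>)

// After lowering (schematic): the circuit boundary grows one set of
// channels per load/store. With two loads and one store this could be:
handshake.func @external_mem(%ldAddr0: index, %ldAddr1: index,
                             %stAddr0: index, %stData0: i32,
                             %ctrl: none) -> (i32, i32, none)
```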
Correct, the approach requires that you have full awareness of the entire circuit and can analyze your way to statically determine the # of unique load/store accesses to any given memory. I started on this analysis approach, but work is currently taking priority.
This is not an easy problem to solve. I sense that the core of the problem lies in the fact that memory complexity is essentially fully outsourced to callers of the circuit. This goes not just for generating a unique load/store interface for every access in the program, but also for the ability to handle runtime memory conflicts. The latter requires that a load-store queue is attached to the interface of the lowered handshake function, something which I am 99% confident no one has done yet, nor made any progress on.
Ideas are welcome!
> I started on this analysis approach, but work is currently taking priority.
No worries. It seems that I'm the only "active" user of these parts of the handshake dialect, thus I should be the one investing time into that. I'll therefore invest some time into this issue, but it might be that we declare this part as out of scope for my thesis. Most of the use cases that would require memory would in any case require a load-store queue with conflict resolution to be even remotely efficient.
I plan to schedule a meeting with Lana Josipović, and this might be something I would like to discuss with her.
> load-store queue with conflict resolution to be even remotely efficient.

efficiency, at the cost of possibly huge resource usage - LSQs are not cheap...!
Are LSQs required for correctness, or is it a tradeoff for throughput? I was under the impression the way we thread control signals through should preserve the original program ordering (at the cost of throughput). But I've only tested pretty simple circuits, so maybe one could think up a counterexample.
> I was under the impression the way we thread control signals through should preserve the original program ordering (at the cost of throughput).

The latter, and you are correct: we do sequentially thread control signals through a join to ensure correctness; see https://github.com/llvm/circt/blob/main/lib/Conversion/StandardToHandshake/StandardToHandshake.cpp#L1553-L1562
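Schematically (this is not the exact IR the pass emits), the sequentialization joins each access's control token with the completion token of the preceding access:

```mlir
// The store's control input is a join of the incoming control token and
// the done signal of the earlier load, so the store cannot fire before
// the load completes: original program order is preserved.
%ctrl1 = handshake.join %ctrl0, %ldDone : none, none
```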
Thanks for the clarification.
> This is not an easy problem to solve. I sense that the core of the problem lies in the fact that memory complexity is essentially fully outsourced to callers of the circuit.
Fully agree with this part. @Dinistro thanks for offering to push on this, but if you end up declaring this out of scope, that sounds quite reasonable. We've elsewhere talked about other mechanisms for interfacing to Handshake circuits, and have been having some offline conversations about ESI services as a potential way to interface to "things", and I personally hope Handshake can be one of those things.
Thanks for pointing this out to me @mikeurbach. I already saw some parts of this, but didn't go into too much detail. The system that we are currently targeting has an XDMA interface exposed with AXI4 streams, using ESI services to interface with that might be a reasonable approach. I'll have to talk to some people that are more familiar with the system and the interfaces, though.
Passing the following code to `--lower-std-to-handshake` produces a weird output.

Output:

As you can see, a `memref.alloc()` remains, which is not expected and thus breaks further transformations. I would expect that a correct lowering would construct a `MemoryOp` in the `@normal_mem` function. How exactly this interacts with the external memory op is not clear to me, as this will cause cross-function dependencies depending on the number of memory operations in `@external_mem`.
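The original snippet is not reproduced above, but a hypothetical reduction of the kind of input described would look like this (function names are from the issue; the bodies, element types, and current upstream `func.func` syntax are guesses):

```mlir
// One function taking a memref argument and accessing it...
func.func @external_mem(%mem: memref<4xi32>) {
  %c0 = arith.constant 0 : index
  %v = memref.load %mem[%c0] : memref<4xi32>
  memref.store %v, %mem[%c0] : memref<4xi32>
  return
}

// ...and one that allocates a memory locally and passes it along.
func.func @normal_mem() {
  %mem = memref.alloc() : memref<4xi32>
  call @external_mem(%mem) : (memref<4xi32>) -> ()
  return
}
```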