tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
300 stars 23 forks source link

Offload Tunneller Component To Autonomous Streams #7724

Open SeanNijjar opened 2 months ago

SeanNijjar commented 2 months ago

This issue tracks the work to replace the tunneller with autonomous streams so that the dispatcher and return path can continue to use the ethernet link to send data to/from remote chips while also making the ethernet core available for kernel usage.

Lightly annotated diagram: image

Design Points

Tunneller Stream (Sender):

Muxer Stream:

Demuxer Stream:

Tuning:

SeanNijjar commented 2 months ago

One issue we recently identified with this approach is that there is no hard size limit on each packet coming into muxer. Any given packet can be larger than the muxer buffer size and it will have only one header for the entire packet - muxer will simply wrap around and read a 4k page at a time.

This is problematic for a stream based flow because streams will (start to) send the entire payload as soon as it's available, which means by definition they can't support packet size > buffer size because a packet that large can never fully reside within the buffer at any given time snapshot.

I think to support this properly, we'd probably need to split packets to max(packet_size, muxer_buffer_size). This also means that we want to commit the entire packet to L1 before we initiate the stream based noc write. (e.g. for an 8k payload, we must wait for the full 8K (both pages) to land in L1 before we can start sending the first page.)

@davorchap, @imatosevic, @pgkeller

SeanNijjar commented 1 month ago

Other potential hiccup we need to be aware of is that the noc reads/writes must be 32B aligned, but stream message headers are only 16B, so we need to add 16B additional padding after the header so we can properly copy-in/copy-out to/from circular buffers

davorchap commented 1 month ago

Other potential hiccup we need to be aware of is that the noc reads/writes must be 32B aligned, but stream message headers are only 16B, so we need to add 16B additional padding after the header so we can properly copy-in/copy-out to/from circular buffers

L1 to L1 only 16B aligned , which is what we need for tunnelling

Only DRAM read/write need 32B

SeanNijjar commented 2 weeks ago

Hitting some annoying code size issues when integrating the stream read/write into packet_mux and packet_demux...

pgkeller commented 2 weeks ago

Hitting some annoying code size issues when integrating the stream read/write into packet_mux and packet_demux...

Sean - I think the mux/demux are running on brisc at present which has a 10K size limit. ncrisc has a 16K size limit, maybe run there? and yes, this needs to be fixed, coming soon (increase limit for brisc)

SeanNijjar commented 1 week ago

Providing an update since it's been a few days. I wasn't able to resolve the code size issue by moving to ncrisc but I did find that a code pattern that was added lead to the massive bloat and I was able to remove it. I didn't inspect the assembly but the change suggested that the compiler was previously able to concretize the packet queue classes and get rid of all the VFT lookup and related code but with my change it could no longer do that.

I've fixed that and I'm proceeding with integration. It was going somewhat smoothly in that I've been able to establish handshakes between packet mux, relay streams, and packet demux and that I've also seen some number of messages flow from packet_mux all the way to packet_demux (though the packet demux code doesn't understand message clearing and progress; I've just stumbled upon a bunch of internal state that seemingly straddles a large part of the main loop code paths and that packet demux relies on (via the input queue class). So while I originally thought it would be a pretty simple case of dropping in the stream read/write calls in a few isolated spots, I'm not seeing that I either need to start exposing a lot more internal state of the streams to the code (like buffer offsets) or selectively disable a handful of code paths when dealing with a stream endpoint.

Both will require some careful study of the code here to make sure I don't mess anything up!