Open SeanNijjar opened 2 months ago
One issue we recently identified with this approach is that there is no hard size limit on each packet coming into the muxer. Any given packet can be larger than the muxer buffer size, and it will have only one header for the entire packet - the muxer will simply wrap around and read a 4K page at a time.
This is problematic for a stream-based flow because streams will (start to) send the entire payload as soon as it's available, which means by definition they can't support packet size > buffer size: a packet that large can never fully reside within the buffer at any given snapshot in time.
I think to support this properly, we'd probably need to split packets into chunks of min(packet_size, muxer_buffer_size). This also means that we want to commit the entire packet to L1 before we initiate the stream-based NOC write. (e.g. for an 8K payload, we must wait for the full 8K (both pages) to land in L1 before we can start sending the first page.)
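As a rough sketch of the splitting idea (the function and names here are illustrative, not the repo's actual API), each packet would be carved into chunks no larger than the muxer buffer, so every chunk can fully land in L1 before its stream-based NOC write starts:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: split an oversized payload into chunk sizes that each
// fit entirely within the muxer buffer, so a chunk can be fully committed
// to L1 before the stream-based NOC write for it is initiated.
std::vector<uint32_t> split_packet(uint32_t packet_size, uint32_t muxer_buffer_size) {
    std::vector<uint32_t> chunk_sizes;
    uint32_t remaining = packet_size;
    while (remaining > 0) {
        // Each chunk is min(remaining, buffer size), never larger than the buffer.
        uint32_t chunk = std::min(remaining, muxer_buffer_size);
        chunk_sizes.push_back(chunk);
        remaining -= chunk;
    }
    return chunk_sizes;
}
```

For the 8K-payload example above with a 4K muxer buffer, this yields two 4K chunks, each of which must fully reside in L1 before it is sent.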
@davorchap, @imatosevic, @pgkeller
Another potential hiccup we need to be aware of: NOC reads/writes must be 32B aligned, but stream message headers are only 16B, so we need to add 16B of additional padding after the header so we can properly copy in/out of circular buffers.
L1-to-L1 transfers only need 16B alignment, which is what we need for tunnelling.
Only DRAM reads/writes need 32B alignment.
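To make the padding arithmetic concrete (constants named here are illustrative; the 16B header and 32B DRAM alignment figures come from the discussion above), a standard align-up helper shows why 16B of padding after the header is enough:

```cpp
#include <cstdint>

constexpr uint32_t STREAM_MSG_HEADER_BYTES = 16;  // stream message header size
constexpr uint32_t NOC_DRAM_ALIGNMENT = 32;       // DRAM reads/writes need 32B alignment
constexpr uint32_t NOC_L1_ALIGNMENT = 16;         // L1-to-L1 only needs 16B alignment

// Round n up to the next multiple of `alignment` (alignment must be a power of two).
constexpr uint32_t align_up(uint32_t n, uint32_t alignment) {
    return (n + alignment - 1) & ~(alignment - 1);
}

// A 16B header padded out to 32B keeps the payload 32B-aligned for DRAM access,
// i.e. exactly 16B of padding is inserted after the header.
static_assert(align_up(STREAM_MSG_HEADER_BYTES, NOC_DRAM_ALIGNMENT) == 32);
// For L1-to-L1 tunnelling the bare 16B header is already sufficiently aligned.
static_assert(align_up(STREAM_MSG_HEADER_BYTES, NOC_L1_ALIGNMENT) == 16);
```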
Hitting some annoying code size issues when integrating the stream read/write into `packet_mux` and `packet_demux`...
Sean - I think the mux/demux are running on brisc at present, which has a 10K size limit. ncrisc has a 16K size limit; maybe run there? And yes, this needs to be fixed - an increased limit for brisc is coming soon.
Providing an update since it's been a few days. I wasn't able to resolve the code size issue by moving to ncrisc, but I did find that a code pattern that was added led to the massive bloat, and I was able to remove it. I didn't inspect the assembly, but the change suggests that the compiler was previously able to concretize the packet queue classes and get rid of all the VFT lookups and related code; with my change it could no longer do that.
I've fixed that and am proceeding with integration. It's been going somewhat smoothly: I've established handshakes between packet mux, relay streams, and packet demux, and I've also seen some messages flow from packet_mux all the way to packet_demux (though the packet demux code doesn't yet understand message clearing and progress). However, I've stumbled upon a bunch of internal state that straddles a large part of the main loop code paths and that packet demux relies on (via the input queue class). So while I originally thought this would be a pretty simple case of dropping in the stream read/write calls in a few isolated spots, I'm now seeing that I either need to expose a lot more of the streams' internal state to the code (like buffer offsets) or selectively disable a handful of code paths when dealing with a stream endpoint.
Both will require some careful study of the code here to make sure I don't mess anything up!
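For context on the bloat mechanism described above, here is a minimal illustration (not the repo's actual classes) of the difference between virtual dispatch, which the compiler can only strip out when it can prove the concrete type, and a CRTP-style static interface, where the concrete type is always known and calls are direct and inlinable:

```cpp
#include <cstdint>

// Virtual dispatch: every call goes through the vtable unless the compiler
// can devirtualize it, which also blocks inlining and keeps the VFT and
// all reachable overrides in the binary.
struct PacketQueueBase {
    virtual ~PacketQueueBase() = default;
    virtual uint32_t pop() = 0;
};

// CRTP alternative: the concrete queue type is a template parameter, so the
// call resolves at compile time - no vtable, and dead paths can be dropped.
template <typename Impl>
struct PacketQueueCRTP {
    uint32_t pop() { return static_cast<Impl*>(this)->pop_impl(); }
};

// Toy concrete queue that just hands out increasing message indices.
struct StreamQueue : PacketQueueCRTP<StreamQueue> {
    uint32_t next = 0;
    uint32_t pop_impl() { return next++; }
};
```

The point is only that a seemingly small code change (e.g. taking a queue through a base-class pointer) can silently re-enable the virtual path and balloon code size on a core with a 10K limit.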
This issue tracks the work to replace the tunneller with autonomous streams so that the dispatcher and return path can continue to use the ethernet link to send data to/from remote chips while also making the ethernet core available for kernel usage.
Lightly annotated diagram: ![image](https://github.com/tenstorrent/tt-metal/assets/1865303/61444af1-aa6c-41e6-b218-2eb00d31bcd0)
Design Points

- Tunneller Stream (Sender):
- Muxer Stream: `next_phase_dst_change=true` to align with `next_phase_dst_change=true` on the tunneller stream
- Demuxer Stream:
- Tuning: