Offload Tunneller Component To Autonomous Streams

SeanNijjar commented 2 months ago

This issue tracks the work to replace the tunneller with autonomous streams so that the dispatcher and return path can continue to use the ethernet link to send data to/from remote chips while also making the ethernet core available for kernel usage.

Lightly annotated diagram:

Design Points

Tunneller Stream (Sender):

max tiles per phase (2k)
MUST HAVE >= 2 phases
- (need > 1 phase so that the phase ID changes, which avoids a hang scenario)
- phase 1: auto-cfgs to phase 2
- next_phase_src_change = false
- next_phase_dest_change = false
- phase 2: auto-cfgs back to phase 1
- next_phase_src_change = is_sender_tunneller ? true : true
  - Can we keep it false for receiver side?
- next_phase_dest_change = is_sender_tunneller ? true : true
  - Can we keep it false for sender side?
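
The two-phase loop described above can be sketched as a small table of phase descriptors. This is only an illustration: `PhaseConfig`, `make_tunneller_phases`, and `kMaxTilesPerPhase` are hypothetical names, not the actual stream register interface, and "2k" is assumed to mean 2048. The flag values mirror the notes as written and leave the open questions about relaxing them unanswered.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical descriptor for one stream phase; not the real register layout.
struct PhaseConfig {
    uint32_t phase_id;
    uint32_t next_phase_id;       // phase the stream auto-configures into
    uint32_t max_tiles;           // tiles sent during this phase
    bool next_phase_src_change;
    bool next_phase_dest_change;
};

// Assumption: "2k" in the notes means 2048 tiles.
constexpr uint32_t kMaxTilesPerPhase = 2048;

// Builds the two-phase loop: phase 1 -> phase 2 -> phase 1 -> ...
// Two phases are used so that the phase ID always changes on reconfiguration.
inline std::array<PhaseConfig, 2> make_tunneller_phases(uint32_t tiles_per_phase) {
    assert(tiles_per_phase > 0 && tiles_per_phase <= kMaxTilesPerPhase);
    return {{
        {1, 2, tiles_per_phase, false, false},  // phase 1: no src/dest change
        {2, 1, tiles_per_phase, true, true},    // phase 2: src/dest change on wrap
    }};
}
```

Calling `make_tunneller_phases(2048)` yields a pair of descriptors in which only the phase 2 entry requests a src/dest change before looping back to phase 1.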

Muxer Stream:

Each "tile"/phase is a fw induced stream copy (FW will program a new phase for every write from muxer -> tunneller)
- next_phase_dst_change=false
- After N tiles/messages, set next_phase_dst_change=true to align with next_phase_dst_change=true on the tunneller stream
- Muxer counts number of messages sent to know when to do dest_change (to drain stream buffer, reset stream pointers to 0, including tile header buffer)
- Muxer could use unused stream registers to track # messages sent (avoids globals)
- Gotcha/Watchout: Noc copies don't do buffer wraparounds but streams do, so next address calculations made by Muxer must be updated to do proper wraparound
- Gotcha/Watchout: Message header in muxer holds num 4B words, not num 16B words, so count needs to be adjusted when programming stream
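
The muxer bookkeeping and the two watchouts above could look roughly like the sketch below. All names (`words4_to_words16`, `next_addr_wrapped`, `DestChangeCounter`) are hypothetical helpers, not existing tt-metal functions. Rounding a partial 16B word up is an assumption, and the counter is a plain struct here even though the notes suggest keeping it in an unused stream register.

```cpp
#include <cassert>
#include <cstdint>

// The message header length is in 4B words, while the stream is programmed in
// 16B words. Assumption: a partial 16B word is rounded up.
inline uint32_t words4_to_words16(uint32_t num_4B_words) {
    return (num_4B_words + 3) / 4;
}

// Next address inside the circular buffer [base, base + size), wrapping the way
// a stream does (plain noc copies would not wrap).
inline uint32_t next_addr_wrapped(uint32_t base, uint32_t size,
                                  uint32_t addr, uint32_t advance_bytes) {
    assert(size > 0 && addr >= base && addr < base + size);
    return base + (addr - base + advance_bytes) % size;
}

// Hypothetical message counter: on_message_sent() returns true on every Nth
// message, which is when next_phase_dst_change would be set, then restarts.
struct DestChangeCounter {
    uint32_t msgs_per_dest_change;
    uint32_t sent = 0;

    bool on_message_sent() {
        if (++sent == msgs_per_dest_change) {
            sent = 0;
            return true;
        }
        return false;
    }
};
```

For example, a write that starts 0x10 bytes before the end of a 0x100-byte buffer and advances by 0x20 bytes lands 0x10 bytes past the buffer base rather than past its end.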

Demuxer Stream:

Similar but reverse of Muxer

Tuning:

[ ] tune tile header buffer: sweep phase tile counts that minimize amount of tile header buffering relative to throughput
- tile header buffer doesn't wrap so we need to do a remote src/dest change at the end of the phase in order to do a reset to the start of the tile header buffer
- this incurs a perf overhead because we need to wait for the stream buffer to drain (rdptr to catch up to wrptr), which can introduce bubbles
[ ] tune data buffer size (minimize)

SeanNijjar commented 2 months ago

One issue we recently identified with this approach is that there is no hard size limit on each packet coming into muxer. Any given packet can be larger than the muxer buffer size and it will have only one header for the entire packet - muxer will simply wrap around and read a 4k page at a time.

This is problematic for a stream based flow because streams will (start to) send the entire payload as soon as it's available, which means by definition they can't support packet size > buffer size because a packet that large can never fully reside within the buffer at any given time snapshot.

I think to support this properly, we'd probably need to split packets into pieces of min(packet_size, muxer_buffer_size). This also means that we want to commit the entire packet to L1 before we initiate the stream-based noc write (e.g. for an 8K payload, we must wait for the full 8K (both pages) to land in L1 before we can start sending the first page).
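
A minimal sketch of that splitting rule, assuming each piece is capped at the muxer buffer size so that it can fully reside in the buffer before the stream send starts. `split_packet` and `chunk_ready_to_send` are hypothetical helpers for illustration, not part of packet_mux.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Splits a packet into chunk sizes, each at most buffer_bytes, so that every
// chunk can fully reside in the muxer buffer at one time.
inline std::vector<uint32_t> split_packet(uint32_t packet_bytes, uint32_t buffer_bytes) {
    assert(buffer_bytes > 0);
    std::vector<uint32_t> chunks;
    while (packet_bytes > 0) {
        const uint32_t chunk = std::min(packet_bytes, buffer_bytes);
        chunks.push_back(chunk);
        packet_bytes -= chunk;
    }
    return chunks;
}

// True once every byte of the current chunk has landed in L1; only then would
// the stream-based write be initiated.
inline bool chunk_ready_to_send(uint32_t bytes_landed, uint32_t chunk_bytes) {
    return bytes_landed >= chunk_bytes;
}
```

With an 8K buffer, a 20K packet becomes chunks of 8K, 8K, and 4K, and an 8K chunk with only its first 4K page landed is not yet eligible to send.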

@davorchap, @imatosevic, @pgkeller

SeanNijjar commented 1 month ago

Another potential hiccup we need to be aware of is that noc reads/writes must be 32B aligned, but stream message headers are only 16B, so we need to add 16B of additional padding after the header so that we can properly copy in/out to/from circular buffers.

davorchap commented 1 month ago

> Other potential hiccup we need to be aware of is that the noc reads/writes must be 32B aligned, but stream message headers are only 16B, so we need to add 16B additional padding after the header so we can properly copy-in/copy-out to/from circular buffers

L1-to-L1 transfers only need 16B alignment, which is what we need for tunnelling.

Only DRAM reads/writes need 32B alignment.
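
The padding arithmetic from this exchange can be checked with a short sketch. The constants come from the comments above (16B header, 16B alignment for L1 to L1, 32B for DRAM); the helper names `align_up`, `header_padding`, and `payload_addr` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kStreamHeaderBytes = 16;  // stream message header size
constexpr uint32_t kL1Alignment = 16;        // L1-to-L1 transfers
constexpr uint32_t kDramAlignment = 32;      // DRAM reads/writes

// Rounds value up to the next multiple of alignment (must be a power of two).
constexpr uint32_t align_up(uint32_t value, uint32_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Padding bytes needed after the header so that the payload starts aligned.
constexpr uint32_t header_padding(uint32_t alignment) {
    return align_up(kStreamHeaderBytes, alignment) - kStreamHeaderBytes;
}

// Payload start address for a header placed at an aligned address.
inline uint32_t payload_addr(uint32_t header_addr, uint32_t alignment) {
    assert(header_addr % alignment == 0);
    return header_addr + kStreamHeaderBytes + header_padding(alignment);
}

static_assert(header_padding(kL1Alignment) == 0, "no padding for L1-to-L1");
static_assert(header_padding(kDramAlignment) == 16, "16B padding for DRAM");
```

Under these numbers the extra 16B of padding is only needed when 32B alignment applies; with 16B alignment the payload can follow the header directly.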

SeanNijjar commented 2 weeks ago

Hitting some annoying code size issues when integrating the stream read/write into packet_mux and packet_demux...

pgkeller commented 2 weeks ago

> Hitting some annoying code size issues when integrating the stream read/write into packet_mux and packet_demux...

Sean - I think the mux/demux are running on brisc at present, which has a 10K size limit. ncrisc has a 16K size limit, so maybe run there? And yes, this needs to be fixed; coming soon (increase limit for brisc).

SeanNijjar commented 1 week ago

Providing an update since it's been a few days. I wasn't able to resolve the code size issue by moving to ncrisc, but I did find that a code pattern I had added led to the massive bloat, and I was able to remove it. I didn't inspect the assembly, but the change suggested that the compiler was previously able to concretize the packet queue classes and get rid of all the VFT lookups and related code, and that with my change it could no longer do that.

I've fixed that and I'm proceeding with integration. It was going somewhat smoothly: I've been able to establish handshakes between packet mux, relay streams, and packet demux, and I've also seen some number of messages flow from packet_mux all the way to packet_demux (though the packet demux code doesn't understand message clearing and progress). However, I've just stumbled upon a bunch of internal state that seemingly straddles a large part of the main loop code paths and that packet demux relies on (via the input queue class). So while I originally thought it would be a pretty simple case of dropping in the stream read/write calls in a few isolated spots, I'm now seeing that I either need to start exposing a lot more internal state of the streams to the code (like buffer offsets) or selectively disable a handful of code paths when dealing with a stream endpoint.

Both will require some careful study of the code here to make sure I don't mess anything up!

tenstorrent / tt-metal