tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Add fabric queue to EDM as its channel implementation #13392

Open SeanNijjar opened 1 month ago

SeanNijjar commented 1 month ago

Restructure the packet_queue structures from fast dispatch for reuse in EDM. Along the way, clean up the interfaces so the queue is more amenable to optimization and specialization by sender/receiver configuration (e.g. core type and transport medium: NoC or ethernet).

Queue Restructure

I want to restructure the packet queue for this task for a handful of reasons:

  1. The interface to read/write from the queues is obfuscated and mixed in with the underlying container. I want to separate the interface from the container to simplify the implementation and make it more extensible.
    • The design is more reasonable because we can have a generic interface (which would implement something similar to the rd/wr ptr updates of today) but then conditionally enable specialized behaviour for improved perf
    • Let's be real, packet_queue could be improved quite a bit here
  2. This separation of implementation and interface will let us perform extensive debug and bringup on host for a large part of the functionality
  3. We can conditionally specialize the interface implementation by src/dest core type combinations
    • An optimization pathway while keeping things modular (today things are entangled and hard to reason about/optimize)
  4. Some logic (e.g. when exactly to call eth_send_packet) is buried at too low a level for the main loops to manage appropriately. For optimized implementations on ethernet, the queue reader/writer interfaces need to be thinner so they can more easily do things like check for txcmdq fullness and either wait or queue up the eth commands for later, when the cmdq has space
  5. We need these APIs formalized anyway to provide a consistent and modular interface for user kernels (e.g. CCLs)
  6. Remove stream register usage (another inflexibility whose benefit is dubious)
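To make the txcmdq point concrete, here is a minimal host-side sketch of a thin sender that issues a command when there is space and otherwise parks it for a later pass of the main loop. Everything in it is an assumption for illustration: `eth_cmd_t`, `mock_txcmdq_t`, and `deferred_eth_sender_t` are hypothetical names, the mock stands in for the real ethernet tx command queue, and `std::deque` is used only because this is meant to run on host rather than in a kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical description of one ethernet send command.
struct eth_cmd_t {
    uint32_t src_addr;
    uint32_t dst_addr;
    uint32_t num_words;
};

// Hypothetical host-side mock of the tx command queue (not the real hardware interface).
struct mock_txcmdq_t {
    explicit mock_txcmdq_t(std::size_t cap) : capacity(cap) {}
    std::size_t capacity;
    std::size_t in_flight = 0;

    bool full() const { return in_flight >= capacity; }
    void issue(const eth_cmd_t&) { ++in_flight; }
    void complete_one() { if (in_flight > 0) { --in_flight; } }
};

// Thin sender: the caller (main loop) decides when to retry deferred commands.
class deferred_eth_sender_t {
public:
    explicit deferred_eth_sender_t(mock_txcmdq_t& txq) : txq_(txq) {}

    // Returns true if the command was issued immediately, false if it was deferred.
    // Commands are never reordered: anything new queues behind already-pending ones.
    bool send_or_defer(const eth_cmd_t& cmd) {
        if (pending_.empty() && !txq_.full()) {
            txq_.issue(cmd);
            return true;
        }
        pending_.push_back(cmd);
        return false;
    }

    // Issues as many pending commands as the cmdq has space for; returns how many were issued.
    std::size_t drain_pending() {
        std::size_t issued = 0;
        while (!pending_.empty() && !txq_.full()) {
            txq_.issue(pending_.front());
            pending_.pop_front();
            ++issued;
        }
        return issued;
    }

    std::size_t num_pending() const { return pending_.size(); }

private:
    mock_txcmdq_t& txq_;
    std::deque<eth_cmd_t> pending_;
};
```

The main loop would call `drain_pending()` once per iteration, which keeps the decision of when to touch the cmdq at the top level instead of inside the queue.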

High Level Design (Queue Interfaces)

Define remote endpoint type

enum class RemoteQueueCoreType : uint8_t{
    // The interface is communicating with a queue on a worker core on the local chip,
    // the initiator will usually be worker or ethernet core on the local chip
    LOCAL_CHIP_WORKER,

    // The interface is communicating with a queue on an ethernet core on the local chip,
    // typically this will be a worker talking to an ethernet core
    LOCAL_CHIP_ETHERNET,

    // The interface is communicating with a queue on a remote chip,
    // over the ethernet link. This will always be an ethernet core
    REMOTE_CHIP_ETHERNET,
};

Define the reader interface:

template <RemoteQueueCoreType tx_q_core_type, RemoteQueueCoreType rx_q_core_type>
struct packet_queue_receiver_t {
   // generic implementation which reflects the remote q ptr updates found today
   // but only care about the parts that do flow control
   // i.e. send_first_level_ack, send_second_level_ack, check_new_data, ...

   void advance_queue_remote_rptr_sent(uint32_t num_words);

   void advance_queue_remote_rptr_cleared(uint32_t num_words);
};

template <> struct packet_queue_receiver_t<RemoteQueueCoreType::LOCAL_CHIP_WORKER, RemoteQueueCoreType::LOCAL_CHIP_WORKER> {  /* optional specialization for optimization */ };

// ... Note we can incrementally enable these over time - no explicit need to implement every possible combination

template <> struct packet_queue_receiver_t<RemoteQueueCoreType::REMOTE_CHIP_ETHERNET, RemoteQueueCoreType::REMOTE_CHIP_ETHERNET> {  /* optional specialization for optimization */ };
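As a rough sketch of how the generic template and an optional specialization could coexist behind one interface, the following compiles on host. The member bodies, the `is_specialized` flag, the `ack_words` helper, and the two type aliases are all assumptions added for illustration; the "optimization" in the specialization is only a placeholder showing that the body can differ while callers stay unchanged.

```cpp
#include <cstdint>

enum class RemoteQueueCoreType : uint8_t {
    LOCAL_CHIP_WORKER,
    LOCAL_CHIP_ETHERNET,
    REMOTE_CHIP_ETHERNET,
};

// Generic receiver: placeholder bodies that just count words per flow-control step.
template <RemoteQueueCoreType tx_q_core_type, RemoteQueueCoreType rx_q_core_type>
struct packet_queue_receiver_t {
    static constexpr bool is_specialized = false;  // hypothetical flag, for the demo only
    uint32_t rptr_sent = 0;
    uint32_t rptr_cleared = 0;

    void advance_queue_remote_rptr_sent(uint32_t num_words) { rptr_sent += num_words; }
    void advance_queue_remote_rptr_cleared(uint32_t num_words) { rptr_cleared += num_words; }
};

// Optional specialization: same interface, different (placeholder) implementation
// that tracks a single pointer instead of two.
template <>
struct packet_queue_receiver_t<RemoteQueueCoreType::LOCAL_CHIP_WORKER,
                               RemoteQueueCoreType::LOCAL_CHIP_WORKER> {
    static constexpr bool is_specialized = true;
    uint32_t rptr = 0;

    void advance_queue_remote_rptr_sent(uint32_t num_words) { rptr += num_words; }
    void advance_queue_remote_rptr_cleared(uint32_t /*num_words*/) {}  // folded into the update above
};

using worker_to_worker_receiver_t =
    packet_queue_receiver_t<RemoteQueueCoreType::LOCAL_CHIP_WORKER,
                            RemoteQueueCoreType::LOCAL_CHIP_WORKER>;
using worker_to_eth_receiver_t =
    packet_queue_receiver_t<RemoteQueueCoreType::LOCAL_CHIP_WORKER,
                            RemoteQueueCoreType::LOCAL_CHIP_ETHERNET>;

// Caller code is written once against the interface; the compiler picks the implementation.
template <typename Receiver>
void ack_words(Receiver& receiver, uint32_t num_words) {
    receiver.advance_queue_remote_rptr_sent(num_words);
    receiver.advance_queue_remote_rptr_cleared(num_words);
}
```

Combinations without a specialization, such as `worker_to_eth_receiver_t` here, fall back to the generic template, which is what allows the specializations to be added incrementally.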

Define the writer interface, analogous to the reader above but for the write side:

template <RemoteQueueCoreType tx_q_core_type, RemoteQueueCoreType rx_q_core_type>
struct packet_queue_sender_t {
   // generic implementation which reflects the remote q ptr updates found today
   // but only care about the parts that do flow control
   // i.e. wptr update

   void advance_queue_remote_wptr(uint32_t num_words);
};

Implement the packet_queue container

This will be a basic container that only does pointer update management, returns emptiness/fullness, etc.
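A minimal sketch of what such a container could look like, assuming sizes are counted in words and that the container knows nothing about the transport. The class name `packet_queue_container_t` and all of its members are hypothetical; the point is only that pointer bookkeeping can live on its own and be exercised on host.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container: pointer bookkeeping only, no NoC or ethernet knowledge.
// Assumes size_words fits comfortably in 31 bits so offset arithmetic cannot overflow.
class packet_queue_container_t {
public:
    explicit packet_queue_container_t(uint32_t size_words) : size_words_(size_words) {
        assert(size_words > 0);
    }

    uint32_t occupancy_words() const { return occupancy_; }
    uint32_t free_words() const { return size_words_ - occupancy_; }
    bool empty() const { return occupancy_ == 0; }
    bool full() const { return occupancy_ == size_words_; }

    // Offsets (in words) into the backing buffer for the next read and the next write.
    uint32_t rd_offset_words() const { return rptr_; }
    uint32_t wr_offset_words() const { return (rptr_ + occupancy_) % size_words_; }

    // Producer side: claim num_words of space. Fails without side effects if it does not fit.
    bool advance_wptr(uint32_t num_words) {
        if (num_words > free_words()) {
            return false;
        }
        occupancy_ += num_words;
        return true;
    }

    // Consumer side: release num_words. Fails without side effects if that much is not present.
    bool advance_rptr(uint32_t num_words) {
        if (num_words > occupancy_) {
            return false;
        }
        rptr_ = (rptr_ + num_words) % size_words_;
        occupancy_ -= num_words;
        return true;
    }

private:
    uint32_t size_words_;
    uint32_t rptr_ = 0;       // always in [0, size_words_)
    uint32_t occupancy_ = 0;  // words written but not yet released
};
```

Tracking a read offset plus an occupancy count, rather than two free-running pointers, avoids the usual ambiguity between empty and full and works for sizes that are not a power of two.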

ubcheema commented 1 month ago

Sounds good. Can you clarify what these mean w.r.t. packet_queue_receiver_t:

template <RemoteQueueCoreType tx_q_core_type, RemoteQueueCoreType rx_q_core_type> struct packet_queue_receiver_t

Is tx_q_core the sender to packet_queue_receiver_t and rx_q_core a receiver (downstream) of packet_queue_receiver_t?

Does it make sense to use upstream/downstream instead of tx/rx?

SeanNijjar commented 1 month ago

Thanks for the questions. I will produce some diagrams to make it clearer.

For the type names, yes, we can change them. For the receiver it could be upstream_core_type and my_core_type, where the receiver interface implements the destination side of a queue writer/reader connection. For example, if a worker writes into a queue on Ethernet, the receiver interface would be implemented on the Ethernet core. If that Ethernet core further forwarded data, it would also implement a writer interface. (At that point it may make sense to have a fused forwarder implementation: input queue, output queue, forwarder queue.)

SeanNijjar commented 1 month ago

I was thinking of something like this:

[image]

Although thinking about it more, it probably makes sense to have 3 types of packet_queue interfaces: sender, receiver, and forwarder, which serve startpoints (e.g. centers or workers), endpoints (final center or workers), and hops along a route, respectively.
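One way to read the three-way split is as a mapping from a core's position on a route to the interface it would instantiate. The sketch below is only that reading written down; `QueueInterfaceRole` and `role_for_route_position` are hypothetical names, and a route is assumed to have at least two positions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical role tag for the three proposed interface types.
enum class QueueInterfaceRole : uint8_t {
    SENDER,     // startpoint of a route
    RECEIVER,   // endpoint of a route
    FORWARDER,  // intermediate hop
};

// Maps a zero-based position on a route of route_length cores to a role.
inline QueueInterfaceRole role_for_route_position(uint32_t position, uint32_t route_length) {
    assert(route_length >= 2 && position < route_length);
    if (position == 0) {
        return QueueInterfaceRole::SENDER;
    }
    if (position == route_length - 1) {
        return QueueInterfaceRole::RECEIVER;
    }
    return QueueInterfaceRole::FORWARDER;
}
```

For a four-core route this gives one sender, two forwarders, and one receiver.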