tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Feature Request] Mailbox/pipe API for core-to-core communication #7916

Open marty1885 opened 6 months ago

marty1885 commented 6 months ago

Is your feature request related to a problem? Please describe.

Currently, sending data between Tensix cores means using semaphores and DMA into another core's L1. This is difficult for most people to work with, and it is very easy to get wrong. Developing multi-core kernels would be easier if sending data were simpler than it is now.
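To make the bookkeeping concrete, here is a minimal host-side toy model of a semaphore-plus-copy handshake, written in plain C++ so it runs anywhere. It is a sketch under stated assumptions, not tt-metal code: `CoreL1`, `manual_send`, `manual_receive`, and `run_manual_handshake` are hypothetical names, `memcpy` stands in for the DMA write, and plain integers stand in for the semaphores.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side model only: none of these names are tt-metal APIs.
constexpr std::size_t kTileBytes = 2 * 32 * 32;  // one fp16 tile, as in the issue

struct CoreL1 {
    std::array<std::uint8_t, kTileBytes> slot{};  // landing buffer in the receiver's "L1"
    std::uint32_t sem_ready = 1;  // receiver -> sender: the slot is free
    std::uint32_t sem_valid = 0;  // sender -> receiver: the slot holds data
};

// Sender side. Every step is required, in this order; returns false if it would block.
bool manual_send(CoreL1& remote, const std::uint8_t* src) {
    if (remote.sem_ready == 0) return false;            // 1. wait until the slot is free
    remote.sem_ready = 0;                               // 2. consume the "ready" token
    std::memcpy(remote.slot.data(), src, kTileBytes);   // 3. stand-in for the DMA write
    remote.sem_valid = 1;                               // 4. tell the receiver data arrived
    return true;
}

// Receiver side. Mirrors the sender; returns false if no data is available yet.
bool manual_receive(CoreL1& local, std::uint8_t* dst) {
    if (local.sem_valid == 0) return false;             // 1. wait for data
    std::memcpy(dst, local.slot.data(), kTileBytes);    // 2. consume the tile
    local.sem_valid = 0;                                // 3. clear the "valid" flag
    local.sem_ready = 1;                                // 4. hand the slot back
    return true;
}

// Steps both "cores" in turn and returns the first byte of every tile received.
std::vector<std::uint8_t> run_manual_handshake(std::uint32_t n_tiles) {
    CoreL1 receiver_l1;
    std::vector<std::uint8_t> tile(kTileBytes), out(kTileBytes), seen;
    std::uint32_t sent = 0;
    while (seen.size() < n_tiles) {
        if (sent < n_tiles) {
            std::fill(tile.begin(), tile.end(), static_cast<std::uint8_t>(sent));
            if (manual_send(receiver_l1, tile.data())) ++sent;
        }
        if (manual_receive(receiver_l1, out.data())) seen.push_back(out[0]);
        assert(sent >= seen.size());  // nothing is received before it is sent
    }
    return seen;
}
```

Even in this simplified model, each side has four ordered steps and two flags to keep consistent; skipping or reordering any of them loses or corrupts a tile.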

Describe the solution you'd like

There are many solutions to the problem. Personally, I prefer what AMD is doing with MLIR-AIE (ref PDF, pages 47-53).

The following is mocked host and kernel code that I hope demonstrates the principles and use of the design. A lot of detail is missing, but I hope it shows a simplified method of sending data from one core to another.

host:

size_t fp16_tile_size = 2 * 32 * 32;
size_t n_tiles = 4;
CoreCoord core0(0, 0);
CoreCoord core1(0, 1);
CircularBufferConfig cb_src0_config = CircularBufferConfig(
    n_tiles * fp16_tile_size,
    {{tt::CB::c_in0, tt::DataFormat::Float16_b}})
    .set_page_size(tt::CB::c_in0, fp16_tile_size);

auto core0_in0 = CreateCircularBuffer(program, core0, cb_src0_config);
auto core1_in0 = CreateCircularBuffer(program, core1, cb_src0_config);
auto fifo = CreateFIFO(core0_in0, core1_in0);

auto core0_sender = CreateKernel(..., "sender.cpp", ...);
auto core1_receiver = CreateKernel(..., "receiver.cpp", ...);
// sender.cpp
void kernel_main()
{
    constexpr uint32_t cb_in0 = tt::CB::c_in0;
    uint32_t receiver_x = get_arg_val<uint32_t>(0);
    uint32_t receiver_y = get_arg_val<uint32_t>(1);
    uint32_t tiles_send = get_arg_val<uint32_t>(2);

    for (uint32_t i = 0; i < tiles_send; i++) {
        cb_reserve_back(cb_in0, 1);
        uint32_t cb_in0_addr = get_write_ptr(cb_in0);
        // read data into cb0

        // proposed API: send one tile to the receiver core
        pipe_send_tile(receiver_x, receiver_y, cb_in0_addr);
        // Maybe a sync function here?
        cb_pop_front(cb_in0, 1);
    }
}
// receiver.cpp
void kernel_main()
{
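    constexpr uint32_t cb_in0 = tt::CB::c_in0;
    uint32_t tiles_send = get_arg_val<uint32_t>(1);

    for (uint32_t i = 0; i < tiles_send; i++) {
        // proposed API: block until a tile has arrived at the front of the CB
        pipe_ensure_front(cb_in0);
        // consume the data
        cb_pop_front(cb_in0, 1);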
    }
}
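
The intended semantics of the mocked kernels above can be sketched as a small host-side model in plain C++. This is only an illustration under assumptions: `PipeModel`, `try_send_tile`, `front_ready`, and `run_pipe_model` are hypothetical names, the FIFO depth stands in for the receiver's circular buffer size, and a real device kernel would block where this model returns `false`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical host-side model of the proposed pipe; not a tt-metal API.
using Tile = std::vector<std::uint16_t>;  // 32 * 32 16-bit values

class PipeModel {
public:
    explicit PipeModel(std::size_t depth_tiles) : depth_(depth_tiles) {
        assert(depth_ > 0);  // a zero-depth pipe could never make progress
    }

    // Models pipe_send_tile: fails when the receiver's buffer is full
    // (a device kernel would block here instead of returning false).
    bool try_send_tile(const Tile& tile) {
        if (fifo_.size() == depth_) return false;
        fifo_.push_back(tile);
        return true;
    }

    // Models pipe_ensure_front: true once a tile is available to consume.
    bool front_ready() const { return !fifo_.empty(); }

    const Tile& front() const {
        assert(front_ready());
        return fifo_.front();
    }

    // Models cb_pop_front on the receiver: frees one slot for the sender.
    void pop_front() {
        assert(front_ready());
        fifo_.pop_front();
    }

private:
    std::size_t depth_;
    std::deque<Tile> fifo_;
};

// Steps a sender and a receiver in turn; returns the first value of each tile received.
std::vector<std::uint16_t> run_pipe_model(std::uint32_t n_tiles, std::size_t depth) {
    PipeModel pipe(depth);
    std::vector<std::uint16_t> seen;
    std::uint32_t sent = 0;
    while (seen.size() < n_tiles) {
        // Sender "core": push tiles until the pipe is full or everything is sent.
        while (sent < n_tiles &&
               pipe.try_send_tile(Tile(32 * 32, static_cast<std::uint16_t>(sent)))) {
            ++sent;
        }
        // Receiver "core": consume one tile if one is ready.
        if (pipe.front_ready()) {
            seen.push_back(pipe.front()[0]);
            pipe.pop_front();
        }
    }
    return seen;
}
```

Note that neither side of the model touches a semaphore or an address on the other core; flow control lives entirely inside the pipe, which is the property the proposal is after.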

Describe alternatives you've considered

There could and should be a long discussion about the best solution. For this proposal, the key points are as follows:

  1. Kernel developers should not have to see semaphores or explicit DMA.
  2. The API should be simple. Raw DMA and semaphores remain available when needed.
  3. The API should be difficult to misuse.


jliangTT commented 6 months ago

Architectural questions: assigning to Davor / Jasmina. This was discussed in our meeting about the mailbox idea. I will list this as a nice-to-have for now, and we can bump it if the idea gains traction.