tenstorrent / tt-metal


Test Infra: enable arbitrary/random delay introduction to noc calls (semaphore_inc, async write, async read, etc.) #6303

Open SeanNijjar opened 8 months ago

SeanNijjar commented 8 months ago

To expand test variability and increase the likelihood of catching hangs during testing (particularly when running determinism tests), allow the NOC APIs to introduce artificial delays under the hood. These delays should be lightly configurable from the host side.

For example, the host could provide a fixed delay value, a delay per API entrypoint, or a small set of random delays (maybe it can store this sequence of delays in L1 and loop through it over time).
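A rough host-side sketch of the options above, assuming a fixed table of 32 entries. Every name here (`DelayConfig`, `DelayMode`, `build_delay_table`) is hypothetical and not an existing tt-metal API, and copying the table into L1 is deliberately left out:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical names throughout; nothing here is an existing tt-metal API.
constexpr std::size_t kDelayTableSize = 32;
using DelayTable = std::array<uint32_t, kDelayTableSize>;

enum class DelayMode { Fixed, RandomSequence };

struct DelayConfig {
    DelayMode mode = DelayMode::Fixed;
    uint32_t fixed_iterations = 0;  // used when mode == Fixed
    uint32_t max_iterations = 0;    // inclusive upper bound when mode == RandomSequence
    uint32_t seed = 0;              // recorded so a failing run can be replayed
};

// Builds the sequence of delay-loop iteration counts that the device would cycle through.
inline DelayTable build_delay_table(const DelayConfig& cfg) {
    DelayTable table{};
    if (cfg.mode == DelayMode::Fixed) {
        table.fill(cfg.fixed_iterations);
        return table;
    }
    std::mt19937 rng(cfg.seed);
    std::uniform_int_distribution<uint32_t> dist(0, cfg.max_iterations);
    for (auto& d : table) {
        d = dist(rng);
    }
    return table;
}
```

A per-entrypoint delay could be one `DelayConfig` (and one table) per NOC API. Seeding the generator explicitly means a run that exposes a hang can be repeated with the same delay sequence.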

Here's an example to convey the idea (note that I put the delay at the start, but I think there are use cases for having it at both the beginning and the end of the function):

#include <array>
#include <cstdint>

static uint32_t i = 0;  // could be shared across all NOC API calls that need random delays; for worker cores, needs to be thread-safe
constexpr uint32_t rand_delay_list_size = 32;
std::array<uint32_t, rand_delay_list_size> rand_delays;  // can be populated by host

inline
void noc_semaphore_inc(uint64_t addr, uint32_t incr) {
    #ifdef SYNTHETIC_DELAYS
    uint32_t delay = rand_delays[i];
    i = (i + 1) % rand_delay_list_size;  // advance to the next delay, wrapping at the end of the list
    for (uint32_t j = 0; j < delay; j++) {
      asm volatile("");  // empty asm statement keeps the compiler from optimizing the loop away
    }
    #endif
    noc_fast_atomic_increment(noc_index, NCRISC_AT_CMD_BUF, addr, NOC_UNICAST_WRITE_VC, incr, 31 /*wrap*/, false /*linked*/);
}
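
For the "beginning and end" variant, one possible shape is a small wrapper that spins before and after the wrapped call, so each NOC API does not need its own copy of the loop. This is only a sketch: `DelayCursor`, `spin`, and `with_synthetic_delays` are hypothetical names, the table size of 32 is carried over from the example above, and the thread-safety concern for worker cores is not addressed here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of tt-metal: cycles through a host-populated delay table.
struct DelayCursor {
    std::array<uint32_t, 32> delays{};
    std::size_t next = 0;

    // Returns the next delay and advances, wrapping at the end of the table.
    uint32_t take() {
        uint32_t d = delays[next];
        next = (next + 1) % delays.size();
        return d;
    }
};

// Busy-waits for the given number of loop iterations and returns how many were spun.
inline uint32_t spin(uint32_t iterations) {
    uint32_t spun = 0;
    for (uint32_t j = 0; j < iterations; j++) {
        asm volatile("");  // empty asm keeps the compiler from removing the loop (GCC/Clang syntax)
        ++spun;
    }
    return spun;
}

// Runs `call` with one synthetic delay before it and one after it.
template <typename NocCall>
inline void with_synthetic_delays(DelayCursor& cursor, NocCall&& call) {
    spin(cursor.take());  // delay before the call is issued
    call();
    spin(cursor.take());  // delay after the call returns
}
```

With something like this, the body of `noc_semaphore_inc` could pass the `noc_fast_atomic_increment` call as a lambda, and the whole wrapper could compile down to a plain call when `SYNTHETIC_DELAYS` is not defined.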
SeanNijjar commented 8 months ago

FYI @jliangTT @pgkeller - not sure where the right ownership is for this, but I figured you guys would be a good starting point. This is for improved testing methodology, but it requires some lower-level improvements to get the benefit.

jliangTT commented 7 months ago

I don't really know which project board to add this to, but multi-device seems to be a good place for this to start.