Unguarded Ethernet Context Switches Negatively Impact CCL Performance - Particularly Device Runtime Variance

SeanNijjar commented 3 weeks ago

Currently, the erisc.cc erisc kernel wrapper code performs (rightfully) context switches to the lower level routing FW while waiting for work.

However, the current implementation will always context switch as soon as it sees there is no run message for the core.

    while (routing_info->routing_enabled) {
        // FD: assume that no more host -> remote writes are pending
        if (mailboxes->launch.go.run == RUN_MSG_GO) {
            DeviceZoneScopedMainN("ERISC-FW");
            DeviceZoneSetCounter(mailboxes->launch.kernel_config.host_assigned_id);

            firmware_config_init(mailboxes, ProgrammableCoreType::ACTIVE_ETH, DISPATCH_CLASS_ETH_DM0);

            DEBUG_STATUS("R");
            kernel_init();
        } else {
            internal_::risc_context_switch();
        }
    }
    internal_::disable_erisc_app();

This is pretty massively detrimental to erisc user perf (and CCL perf, more generally) as it means that there is a very high likelihood of multiple cores context switching across the op on startup. When aggregated across the entire op, this effect compounds and can lead to many thousands of cycles wasted at startup. Additionally, it introduces large startup skew between eriscs on the same core.

Overall, after prototyping a non-mergeable change that introduces an idle timer before context switching in this code block, I was able to massively reduce CCL cycle times, especially for smaller invocations: 2-3 thousand cycles on the low end (for minimum times) and many thousands on the high end (max cycle count).

e.g. example of experimental change:

    uint32_t idle_count = 0;                         // new
    constexpr uint32_t max_idle_count = 1000000000;  // new
    while (routing_info->routing_enabled) {
        // FD: assume that no more host -> remote writes are pending
        if (mailboxes->launch.go.run == RUN_MSG_GO) {
            DeviceZoneScopedMainN("ERISC-FW");
            DeviceZoneSetCounter(mailboxes->launch.kernel_config.host_assigned_id);

            firmware_config_init(mailboxes, ProgrammableCoreType::ACTIVE_ETH, DISPATCH_CLASS_ETH_DM0);

            DEBUG_STATUS("R");
            kernel_init();
            idle_count = 0;                                  // new
        } else {
            if (idle_count > max_idle_count) {  // new
                idle_count = 0;                 // new
                internal_::risc_context_switch();
            } else {                            // new
                idle_count++;                   // new
            }                                   // new
        }

    }
    internal_::disable_erisc_app();

Note that you can't naively add an always-active idle counter, because that will block the ethernet FW routing from working correctly. Instead, we should only enable this after we are absolutely sure the full FD path has been brought up. Additionally, we may wish to expose some low-level FW signals (e.g. work available) to this level of FW so it knows when it should context switch.
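To make the gating idea above concrete, here is a minimal host-compilable sketch, not the actual erisc.cc change. `IdleGuard`, `fd_ready`, and `routing_work_pending` are hypothetical names introduced only for illustration. The sketch assumes the FW could expose one flag meaning "the full FD path is up" and another meaning "the routing FW has work available". Neither signal exists at this level today as far as this issue describes.

```cpp
#include <cstdint>

// Hypothetical helper (not part of tt-metal): decides whether an idle erisc
// should yield to the routing FW on this poll.
struct IdleGuard {
    uint32_t max_idle_count;  // idle polls tolerated before yielding
    uint32_t idle_count;      // idle polls seen since the last yield or kernel run

    explicit IdleGuard(uint32_t max_idle) : max_idle_count(max_idle), idle_count(0) {}

    // Call after a run message has been serviced.
    void on_work() { idle_count = 0; }

    // Call on each poll that finds no run message.
    // fd_ready:             ASSUMED flag, true once the full FD path is up.
    // routing_work_pending: ASSUMED flag, true when the routing FW has work.
    // Returns true when the caller should context switch now.
    bool on_idle(bool fd_ready, bool routing_work_pending) {
        if (!fd_ready || routing_work_pending) {
            // Keep the existing behaviour: yield immediately so that
            // ethernet routing is never starved.
            idle_count = 0;
            return true;
        }
        if (idle_count >= max_idle_count) {
            // Idle budget exhausted: yield once and start counting again.
            idle_count = 0;
            return true;
        }
        ++idle_count;
        return false;
    }
};
```

In the wrapper loop, the `else` branch would then call `internal_::risc_context_switch()` only when `on_idle(...)` returns true, and `on_work()` would be called after `kernel_init()` returns. Falling back to the immediate switch whenever `fd_ready` is false preserves today's behaviour during bring-up.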

SeanNijjar commented 3 weeks ago

@davorchap - assigned to you to reassign, since @aliuTT is out for now.

SeanNijjar commented 3 weeks ago

For now I've decided to revisit the priority of this issue, as I've done a bit of investigation and found a bigger potential culprit for perf swings. I've also been unable to reproduce the massive gains I saw the other day from getting rid of this, though I still see some noticeable improvement:

- 500 ns to 1 us lower min device times
- More importantly, the min-max cycle count range on a given chip is reduced by several microseconds on average
  - this nearly halves the difference between max and min device times of the op on a given chip

tenstorrent / tt-metal

Unguarded Ethernet Context Switches Negatively Impact CCL Performance - Particularly Device Runtime Variance #12395