tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
460 stars 69 forks source link

Additional compute kernel cleanup #7736

Open rtawfik01 opened 6 months ago

rtawfik01 commented 6 months ago

Clean up items for compute kernels

The approximate flag for sfpu kernels has variations where some kernels call the flag as a runtime parameter:

ALWI void gelu_tile_init(bool fast_and_approx=true) {
    if (fast_and_approx) {
        MATH(( llk_math_eltwise_unary_sfpu_gelu_init<true>() ));
    } else {
        MATH(( llk_math_eltwise_unary_sfpu_gelu_init<false>() ));
    }
}

or

ALWI void exp_tile(uint32_t idst, bool fast_and_approx=false) {
    if ( fast_and_approx )
        MATH(( llk_math_eltwise_unary_sfpu_exponential<true>(idst) ));
    else
        MATH(( llk_math_eltwise_unary_sfpu_exponential<false>(idst) ));
 }

while other sfpu kernels have it as compile time APPROX flag, generated in the defines:

ALWI void recip_tile(uint32_t idst) {
    MATH(( llk_math_eltwise_unary_sfpu_reciprocal<APPROX>(idst) ));
}

Is there a reason we have variations? is it required to change approximate flag during runtime?

I think we should have all the sfpu functions use the APPROX compile time flag if possible. And if users need to have multiple ops, with a mix of approx modes, then approx needs to be templated for those ops instead. Moreover, the above is prone to bugs, because the functions that have approx as a runtime variable, also have APPROX= false generated in the compute kernel defines, but the flag is set to true for the reader/writer kernels. This might cause problems while debugging.

Add template fix, and removed hard-coded init approx here: #8004

The reduce ops in the api kernel:tt_metal/include/compute_kernel_api/reduce.h have reduce_op and dim sent as function parameters, but they are completely ignored, and the REDUCE_OP and REDUCE_DIM defines are used instead:

ALWI void reduce_init_short(PoolType reduce_op, ReduceDim dim, uint32_t icb, uint32_t icb_scaler, uint32_t ocb = 16) {

    UNPACK(( llk_unpack_AB_reduce_init<REDUCE_DIM>(icb, icb_scaler) ));
    MATH(( llk_math_reduce_init<REDUCE_OP, REDUCE_DIM, MATH_FIDELITY>() ));
    PACK(( llk_pack_reduce_config_v2<REDUCE_DIM, false, false, DST_ACCUM_MODE>(ocb) ));
}

The first part of the cleanup has been merged here: https://github.com/tenstorrent/tt-metal/pull/7585

But the init functions:

template<bool at_start>
ALWI void reduce_init(PoolType reduce_op, ReduceDim dim, uint32_t icb, uint32_t icb_scaler, uint32_t ocb = 16)
{
    UNPACK(( llk_setup_operands() ));
    UNPACK(( llk_unpack_AB_hw_configure_disaggregated<DST_ACCUM_MODE>(icb, icb_scaler) ));
    UNPACK(( llk_unpack_AB_reduce_init<REDUCE_DIM>(icb, icb_scaler) ));

    MATH(( llk_math_reduce_init<REDUCE_OP, REDUCE_DIM, MATH_FIDELITY>() ));
    MATH(( llk_math_pack_sync_init<DST_ACCUM_MODE>() ));

    PACK(( llk_pack_init() ));
    PACK(( llk_pack_reduce_config_v2<REDUCE_DIM, at_start, false, DST_ACCUM_MODE>(ocb) ));
    PACK(( llk_setup_outputs() ));
    PACK(( llk_pack_dest_init<false, DST_ACCUM_MODE>() ));
}

// Delta from binary_op_init_common
template<bool at_start>
ALWI void reduce_init_delta(PoolType reduce_op, ReduceDim dim, uint32_t ocb = 16, uint32_t icb0 = 0, uint32_t icb1 = 1)
{
    // FIXME: API Update needed in compute kernel?
    UNPACK(( llk_unpack_AB_reduce_init<REDUCE_DIM>(icb0, icb1) ));

    MATH(( llk_math_reduce_init<REDUCE_OP, REDUCE_DIM, MATH_FIDELITY>() ));

    PACK(( llk_pack_reduce_config_v2<REDUCE_DIM, at_start, false, DST_ACCUM_MODE>(ocb) ));
}

still have this problem, and are called in a lot of test files. The defines have been templated instead (for users that need to call multiple reduce ops in the same kernel), and the remaining item is PoolType reduce_op, ReduceDim dim, parameters need to be removed from all the test files.

This include:

#include "noc_nonblocking_api.h"

was included in ckernel_sfpu.h when it included all sfpu functions, because the dropout function needed it. But now that sfpu/ckernel_sfpu_*.h functions are all separated, that include should be removed from all functions that don't need it. For example in sfpu/ckernel_sfpu_abs.h

#pragma once

#include "ckernel.h"
#include "ckernel_defs.h"
#include "noc_nonblocking_api.h"

using namespace sfpi;

namespace ckernel {
namespace sfpu {

template <bool APPROXIMATION_MODE, int ITERATIONS = 8>
inline void calculate_abs()
{
    // SFPU microcode
    for (int d = 0; d < ITERATIONS; d++)
    {
        vFloat v = dst_reg[0];
        dst_reg[0] = sfpi::abs(v);
        dst_reg++;
    }
}

The include is unused.

In addition, it should also be removed from ckernel_sfpu_dropout.h. And this functionality that dropout needs:

    FWLOG1("calculate_dropout() -- input seed:%x", p2);

    uint32_t noc_id_reg = NOC_CMD_BUF_READ_REG(0, 0, NOC_NODE_ID);

    uint16_t my_x = noc_id_reg & NOC_NODE_ID_MASK;
    uint16_t my_y = (noc_id_reg >> NOC_ADDR_NODE_ID_BITS) & NOC_NODE_ID_MASK;

    uint16_t per_tensix_input_seed = p2 ^ (my_x << my_y);

    FWLOG1("calculate_dropout() -- calculated seed:%x", per_tensix_input_seed);

Should either be pulled into the tt-llk-* submodules, or another way should be found. Ckernel files should not be including files that are in the higher level apis. This can also cleanup up the includes in the jit_build

rtawfik01 commented 6 months ago

@ttmtrajkovic fyi

rtawfik01 commented 3 months ago

Merged for task 2: https://github.com/tenstorrent/tt-metal/pull/10060