tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Bug Report] TTI_SFPSTORE with instr_mod = 2 returns a tile with 512 bfloat16 values and 512 zero values #14213

Open BuiChiTrung opened 3 weeks ago

BuiChiTrung commented 3 weeks ago

Describe the bug During the implementation of the uniform operation in ttnn, I created an SFPU API on Wormhole to generate random numbers within a specific range [from, from + scale]. The SFPU API consists of a set of Tensix instructions, as described below. For details, check this PR.

inline void rand(uint32_t from, uint32_t scale) {
    TT_SFPLOADI(p_sfpu::LREG1, 10, scale & 0xFFFF);
    TT_SFPLOADI(p_sfpu::LREG1, 8, scale >> 16);
    TT_SFPLOADI(p_sfpu::LREG2, 10, from & 0xFFFF);
    TT_SFPLOADI(p_sfpu::LREG2, 8, from >> 16);

#pragma GCC unroll 0
    for (int d = 0; d < 8; d++) {
        // Generate random float
        TTI_SFPMOV(0, 9, p_sfpu::LREG0, 8);

        // Unset sign bit and Set exponent to 127 to ensure the float is within the range [1, 2).
        // lreg0.sign = 0
        // lreg0 = {sign: 0, exponent: 127, mantissa: lreg0.mantissa }
        TTI_SFPSETSGN(0, p_sfpu::LREG0, p_sfpu::LREG0, 1);
        TTI_SFPSETEXP(127, p_sfpu::LREG0, p_sfpu::LREG0, 1);

        // -1 to ensure the float is within the range [0, 1).
        // lreg0 = lreg0 - 1
        TTI_SFPADDI(0xbf80 /*-1*/, p_sfpu::LREG0, 0);
        TTI_SFPNOP;

        // Scale the float from [0, 1) to [from, from + scale)
        // lreg0 = lreg0 * scale + from
        TTI_SFPMAD(p_sfpu::LREG0, p_sfpu::LREG1, p_sfpu::LREG2, p_sfpu::LREG0, 1);
        TTI_SFPNOP;

        TTI_SFPSTORE(0, 2, 3, 0);  // bfloat16, error
        // TTI_SFPSTORE(0, 3, 3, 0); // float32, works fine
        dst_reg++;
    }
}
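
The per-lane arithmetic in the routine above can be modelled on the host to check the intended range. This is a minimal sketch, not device code: `bits_to_uniform` and `random_bits` are hypothetical names, and `random_bits` stands in for whatever the SFPU PRNG writes into LREG0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side model of one lane of rand(); assumes IEEE-754 float32.
inline float bits_to_uniform(uint32_t random_bits, float from, float scale) {
    // Keep the 23 mantissa bits, clear the sign, force the biased exponent to 127.
    uint32_t u = (random_bits & 0x007FFFFFu) | (127u << 23);
    float f;
    std::memcpy(&f, &u, sizeof f);  // f is now in [1, 2)
    assert(f >= 1.0f && f < 2.0f);
    f -= 1.0f;                      // f is now in [0, 1)
    return f * scale + from;        // nominally [from, from + scale)
}
```

The final multiply-add is itself rounded to float32, so the half-open upper bound holds only up to that rounding.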

I tried to support generating both random bfloat16 and float32 values by changing only the instr mod of TTI_SFPSTORE. I expected that changing the instr mod would only change how the float32 value in the SFPU LREG is converted to bfloat16/float32 in the Tensix core dst register.

To Reproduce Steps to reproduce the behavior:

  1. Checkout this branch: https://github.com/tenstorrent/tt-metal/tree/issues/sfpstore-instr
  2. Turn on DPRINT on core 0,0: export TT_METAL_DPRINT_CORES=0,0
  3. Run uniform pytest: pytest ./tests/ttnn/unit_tests/operations/test_uniform.py
  4. Scroll up to see the DPRINT log from the writer kernel, which simply logs each value in a CB tile containing the values copied from the dst reg: ttnn/cpp/ttnn/operations/uniform/device/kernels/writer_uniform.cpp

Expected behavior TTI_SFPSTORE with mod 2 should generate a full tile of random bfloat16 values, just as mod 3 generates a full tile of float32 values.

Screenshots

DPRINT log of a tile when generating random bfloat16 values:

image


Additional context I also tried SFPSTORE with instr mod 0. With mod 0, I get a full tile for both bfloat16 and float32. However, there are 2 issues:

  1. The number of distinct elements drops: there are only about 300 distinct values across a tensor of shape [512, 512].
  2. Some random numbers can be greater than from + scale.

I am just noting this here, as I don't know whether it is expected behavior or a bug. Both issues can be reproduced simply by changing the instr mod in TTI_SFPSTORE to 0.
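
For reference when reading the two issues above, here is a host-side sketch of the two usual ways a float32 is narrowed to bfloat16: truncation and round-to-nearest-even. Which of these (if either) SFPSTORE applies in each instr mod is not stated in this thread, so treat both helpers, and their names, as illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

inline uint32_t f32_bits(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
inline float bits_f32(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// Drop the low 16 bits: the result never exceeds the input in magnitude.
inline float to_bf16_truncate(float f) {
    return bits_f32(f32_bits(f) & 0xFFFF0000u);
}

// Round to nearest, ties to even: the result can be larger than the input.
inline float to_bf16_round(float f) {
    assert(!std::isnan(f));           // NaN handling is omitted in this sketch
    uint32_t u = f32_bits(f);
    u += 0x7FFFu + ((u >> 16) & 1u);  // add the rounding bias, then truncate
    return bits_f32(u & 0xFFFF0000u);
}
```

With rounding, a float32 just below a bound can land exactly on it: `to_bf16_round(0.999999f)` yields 1.0f, while `to_bf16_truncate(0.999999f)` yields 0.99609375f. A bfloat16 also stores only 7 mantissa bits, so it can represent at most 128 distinct values per power-of-two interval.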

amahmudTT commented 3 weeks ago

As per discussion:

The upper bound to is reached when it is positive. If it were reached only with a negative range, the cause could be trivial, so this needs to be investigated. The SFPU API creates a random number in the range [from, from + scale]. To make sure that the generated number is < to (from + scale), the uniform program factory subtracts an epsilon of 1e-6 from to:

const float eps = 1e-6;
union {
    float f;
    uint32_t u;
} f2u_from, f2u_to;
f2u_from.f = operation_attributes.from;
f2u_to.f = operation_attributes.to - eps;  // -eps make sure that generated number is < operation_attributes.to

But with mod 0, to is still reached.
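
One float32 property worth checking while investigating: a fixed epsilon of 1e-6 is smaller than half the spacing between adjacent float32 values once the bound is large enough, so the subtraction can leave the bound unchanged. The sketch below is a host-side illustration under that assumption, not a confirmed diagnosis of the mod 0 behavior; `eps_changes_bound` and `largest_below` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// True when subtracting eps actually moves the float32 bound.
inline bool eps_changes_bound(float to, float eps = 1e-6f) {
    assert(eps > 0.0f);
    const float shifted = to - eps;  // rounded back to float32
    return shifted < to;
}

// Largest float32 strictly below `to`, independent of its magnitude.
inline float largest_below(float to) {
    return std::nextafter(to, -std::numeric_limits<float>::infinity());
}
```

For example, `eps_changes_bound(1.0f)` is true, but `eps_changes_bound(100.0f)` is false because adjacent float32 values near 100 are about 7.6e-6 apart. `largest_below(100.0f)` still returns a value strictly below 100.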