[Bug Report] TTI_SFPSTORE with instr_mod = 2 returns a tile with 512 bfloat16 values and 512 zeros

Describe the bug

While implementing the uniform operation in ttnn, I created an SFPU API on Wormhole that generates random numbers within a specific range [from, from + scale). The SFPU API consists of the set of Tensix instructions described below. For details, check this PR.

inline void rand(uint32_t from, uint32_t scale) {
    TT_SFPLOADI(p_sfpu::LREG1, 10, scale & 0xFFFF);
    TT_SFPLOADI(p_sfpu::LREG1, 8, scale >> 16);
    TT_SFPLOADI(p_sfpu::LREG2, 10, from & 0xFFFF);
    TT_SFPLOADI(p_sfpu::LREG2, 8, from >> 16);

#pragma GCC unroll 0
    for (int d = 0; d < 8; d++) {
        // Generate random float
        TTI_SFPMOV(0, 9, p_sfpu::LREG0, 8);

        // Unset sign bit and Set exponent to 127 to ensure the float is within the range [1, 2).
        // lreg0.sign = 0
        // lreg0 = {sign: 0, exponent: 127, mantissa: lreg0.mantissa }
        TTI_SFPSETSGN(0, p_sfpu::LREG0, p_sfpu::LREG0, 1);
        TTI_SFPSETEXP(127, p_sfpu::LREG0, p_sfpu::LREG0, 1);

        // -1 to ensure the float is within the range [0, 1).
        // lreg0 = lreg0 - 1
        TTI_SFPADDI(0xbf80 /*-1*/, p_sfpu::LREG0, 0);
        TTI_SFPNOP;

        // Scale the float from [0, 1) to [from, from + scale)
        // lreg0 = lreg0 * scale + from
        TTI_SFPMAD(p_sfpu::LREG0, p_sfpu::LREG1, p_sfpu::LREG2, p_sfpu::LREG0, 1);
        TTI_SFPNOP;

        TTI_SFPSTORE(0, 2, 3, 0);  // bfloat16, error
        // TTI_SFPSTORE(0, 3, 3, 0); // float32, works fine
        dst_reg++;
    }
}
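
The per-lane arithmetic in the loop above can be modeled on the host to show what values are expected before SFPSTORE runs. This is a minimal sketch, not device code: `uniform_from_bits` is a hypothetical helper, `raw_bits` stands in for the pseudo-random 32-bit value the SFPU produces, and `from` and `scale` are passed as floats rather than as the raw bit patterns the kernel receives.

```cpp
#include <cstdint>
#include <cstring>

// Host-side model of one lane of rand(); hypothetical helper, not part of tt-metal.
inline float uniform_from_bits(uint32_t raw_bits, float from, float scale) {
    // Keep the 23 mantissa bits, clear the sign bit, force the exponent field to 127.
    uint32_t bits = (raw_bits & 0x007FFFFFu) | (127u << 23);

    float x;
    std::memcpy(&x, &bits, sizeof(x));  // x is now in [1, 2)

    x -= 1.0f;                          // x is now in [0, 1)
    return x * scale + from;            // [from, from + scale), up to float rounding
}
```

For example, an all-zero mantissa maps to exactly `from`, and an all-ones mantissa maps to a value just below `from + scale`, so a correctly stored result should never exceed `from + scale`.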

I tried to support both random bfloat16 and float32 output by changing only the instr mod of TTI_SFPSTORE. I expected the instr mod to affect only how the float32 value in the SFPU LREG is converted (to bfloat16 or float32) when it is written to the Tensix core dst register.

Mod 3: stores float32 as float32. Everything works fine.
Mod 2: converts float32 to bfloat16. When I DPRINT the random tile, the first 512 elements are fine, but the remaining 512 elements are zeros.
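
For reference, the float32 to bfloat16 relationship that mod 2 is expected to apply can be sketched on the host. This sketch assumes plain truncation to the upper 16 bits; whether SFPSTORE truncates or rounds is not stated in this report, and both helper names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: keep the sign, exponent, and top 7 mantissa bits of a float32.
inline uint16_t float_to_bfloat16_trunc(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical helper: widen a bfloat16 bit pattern back to float32.
inline float bfloat16_to_float(uint16_t pattern) {
    uint32_t bits = static_cast<uint32_t>(pattern) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

Under this model, -1.0f maps to 0xBF80, the same immediate used for -1 in the TTI_SFPADDI call above. A value in [from, from + scale) should stay a non-zero, in-range number after conversion unless it was zero to begin with.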

To Reproduce

Steps to reproduce the behavior:

Check out this branch: https://github.com/tenstorrent/tt-metal/tree/issues/sfpstore-instr
Turn on DPRINT on core 0,0: export TT_METAL_DPRINT_CORES=0,0
Run the uniform pytest: pytest ./tests/ttnn/unit_tests/operations/test_uniform.py
Scroll up to see the DPRINT log from the writer kernel (it simply logs each value in a CB tile, which contains the values copied from the dst register): ttnn/cpp/ttnn/operations/uniform/device/kernels/writer_uniform.cpp

Expected behavior

TTI_SFPSTORE with mod 2 should let me generate a full tile of random bfloat16 values, just as mod 3 generates a full tile of float32 values.
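
Once the tile values are on the host, the symptom can be checked mechanically. This is a minimal sketch with a hypothetical helper name; it assumes the tile has already been converted to float32. A legitimate sample can be exactly 0.0 when from is 0, so it is most useful with a non-zero from.

```cpp
#include <cstddef>
#include <vector>

// Returns the index at which the trailing run of exact zeros begins.
// Equals tile.size() when the last element is non-zero.
inline std::size_t trailing_zero_start(const std::vector<float>& tile) {
    std::size_t i = tile.size();
    while (i > 0 && tile[i - 1] == 0.0f) {
        --i;
    }
    return i;
}
```

For the 1024-element tile described in this report, the mod 2 result would give 512, while a fully populated tile would give 1024.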

Screenshots

DPRINT log of a tile when generating random bfloat16 values.

Please complete the following environment information:

OS: Ubuntu 20.04.
Device: Wormhole.
Version of software: branch

Additional context

I also tried SFPSTORE with instr mod 0. With mod 0, I get a full tile for both bfloat16 and float32. However, there are two issues:

The number of distinct elements drops: only about 300 distinct values across a tensor of shape [512, 512].
Some random numbers are greater than from + scale.

I am noting this here because I don't know whether it is expected behavior or a bug. Both issues can be reproduced simply by changing the instr mod in TTI_SFPSTORE to 0.
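
Both observations can be quantified on the host with a small check over the output tensor. This is a sketch under stated assumptions: `UniformStats` and `check_uniform` are hypothetical names, and the tensor is assumed to have been read back as a flat buffer of float32 values.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_set>
#include <vector>

// Hypothetical summary of a generated tensor.
struct UniformStats {
    std::size_t distinct;      // number of distinct float32 bit patterns
    std::size_t out_of_range;  // values outside [from, from + scale)
};

inline UniformStats check_uniform(const std::vector<float>& values, float from, float scale) {
    std::unordered_set<uint32_t> seen;
    std::size_t out_of_range = 0;
    for (float v : values) {
        uint32_t bits;
        std::memcpy(&bits, &v, sizeof(bits));
        seen.insert(bits);  // compare by bit pattern so every distinct encoding counts
        if (!(v >= from && v < from + scale)) {
            ++out_of_range;  // also counts NaN, which fails both comparisons
        }
    }
    return {seen.size(), out_of_range};
}
```

On a [512, 512] tensor, a `distinct` count near 300 and a non-zero `out_of_range` count would match the two issues listed above.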

tenstorrent/tt-metal issue #14213