tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Question regarding CB's support of integer data formats (UInt16, UInt32) #4624

Closed dongjin-na closed 6 months ago

dongjin-na commented 9 months ago

I'm wondering whether the current tt-metal supports integer data formats for CBs (UInt16, UInt32, etc.). I modified test_eltwise_unary.cpp and the related kernels, then tested.

I made a simple test.

  1. Create input and output CBs with tt::DataFormat::UInt16 or tt::DataFormat::Float16_b.
  2. Create a tile of random values.
  3. The reader kernel reads one tile from DRAM into the c_in0 CB.
  4. The compute kernel just copies this tile from the c_in0 CB to the c_out0 CB.
  5. The writer kernel writes one tile from the c_out0 CB to DRAM.
  6. Check the output buffer.

Result

Is there something I set up wrong or missed?

Here are the modified parts of test_eltwise_unary.cpp and the eltwise_sfpu.cpp kernel.

// Host code
SfpuConfig test_config = {
    .num_tiles = num_tiles,
    .tile_byte_size = 1 * 32 * 32 * 2,
    .l1_input_data_format = tt::DataFormat::UInt16,
    .l1_output_data_format = tt::DataFormat::UInt16,
    .cores = core_range_set,
    .sfpu_op = sfpu_op,
    .approx_mode = false
};

// ......
tt_metal::CircularBufferConfig l1_input_cb_config =
    tt_metal::CircularBufferConfig(byte_size, {{0, test_config.l1_input_data_format}})
        .set_page_size(0, test_config.tile_byte_size);
auto l1_input_cb = tt_metal::CreateCircularBuffer(program, core_range, l1_input_cb_config);

tt_metal::CircularBufferConfig l1_output_cb_config =
    tt_metal::CircularBufferConfig(byte_size, {{16, test_config.l1_output_data_format}})
        .set_page_size(16, test_config.tile_byte_size);
auto l1_output_cb = tt_metal::CreateCircularBuffer(program, core_range, l1_output_cb_config);

// compute kernel code
    init_sfpu(tt::CB::c_in0);
    for (uint32_t block_index = 0; block_index < per_core_block_cnt; block_index++) {
        cb_reserve_back(tt::CB::c_out0, per_core_block_dim);
        for(uint32_t tile_index = 0; tile_index < per_core_block_dim; ++tile_index) {
            acquire_dst(tt::DstMode::Half);

            // Pop tile after tile, copy to DST and pack
            cb_wait_front(tt::CB::c_in0, 1);
            copy_tile_init();
            copy_tile(tt::CB::c_in0, 0, 0);
            pack_tile(0, tt::CB::c_out0);
            cb_pop_front(tt::CB::c_in0, 1);
            release_dst(tt::DstMode::Half);
        }
        cb_push_back(tt::CB::c_out0, per_core_block_dim);
    }
    kernel_profiler::mark_time(9998);

Target : WH_B0

jliangTT commented 9 months ago

More context: this is part of an attempt to implement a suitable RNG for the dropout operation (utilizing the integer arithmetic in the kernel).

jliangTT commented 9 months ago

I will categorize this as P2 for now, given that it is associated with a new feature (dropout) that still needs to be discussed.

jliangTT commented 8 months ago

Please provide a simple example that we can reproduce and debug.

davorchap commented 8 months ago
yugaoTT commented 8 months ago

I tested UInt16 and UInt32 as inputs; the output is a mix of zeros and garbage. The kernel and host code look fine, so the issue is probably within the LLKs.

yugaoTT commented 8 months ago

On WH_B0, directly loading UInt16 data into the SRC registers is not supported, as the FPU supports only the (u)int8 data format. Instead, we can load UInt16 into the DST registers and run SFPU integer ops. In this mode, DST has to be set up as int32. The packer can then pack int32 or int8 output.
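
For illustration, a minimal host-side sketch of what "DST has to be set up as int32" implies, assuming the tt_metal::ComputeConfig API; the kernel path and argument names are placeholders, not from this thread:

// Compile the compute kernel with 32-bit DST mode enabled, so SFPU integer
// ops see 32-bit datums in DST (path and args are illustrative).
auto compute_kernel_id = tt_metal::CreateKernel(
    program,
    "kernels/compute/eltwise_sfpu.cpp",   // placeholder kernel path
    core_range,
    tt_metal::ComputeConfig{
        .fp32_dest_acc_en = true,         // DST stores 32-bit datums
        .compile_args = compute_kernel_args});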

dongjin-na commented 8 months ago

Dear @yugaoTT,

Instead, we can load UInt16 into the DST registers and run SFPU integer ops. In this mode, DST has to be set up as int32.

Could you share an example?

ttmtrajkovic commented 8 months ago

hey @yugaoTT, @mywoodstock,

With the latest uplift of the wormhole LLKs and my changes in data_format.cpp, we should be able to do datacopy/move and SFPU ops with uint16. Is there a test that I can use to try this?

jliangTT commented 8 months ago

While we wait for @yugaoTT and @mywoodstock to comment, a good reference point is this PR from Moreh: https://github.com/tenstorrent-metal/tt-metal/pull/4964

yugaoTT commented 8 months ago

Yes, as @jliangTT mentioned, that is the one I used to test out UInt16.

ttmtrajkovic commented 7 months ago

UInt16 support for moving data through the tensix core should be working. However, in order for SFPU operations to use uint16, it would require sfpi compiler support to recognize uint16 as a separate type and set the instr_mode of the SFPLOAD and SFPSTORE instructions to 6. @pgkeller, is it possible to recognize uint16 as a type through sfpi currently?

Milos

pgkeller commented 7 months ago

No, not at present. Design for other data types is in progress.

jliangTT commented 7 months ago

Probably a stupid question...

UInt16 support for moving data through the tensix core should be working. However, in order for SFPU operations to use uint16, it would require sfpi compiler support

So is this still useful, @razorback3, at the current level of support?

nadongguri commented 7 months ago

Dear @jliangTT, to implement a random state, we need operators that support the UInt32 data type, as follows.

razorback3 commented 7 months ago

Probably a stupid question...

UInt16 support for moving data through the tensix core should be working. However, in order for SFPU operations to use uint16, it would require sfpi compiler support

So is this still useful, @razorback3, at the current level of support?

No 😢. Please refer to the above answer from @nadongguri.

ttmtrajkovic commented 7 months ago

The SFPU doesn't recognize the uint32 data format, but it does recognize int32, and all the operations you listed are possible with the SFPU.

The initial request on this thread was to support UInt16, so it really has to be broken down into a couple of requests:

  1. UInt16 support throughout the tt-metal stack, including compute kernels - completed, waiting to be merged
  2. UInt16 support in the SFPU - in progress
  3. Int32 support throughout the tt-metal stack, including compute kernels - not started
  4. Int32 support in the SFPU - completed

Milos

dongjin-na commented 7 months ago

@ttmtrajkovic , thanks for the information.

I have a question.

The SFPU doesn't recognize the uint32 data format

Is the reason the SFPU doesn't recognize the uint32 data format that the hardware does not support it?

This issue was created in the context of the dropout op implementation. UInt16 support is needed to improve TT's dropout op, and UInt32 support is needed to implement a new dropout op.

Through discussions, we decided to implement a new dropout op, so UInt32 support is really needed.

ttmtrajkovic commented 7 months ago

The SFPU hardware doesn't recognize unsigned integer formats, so once the SFPU starts operating on data, everything is signed int32. You can interpret the data as 2's-complement signed or sign+magnitude (the default).

dongjin-na commented 7 months ago

The SFPU hardware doesn't recognize unsigned integer formats, so once the SFPU starts operating on data, everything is signed int32. You can interpret the data as 2's-complement signed or sign+magnitude (the default).

Thanks for the check. I think using int32 with those operations to implement a random state may be possible instead of uint32. As mentioned above, is Int32 support in tt-metal planned after the UInt16 support is completed?

ilkoo-lee commented 7 months ago

@ttmtrajkovic Additionally, UINT32 data type handling is not implemented in dataflow_api.h (GET_L1_TILE_SIZE and MUL_WITH_TILE_SIZE), which is required for the index-based nll_loss implementation. (https://github.com/tenstorrent-metal/tt-metal/issues/5633)

Currently, we are using InterleavedAddrGen + get_noc_addr + noc_async_read as a workaround instead of InterleavedAddrGenFast + noc_async_read_tile; see the sketch below. (https://github.com/tenstorrent-metal/tt-metal/pull/5634)
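
For reference, a minimal sketch of that workaround in a reader kernel; src_addr, page_bytes, and num_tiles stand in for the actual runtime arguments:

// Workaround: generic InterleavedAddrGen + get_noc_addr + noc_async_read,
// reading one full uint32 tile page at a time into the input CB.
constexpr bool src_is_dram = true;
const InterleavedAddrGen<src_is_dram> src_gen = {
    .bank_base_address = src_addr,   // base address of the interleaved DRAM buffer
    .page_size = page_bytes};        // size of one uint32 tile page in bytes
for (uint32_t i = 0; i < num_tiles; ++i) {
    cb_reserve_back(tt::CB::c_in0, 1);
    uint64_t noc_addr = get_noc_addr(i, src_gen);            // generic address calculation
    noc_async_read(noc_addr, get_write_ptr(tt::CB::c_in0), page_bytes);
    noc_async_read_barrier();
    cb_push_back(tt::CB::c_in0, 1);
}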

ttmtrajkovic commented 7 months ago

Thanks. I will address the support in dataflow_api.h if I haven't already. Let me check tomorrow.

Could you also summarize all the requirements with regard to integer support? The hardware can only support Int32, so I assume that target indices should also be represented as int32.

A few other questions:

thanks

Milos

jliangTT commented 7 months ago

Also followed up with @razorback3 in slack and hopefully this adds more clarity:

  1. We need to implement dropout for LLMs.
  2. Dropout requires random number generation.
  3. TT's original random number generator is not sufficient.
  4. Both teams agreed that Moreh would develop its own random number generator for dropout.
  5. Moreh's implementation requires uint32 arithmetic.
  6. Milos said the hardware does not support uint32 but supports int32.
  7. Moreh changed its plan and will implement the random number generator using int32.
  8. Currently, we are waiting for int32 software support in Metal. Once it is delivered, I think there should be no more blockers for the dropout implementation.

@ttmtrajkovic, do you need @razorback3 to break down the requirements of int32 support further?

ttmtrajkovic commented 7 months ago

Thanks, everyone. The Int32/UInt32 support needed for the dropout implementation is clear to me. I will work on adding the correct APIs to: 1) copy data, and 2) pack data. 3) Bitwise ops support in the SFPU already works.

I will fix dataflow_api.h to support int32 and UInt32, but I am not clear on how a target index in int32 is going to be used. The Int32 format in tensix hardware is in sign+magnitude form, so if you use such a value as an operand in RISC-V code, it won't work out of the box; you'd need to convert it to 2's complement, as in the sketch below.
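
For illustration, a scalar sketch of that conversion (a hypothetical helper, not an existing tt-metal API):

// Convert a 32-bit sign+magnitude value (as tensix stores Int32) into the
// 2's-complement form that RISC-V arithmetic expects.
inline int32_t sign_magnitude_to_twos_complement(uint32_t raw) {
    const uint32_t magnitude = raw & 0x7FFFFFFFu;   // low 31 bits
    return (raw & 0x80000000u) ? -static_cast<int32_t>(magnitude)
                               : static_cast<int32_t>(magnitude);
}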

The HW is a bit tricky when it comes to manipulating raw integer numbers. I will also think about how to enable UInt32, just for the purpose of SFPU bitwise operations.

Milos

jliangTT commented 7 months ago

Hey @ttmtrajkovic, do we have an update here?

jliangTT commented 7 months ago

Spoke to @ttmtrajkovic offline. The current status: UInt16 is supported, UInt32 not yet. The uint32/int32 support is planned for next week.

dongjin-na commented 6 months ago

Hi, is it possible to share any updates on the progress of int32 support? :)

ttmtrajkovic commented 6 months ago

I am working on adding uint32 support so that you can enable the dropout calculation. Int32 will follow, although it has been requested that int32 be supported for doing math on indices.

dongjin-na commented 6 months ago

Thanks for sharing the status. That's right, I was trying to ask about the status of uint32 support for the dropout op.

May I ask some questions regarding this?

  1. What is the ETA for uint32 support?
  2. A slightly different question: is packing between different formats possible? For example, if the DST register is in uint32 format and the CB to pack into is uint16, I wonder if type casting is possible for each element.

ttmtrajkovic commented 6 months ago

For 1) - end of this week. For 2) - packing between different formats is possible, but not for integers. uint32 -> uint16 cannot be done without proper descaling, so packing uint32 into uint16 isn't supported, since the packer cannot do the descaling and range check. You should be able to use the SFPU to load uint32, descale (if needed) and round to uint16, then pack out as uint16.

I will have support for 2) ready as well.
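
As a conceptual scalar model of the descale + range check described above (the real flow would run per lane in the SFPU; the shift amount is illustrative):

// Descale a uint32 value by a power of two with round-to-nearest, then
// saturate to the uint16 range; a model of the semantics, not a kernel API.
inline uint16_t descale_round_to_u16(uint32_t v, uint32_t shift) {
    const uint32_t rounded = shift ? ((v + (1u << (shift - 1))) >> shift) : v;
    return rounded > 0xFFFFu ? static_cast<uint16_t>(0xFFFF)
                             : static_cast<uint16_t>(rounded);
}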

dongjin-na commented 6 months ago

Thanks for the check and the detailed information.

For 1) - end of this week.

It's great to hear that support will be ready by the end of this week.

For 2) - packing between different formats is possible, but not for integers.

So, is it possible to pack an integer from DST into a CB and vice versa?

uint32 -> uint16 cannot be done without proper descaling, so packing uint32 into uint16 isn't supported, since the packer cannot do the descaling and range check. You should be able to use the SFPU to load uint32, descale (if needed) and round to uint16, then pack out as uint16.

I will have support for 2) ready as well.

If DST registers can hold a different data format in each register (for example, the first tile in the DST register is UInt32 and the second tile is Float32 or bfloat16) and the SFPU can handle them, then I don't think 2) above is necessary. (But when I looked at the microcode, such cases seem to be absent, so I'm unsure whether it's possible.) To explain in more detail: when the RNG uses the SFPU to generate a 32-bit integer random value, the range for this value is [-INTMAX, INTMAX].

ttmtrajkovic commented 6 months ago

DST registers have general modes of operation depending on the size of the datums: 32-bit or 16-bit. If any of the tiles has a 32-bit format, then 32-bit mode has to be selected. If the mode is set to 32-bit, then DST storage is halved and 8 tiles can fit.

You will be able to have the input CB in fp32, unpack into the DST registers (without loss of precision), consume the tiles in the SFPU, and produce output tiles in uint16, as sketched below. The hardware should handle the storage of uint16 elements properly, and the packer will be configured to pack out in uint16. You have to be aware of the reduced capacity (half the DST space is available in 32-bit mode), but everything else should work. Although in theory you could have a different format on every tile, you shouldn't do that, as it would require an API to reconfigure the packer's data format per tile. What I suppose you will be doing is having input tiles in one format and output tiles in some other format; that should work seamlessly.
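
A minimal per-tile sketch of that pattern, using the same calls as the test kernel earlier in this thread and assuming fp32_dest_acc_en is set (the CB formats are fixed at CB-creation time):

// fp32 input CB -> DST (32-bit mode) -> uint16 output CB.
cb_wait_front(tt::CB::c_in0, 1);
cb_reserve_back(tt::CB::c_out0, 1);
acquire_dst(tt::DstMode::Half);
copy_tile_init();
copy_tile(tt::CB::c_in0, 0, 0);   // unpack fp32 tile into DST, no precision loss
// ... SFPU math on DST tile 0 goes here ...
pack_tile(0, tt::CB::c_out0);     // packer configured for the uint16 output CB
release_dst(tt::DstMode::Half);
cb_push_back(tt::CB::c_out0, 1);
cb_pop_front(tt::CB::c_in0, 1);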

I am adding an API to unpack fp32/int32 from a CB directly into DST (without the loss of precision that unpacking into SRC would cause), and then you should see if anything else is missing.

You can do different casts in the SFPU:

There are a lot of details here, and we've never really used our HW in this way, so there might be bugs in some cases, but in general these features should be supported. Once you start using them, we can take a look at the non-working cases.

Does this answer your question?

Milos

dongjin-na commented 6 months ago

I appreciate your detailed answer. Yes, we can take a look at the non-working cases.

jliangTT commented 6 months ago

For 2) - packing between different formats is possible, but not for integers. uint32 -> uint16 cannot be done without proper descaling, so packing uint32 into uint16 isn't supported, since the packer cannot do the descaling and range check. You should be able to use the SFPU to load uint32, descale (if needed) and round to uint16, then pack out as uint16. I will have support for 2) ready as well.

To confirm, @ttmtrajkovic: are you targeting the end of the week for this deliverable as well?

ttmtrajkovic commented 6 months ago

Hello @dongjin-na, I apologize for the delay in delivering this. It's now merged in main, and you can use the UINT32 format and load it into the SFPU. You can take a look at the example SFPU kernel calculate_identity_uint32 in ckernel_sfpu_identity.h, for wormhole_b0.

Merged at: https://github.com/tenstorrent-metal/tt-metal/pull/6796

jliangTT commented 6 months ago

@ttmtrajkovic, the status of this issue is a little weird - it went from closed, to promoted to P0_showstopper, to re-opened. Is this intentional?

ttmtrajkovic commented 6 months ago

Hey Jason, I can't comment on the previous actions, but I've completed all the work and it's up to Moreh to give it a try. I initially closed the issue but then reopened it so that the Moreh team could acknowledge the work. I'm OK either way, but it would be good to see if it works for them.

Milos

jliangTT commented 6 months ago

Got it, @ttmtrajkovic, thanks for the clarification. @razorback3, please review and close once you are okay with it.

dongjin-na commented 6 months ago

Hi @ttmtrajkovic, thanks for your work.

I've checked the identity uint32 example you mentioned, and it works well. :) Now I am writing a kernel that uses the SFPU with a uint32 data format tile. I have some queries regarding using uint32 in the SFPU.

fp32_dest_acc_en setting

I've discovered that copy_tile() doesn't work for a uint32 CB when the fp32_dest_acc_en flag is false. Setting this flag is required, right?

packing to 'bfloat16' data format CB

Packing a uint32 data format tile calculated in the DST register by the SFPU directly into a float32 or bfloat16 data format output CB works properly.

However, when an intermediate CB is used (e.g., c_intermed0), the bfloat16 output CB does not seem to be packed correctly. The relevant code examples can be found in the sfpu_uint32 branch of the tt-metal repository: https://github.com/tenstorrent-metal/tt-metal/commit/95a2910866b127a203130d25e449b11e78eda3db

(python_env) ubuntu@tt-metal-dev-moreh-wh-9:~/tt-metal$ pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sfpu.py::test_sfpu_bfloat16_output_is_weird

......
2024-04-02 11:09:25.889 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_sfpu:moreh_sfpu_test:151 - tt_input ttnn.Tensor([[[[    0,     1,  ...,    46,    47],
               [   64,    65,  ...,   110,   111],
               ...,
               [  912,   913,  ...,   958,   959],
               [  976,   977,  ...,  1022,  1023]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::UINT32, layout=Layout::TILE)
2024-04-02 11:09:25.889 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_sfpu:moreh_sfpu_test:152 - tt_output ttnn.Tensor([[[[ 0.00000,  0.00000,  ...,  0.00000,  0.00000],
               [ 0.00000,  0.00000,  ...,  0.00000,  0.00000],
               ...,
               [ 0.00000,  0.00000,  ...,  0.00000,  0.00000],
               [ 0.00000,  0.00000,  ...,  0.00000,  0.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)

# The expected result is that real values from 0 to 1023 should be stored in the output tensor.

Interestingly, I found a solution while writing test code for this issue. A workaround is to always call pack_reconfig_data_format before calling the pack_tile function, as in the sketch below. However, since the data format of the c_intermed0 CB is the same as that of the c_in0 CB, I am curious whether this is a proper solution.
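
A sketch of that workaround in the compute kernel (the tile indices are illustrative):

// Workaround: reconfigure the packer for the destination CB before each
// pack_tile call, even when the formats look like they already match.
pack_reconfig_data_format(tt::CB::c_intermed0);
pack_tile(0, tt::CB::c_intermed0);
// ... later, before packing the final result ...
pack_reconfig_data_format(tt::CB::c_out0);
pack_tile(0, tt::CB::c_out0);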

(python_env) ubuntu@tt-metal-dev-moreh-wh-9:~/tt-metal$ pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sfpu.py::test_sfpu_bfloat16_output_is_ok

......
2024-04-02 11:09:33.848 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_sfpu:moreh_sfpu_test:151 - tt_input ttnn.Tensor([[[[    0,     1,  ...,    46,    47],
               [   64,    65,  ...,   110,   111],
               ...,
               [  912,   913,  ...,   958,   959],
               [  976,   977,  ...,  1022,  1023]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::UINT32, layout=Layout::TILE)
2024-04-02 11:09:33.849 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_sfpu:moreh_sfpu_test:152 - tt_output ttnn.Tensor([[[[ 0.00000,  1.00000,  ..., 46.00000, 47.00000],
               [64.00000, 65.00000,  ..., 110.00000, 111.00000],
               ...,
               [912.00000, 912.00000,  ..., 956.00000, 956.00000],
               [976.00000, 976.00000,  ..., 1020.00000, 1020.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)

ttmtrajkovic commented 6 months ago

fp32_dest_acc_en setting

I've discovered that copy_tile() doesn't work for a uint32 CB when the fp32_dest_acc_en flag is false. Setting this flag is required, right?

Yes, the fp32_dest_acc_en flag should be set when you expect data in the DST registers to be stored in 32 bits. I've added it only to the unary op for testing, but you should have it set in your op (if it operates on uint32).

packing to 'bfloat16' data format CB

If the output CB and the intermediate CB have different formats, then a pack reconfig has to be called. I am not sure why uint32 output would be packed into a float32/bfloat16 CB, though...

dongjin-na commented 6 months ago

I am not sure why uint32 output would be packed into a float32/bfloat16 CB, though...

Our random number generator (RNG) generates random integers first and then scales them to floats in the range 0.0 to 1.0, conceptually as in the sketch below.
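
For context, a scalar model of that scaling step (assuming the 31-bit magnitude of the generated value is used):

// Map a random integer's 31-bit magnitude to a float in [0.0, 1.0].
inline float scale_to_unit_float(uint32_t r) {
    return static_cast<float>(r & 0x7FFFFFFFu) / 2147483648.0f;   // divide by 2^31
}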

If the output CB and the intermediate CB have different formats, then a pack reconfig has to be called.

I see, but for packing to the c_intermed0 CB, I don't think there should be a need for pack_reconfig_data_format, because the data format of both the DST register and c_intermed0 is UINT32; yet if this function is not called, the results aren't correct.

ttmtrajkovic commented 6 months ago

The reason you need to reconfigure there is that the default packer configuration happens in init_sfpu -> unary_op_init_common; the function is llk_pack_hw_configure_disaggregated. It is called for the output CB, so the packer is configured for the output CB's format. Since you invoke packing to intermediate 0 before packing to the output, you need to reconfigure.

dongjin-na commented 6 months ago

I got it. Thanks for the check.

Regarding integer operations in the SFPU: positive integer operations seem to work well. (Negative integer operations, including bit operations, behave very differently from 2's complement, so I will use positive integers only.)

I think this issue can be closed.

If a problem arises later, I will create a new issue.