Closed dongjin-na closed 6 months ago
More context about this: this is part of an attempt to implement a suitable RNG for the dropout operation (utilizing integer arithmetic in the kernel).
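To make the idea concrete, here is a minimal host-side sketch of the kind of integer-only PRNG that could back a dropout mask. This is purely illustrative (a plain xorshift32, not the actual tt-metal RNG); the function name is hypothetical.

```cpp
#include <cstdint>

// Illustrative only: a minimal xorshift32 PRNG. Dropout could keep an element
// when the generated value falls below a keep-probability threshold.
static inline uint32_t xorshift32(uint32_t& state) {
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;  // state must be seeded nonzero
}
```

All three steps are shifts and XORs, so the generator needs only the integer arithmetic available in the kernel.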
I will categorize this as P2 for now, given that it is associated with a new feature (dropout) that needs to be discussed.
Please provide a simple example that we can reproduce and debug
I tested Uint16 and Uint32 as inputs; the output is a mix of zeros and garbage. The kernel and host code look fine, so the issue is probably within the llks.
In WHB0, directly loading Uint16 data into src registers is not supported, as the FPU supports only the (u)int8 data format. Instead, we can load Uint16 into dest registers and run SFPU integer ops. In this mode, dest has to be set up as int32. The packer can then pack int32 or int8 output.
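As a host-side illustration of that data flow (names are hypothetical, not tt-metal API): uint16 inputs are widened to int32 in dest, integer math runs at 32 bits, and the packer emits either int32 directly or a saturated int8.

```cpp
#include <cstdint>
#include <algorithm>

// Widen a uint16 input to the int32 dest format (lossless).
static inline int32_t widen_u16_to_i32(uint16_t v) {
    return static_cast<int32_t>(v);
}

// Pack an int32 result down to int8 with saturation, since int8 cannot
// represent the full int32 range.
static inline int8_t pack_i32_to_i8_saturating(int32_t v) {
    return static_cast<int8_t>(std::clamp<int32_t>(v, -128, 127));
}
```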
Dear @yugaoTT,
> Instead we can load Uint16 into dest registers and run SFPU integer ops. In this mode dest has to be setup to int32.
Could you share an example?
hey @yugaoTT, @mywoodstock,
With the latest uplift of the wormhole llks and my changes in data_format.cpp, we should be able to do datacopy/move and sfpu ops with uint16. Is there a test that I can use to try this?
While we wait for @yugaoTT and @mywoodstock to comment, a good reference point is this PR from Moreh: https://github.com/tenstorrent-metal/tt-metal/pull/4964
Yes, as @jliangTT mentioned, that is the one I used to test out Uint16.
UInt16 support for moving data through tensix core should be working.
However, in order for sfpu operations to use uint16, it would require sfpi compiler support to recognize uint16 as a separate type and set the instr_mode of the SFPLOAD and SFPSTORE instructions to 6.
@pgkeller, is it possible to recognize uint16 as a type through sfpi currently?
Milos
No, not at present. Design for other data types is in progress
Probably a stupid question..
> UInt16 support for moving data through tensix core should be working. However, in order for sfpu operations to use uint16, it would require sfpi compiler support
So is this still useful, @razorback3 at the current level of support?
Dear @jliangTT, to implement a random state, we need operators that support the UInt32 data type.
> Probably a stupid question..
> UInt16 support for moving data through tensix core should be working. However, in order for sfpu operations to use uint16, it would require sfpi compiler support
> So is this still useful, @razorback3 at the current level of support?
No 😢. Please refer to the above answer from @nadongguri.
SFPU doesn't recognize the uint32 data format, but it does recognize int32, and all the operations you listed are possible with SFPU.
The initial request on this thread was to support Uint16, so it really has to be broken down into a couple of requests:
Milos
@ttmtrajkovic , thanks for the information.
I have a question.
> SFPU doesn't recognize uint32 data format
Is the reason that the SFPU doesn't recognize the uint32 data format that the hardware does not support it?
This issue was created regarding the dropout op implementation. UInt16 support is needed to improve TT's dropout op, and UInt32 support is needed to implement a new dropout op.
Through discussions, we decided to implement a new dropout op so UInt32 support is really needed.
SFPU Hardware doesn't recognize unsigned integer format so once SFPU starts operating with data, everything is signed int32. You can interpret data as 2's complement signed or sign+magnitude (default).
> SFPU Hardware doesn't recognize unsigned integer format so once SFPU starts operating with data, everything is signed int32. You can interpret data as 2's complement signed or sign+magnitude (default).
Thanks for the check. I think using int32 to implement a random state with those operations may be possible instead of uint32. As mentioned above, is Int32 support in tt-metal planned after Uint16 support is completed?
@ttmtrajkovic Additionally, UINT32 data type processing is not implemented in dataflow_api.h (GET_L1_TILE_SIZE and MUL_WITH_TILE_SIZE), which is required for the index-based nll_loss implementation. (https://github.com/tenstorrent-metal/tt-metal/issues/5633)
Currently, we are using InterleavedAddrGen + get_noc_addr + noc_async_read as a workaround instead of InterleavedAddrGenFast + noc_async_read_tile. (https://github.com/tenstorrent-metal/tt-metal/pull/5634)
Thanks. I will address the support in dataflow_api.h if it hasn't been added already. Let me check tomorrow.
Could you also summarize all the requirements with regards to integer support? Hardware can only support Int32, so I assume that target indices should also be represented as int32.
A few other questions:
thanks
Milos
Also followed up with @razorback3 in slack and hopefully this adds more clarity:
@ttmtrajkovic , do you need @razorback3 to break down the requirement of int32 supports further?
Thanks everyone. Int32/UInt32 support is clear to me with regards to the dropout implementation. I will work on adding the correct APIs to: 1) copy data, 2) pack data; 3) bitwise ops support in sfpu already works.
I will fix dataflow_api.h to support int32 and UInt32, but I am not clear on how the target index in int32 is going to be used. The Int32 format in tensix hardware is sign + magnitude, so if you're using it as an operand in riscv, it won't work out of the box (you'd need to convert it to 2's complement).
The HW is a bit tricky when it comes to manipulating raw integer numbers. I will also think about how to enable UInt32, just for the purpose of SFPU bitwise operations.
Milos
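The conversion mentioned above can be sketched as plain C++ (helper names are hypothetical): tensix Int32 is sign + magnitude, while riscv arithmetic expects 2's complement, so a value coming out of dest would need something like this before ordinary C arithmetic.

```cpp
#include <cstdint>

// Sign+magnitude: bit 31 is the sign, bits 30..0 are the magnitude.
static inline int32_t sign_mag_to_twos(uint32_t sm) {
    uint32_t mag = sm & 0x7FFFFFFFu;
    return (sm & 0x80000000u) ? -static_cast<int32_t>(mag)
                              : static_cast<int32_t>(mag);
}

static inline uint32_t twos_to_sign_mag(int32_t v) {
    if (v >= 0) return static_cast<uint32_t>(v);
    // Negate via int64 to avoid overflow on INT32_MIN-adjacent values.
    return 0x80000000u | static_cast<uint32_t>(-static_cast<int64_t>(v));
}
```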
Hey @ttmtrajkovic, do we have an update here?
Spoke to @ttmtrajkovic offline. The current status: UInt16 is supported; Uint32 is not yet. The uint32/int32 support is planned for next week.
Hi, is it possible to share any updates on the progress of int32 support? :)
I am working on adding uint32 support so that you can enable dropout calculation. Int32 will follow, although it has been requested to support int32 for doing math on indices.
Thanks for sharing the status. That's right. I was trying to inquire about the status of uint32 support for dropout op.
Can I ask some questions regarding this?
For 1): end of this week. For 2): packing between different formats is possible, but not for integers. uint32 -> uint16 cannot be done without proper descaling, so packing uint32 into uint16 isn't supported, since the packer cannot do the descaling and range check. You should be able to use sfpu to load the uint32, descale (if needed) and round to uint16, then pack out as uint16.
I will have support for 2) ready as well.
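The "descale" step described above can be illustrated host-side (this is not an sfpu kernel, and the function name is hypothetical): a uint32 value must be brought into uint16 range before packing, e.g. by taking the high 16 bits, since the packer itself does no range check.

```cpp
#include <cstdint>

// Descale a 32-bit value into 16 bits by keeping the high half.
// This preserves the relative ordering of values, which is what matters
// for a uniform random source.
static inline uint16_t descale_u32_to_u16(uint32_t x) {
    return static_cast<uint16_t>(x >> 16);
}
```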
Thanks for the check and detailed information provided.
> For 1) - end of this week.
It's great to hear that support will be ready by the end of this week.
> For 2) - Packing between different formats is possible, but not for integers.
So, is it possible to pack an integer from DST into CB and vice versa?
> uint32 -> uint16 cannot be done without proper descaling so packing uint32 into uint16 isn't supported since packer cannot do the descaling and range check. you should be able to use sfpu to load uint16, descale (if needed) and round to uint16, then pack out as uint16.
> I will have support for 2) ready as well.
If DST registers can have a different data format for each register (for example, the first tile in the dst register is UInt32 and the second tile is Float32 or bfloat16) and the SFPU can handle them, then I don't think 2) above is necessary. (But when I looked at the microcode, such cases seem to be absent, so I'm unsure whether it's possible.) To explain in more detail: when the RNG uses SFPU to generate a 32-bit integer random value, the range of this value is [-INTMAX, INTMAX].
DST registers have general modes of operation depending on the size of the datums: 32-bit or 16-bit. If any of the tiles has a 32-bit format, then the 32-bit mode has to be selected. If the 32-bit mode is selected, then DST storage is halved and 8 tiles can fit.
You will be able to have the input CB in fp32, unpack into DST registers (without loss of precision), consume the tiles in sfpu, and produce output tiles in uint16. The hardware should handle the storage of uint16 elements properly, and the packer will be configured to pack out in uint16. You have to be aware of the reduced capacity (half of the DST space is available in 32-bit mode), but everything else should work.
Although in theory you could have a different format on every tile, you shouldn't do that, as it would require the API to reconfigure the packer data format per tile. What I suppose you will be doing is to have input tiles in one format and output tiles in some other format; that should work seamlessly.
I am adding the API to unpack fp32/int32 from a CB directly into DST (without the loss of precision caused by unpacking into SRC), and then you should see if anything else is missing.
You can do different castings in SFPU:
- int32_to_float (sfpi_lib.h)
- float_to_uint16 / float_to_int16 (sfpi_lib.h)
There are a lot of details here, and we've never really used our HW in this way, so there might be bugs for some cases, but in general these features should be supported. Once you start using them, we can take a look at the non-working cases.
Does this answer your question?
Milos
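For intuition, here are host-side analogues of the casts mentioned above. These are assumptions for illustration only, not the actual sfpi_lib.h signatures; the `_ref` names are hypothetical, and the narrowing directions clamp explicitly since the target types cannot hold the full float range.

```cpp
#include <cstdint>
#include <algorithm>

static inline float int32_to_float_ref(int32_t v) {
    return static_cast<float>(v);
}

static inline uint16_t float_to_uint16_ref(float f) {
    // Clamp to [0, 65535] before the narrowing cast.
    return static_cast<uint16_t>(std::clamp(f, 0.0f, 65535.0f));
}

static inline int16_t float_to_int16_ref(float f) {
    // Clamp to [-32768, 32767] before the narrowing cast.
    return static_cast<int16_t>(std::clamp(f, -32768.0f, 32767.0f));
}
```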
I appreciate your detailed answer. Yes, we can take a look at the non-working cases.
> For 2) - Packing between different formats is possible, but not for integers. uint32 -> uint16 cannot be done without proper descaling so packing uint32 into uint16 isn't supported since packer cannot do the descaling and range check. you should be able to use sfpu to load uint16, descale (if needed) and round to uint16, then pack out as uint16. I will have support for 2) ready as well.
To confirm, @ttmtrajkovic: are you targeting the end of the week for these deliverables as well?
hello @dongjin-na,
I apologize for the delay in delivering this. It's now merged in main, and you can use the UINT32 format and load it into sfpu.
You can take a look at the example sfpu kernel calculate_identity_uint32 in ckernel_sfpu_identity.h, for wormhole_b0.
Merged at: https://github.com/tenstorrent-metal/tt-metal/pull/6796
@ttmtrajkovic , the status of this issue is a little weird - it went from closed, to promoting to P0_showstopper, and to re-opened. Is this intentional?
Hey Jason, I can't comment on previous actions, but I've completed all the work and it's up to Moreh to give this a try. I initially closed it but then reopened it so the Moreh team could acknowledge the work. I'm ok either way, but it would be good to see if it works for them.
Milos
Got it, @ttmtrajkovic, thanks for the clarification. @razorback3, please review and close once you are okay.
Hi @ttmtrajkovic, thanks for your work.
I've checked the identity uint32 example you mentioned, and it works well. :) Now, I am writing a kernel using SFPU with a uint32 data format tile. I have some queries regarding using uint32 in SFPU.
fp32_dest_acc_en setting
I've discovered that copy_tile() doesn't work for a uint32 CB when the fp32_dest_acc_en flag is false. Setting this flag is required, right?
Packing a uint32 tile calculated in the dst register by SFPU directly into a float32 or bfloat16 output CB works properly.
However, when an intermediate CB is used (e.g., c_intermed0), the bfloat16 output CB does not seem to be packed correctly.
The relevant code examples can be found in the sfpu_uint32 branch of the tt-metal repository.
link : https://github.com/tenstorrent-metal/tt-metal/commit/95a2910866b127a203130d25e449b11e78eda3db
(python_env) ubuntu@tt-metal-dev-moreh-wh-9:~/tt-metal$ pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sfpu.py::test_sfpu_bfloat16_output_is_weird
......
2024-04-02 11:09:25.889 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.misc.test_sfpu:moreh_sfpu_test:151 - tt_input ttnn.Tensor([[[[ 0, 1, ..., 46, 47],
[ 64, 65, ..., 110, 111],
...,
[ 912, 913, ..., 958, 959],
[ 976, 977, ..., 1022, 1023]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::UINT32, layout=Layout::TILE)
2024-04-02 11:09:25.889 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.misc.test_sfpu:moreh_sfpu_test:152 - tt_output ttnn.Tensor([[[[ 0.00000, 0.00000, ..., 0.00000, 0.00000],
[ 0.00000, 0.00000, ..., 0.00000, 0.00000],
...,
[ 0.00000, 0.00000, ..., 0.00000, 0.00000],
[ 0.00000, 0.00000, ..., 0.00000, 0.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
# The expected result is that real values from 0 to 1023 should be stored in the output tensor.
Interestingly, I found a solution while writing test code for this issue: a workaround is to always call pack_reconfig_data_format before calling pack_tile. However, since the data format of the c_intermed0 CB is the same as that of the c_in0 CB, I am curious whether this is a proper solution.
(python_env) ubuntu@tt-metal-dev-moreh-wh-9:~/tt-metal$ pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_sfpu.py::test_sfpu_bfloat16_output_is_ok
......
2024-04-02 11:09:33.848 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.misc.test_sfpu:moreh_sfpu_test:151 - tt_input ttnn.Tensor([[[[ 0, 1, ..., 46, 47],
[ 64, 65, ..., 110, 111],
...,
[ 912, 913, ..., 958, 959],
[ 976, 977, ..., 1022, 1023]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::UINT32, layout=Layout::TILE)
2024-04-02 11:09:33.849 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.misc.test_sfpu:moreh_sfpu_test:152 - tt_output ttnn.Tensor([[[[ 0.00000, 1.00000, ..., 46.00000, 47.00000],
[64.00000, 65.00000, ..., 110.00000, 111.00000],
...,
[912.00000, 912.00000, ..., 956.00000, 956.00000],
[976.00000, 976.00000, ..., 1020.00000, 1020.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
> fp32_dest_acc_en setting
> I've discovered that copy_tile() doesn't work for a uint32 CB when the fp32_dest_acc_en flag is false. Setting this flag is required, right?
Yes, the fp32_dest_acc_en flag should be set when expecting data in DST registers to be stored in 32 bits. I've added it only to the unary op for testing, but you should have it set in your op (if it operates on uint32).
> packing to 'bfloat16' data format CB
If output CB and intermediate CB have different formats, then pack reconfig has to be called. I am not sure why would uint32 output be packed into float32 / bfloat16 CB though...
> I am not sure why would uint32 output be packed into float32 / bfloat16 CB though...
Our Random Number Generator (RNG) generates random integer numbers first and then scales them to float numbers in the range 0.0 to 1.0.
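That scaling step can be sketched host-side (illustrative only; the function name is hypothetical). Keeping 24 bits of the random value makes every intermediate exactly representable in a float, so the result stays strictly below 1.0.

```cpp
#include <cstdint>

// Map a 32-bit random integer to a uniform float in [0, 1).
// The top 24 bits are used because a float mantissa holds 24 bits exactly.
static inline float scale_to_unit(uint32_t r) {
    return static_cast<float>(r >> 8) * (1.0f / 16777216.0f);  // 2^-24
}
```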
> If output CB and intermediate CB have different formats, then pack reconfig has to be called.
I see, but for packing to the c_intermed0 CB, I don't think there is a need for pack_reconfig, because the data type of both the DST register and c_intermed0 is UINT32; yet if this function is not called, the results aren't correct.
The reason you need to reconfigure there is that the default packer configuration happens in init_sfpu -> unary_op_init_common; the function name is llk_pack_hw_configure_disaggregated. It is called for the output CB, so the packer is configured for the output CB format. Since you invoke packing to intermediate 0 before packing to output, you need to reconfigure.
I got it. Thanks for the check.
Regarding integer operations in SFPU: positive integer operations seem to work well. (Negative integer operations, including bit operations, behave very differently from 2's complement, so I will use positive integers only.)
I think this issue can be closed.
If a problem arises later, I will create a new issue.
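The restriction to positive integers mentioned above can be illustrated with bit patterns (values are illustrative): the same number -1 is encoded differently in sign+magnitude (the SFPU int32 default) and in 2's complement, so bitwise operations on negatives disagree between the two.

```cpp
#include <cstdint>

constexpr uint32_t kMinusOneSignMag = 0x80000001u;  // sign bit + magnitude 1
constexpr uint32_t kMinusOneTwos    = 0xFFFFFFFFu;  // 2's complement -1
constexpr uint32_t kMask            = 0x0000FF00u;

// The same logical value -1 gives different AND results per encoding.
constexpr uint32_t and_sign_mag = kMinusOneSignMag & kMask;  // 0x00000000
constexpr uint32_t and_twos     = kMinusOneTwos & kMask;     // 0x0000FF00
```

For non-negative values the two encodings coincide, which is why sticking to positive integers sidesteps the issue.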
I'm wondering whether current tt-metal supports integer type formats for CBs (UInt16, UInt32, etc.). I revised test_eltwise_unary.cpp and the related kernels, then tested.
I made a simple test.
Result
Is there something I set wrong or missed?
Here's the modified part of test_eltwise_unary.cpp and the eltwise_sfpu.cpp kernel.
Target: WH_B0