Open muthutt opened 5 months ago
Some limits for typecast of int/uint -> double:

| Type     | MIN                  | MAX                  |
| -------- | -------------------- | -------------------- |
| UINT8_T  | 0                    | 255                  |
| UINT16_T | 0                    | 65535                |
| UINT32_T | 0                    | 4294967295           |
| UINT64_T | 0                    | 18446744073709551615 |
| INT8_T   | -128                 | 127                  |
| INT16_T  | -32768               | 32767                |
| INT32_T  | -2147483648          | 2147483647           |
| INT64_T  | -9223372036854775808 | 9223372036854775807  |
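As a side note on the int/uint -> double direction: a double has a 53-bit significand, so every integer up to 2**53 converts exactly, but the upper ranges of INT64/UINT64 do not. A quick Python sketch of the limits above and of where exactness breaks down:

```python
# Integer limits follow directly from the bit width:
# unsigned: [0, 2**n - 1]; signed two's complement: [-2**(n-1), 2**(n-1) - 1]
def int_limits(bits, signed):
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1

assert int_limits(8, signed=False) == (0, 255)
assert int_limits(32, signed=True) == (-2147483648, 2147483647)
assert int_limits(64, signed=False) == (0, 18446744073709551615)

# A float64 has a 53-bit significand, so above 2**53 not every
# integer is representable and the typecast starts rounding:
assert float(2**53) == 2**53             # exact
assert float(2**53 + 1) == float(2**53)  # rounded: exactness lost
```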
float -> int has landed.
PR reviewed here - https://github.com/tenstorrent-metal/tt-metal/pull/5203
Backward here - #3899
Recommendation: can we limit the scope of this issue to the specific conversions between uint32/int16 <-> float and have @ttmtrajkovic deliver on this? For further generality support, let's make a separate request.
- UINT32 -> BFLOAT16: OK
- UINT32 -> BFP8: OK
- UINT32 -> FP32: Not supported
- UINT32 -> FP16: Not supported
- BFP8 -> UINT32: OK
- BFLOAT16 -> UINT32: OK
- FP32 -> UINT32: Not supported
- FP16 -> UINT32: Not supported
Is my understanding correct?
@razorback3,
I am not sure if any of the conversions work for you yet; they might work using the staircase (slow) method. Efficient implementations are not available at the moment. I can work on the following conversions:
- UINT32 -> BFLOAT16
- UINT16 -> BFLOAT16
- BFLOAT16 -> UINT32
- BFLOAT16 -> UINT16
UINT16 conversions will be super fast, using the built-in instruction, while UINT32 conversions will need some custom sfpi code that does exponent extraction, shifting, and rounding. Do you need the UINT16 conversions?
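For reference, here is a software model of the numerics such a UINT32 -> BFLOAT16 conversion has to perform (a hypothetical Python sketch, not the actual sfpi kernel, and ignoring rare double-rounding edge cases): normalize through float32, then round the encoding to its top 16 bits, ties to even.

```python
import struct

def uint32_to_bfloat16_bits(x):
    # Step 1: value -> float32 bit pattern (Python ints < 2**53 convert to
    # float64 exactly; struct's "<f" pack then rounds to nearest float32).
    f32 = struct.unpack("<I", struct.pack("<f", float(x)))[0]
    # Step 2: keep the top 16 bits (bfloat16 is the upper half of float32),
    # rounding to nearest, ties to even.
    lower, upper = f32 & 0xFFFF, f32 >> 16
    if lower > 0x8000 or (lower == 0x8000 and (upper & 1)):
        upper += 1
    return upper & 0xFFFF

assert uint32_to_bfloat16_bits(0) == 0x0000
assert uint32_to_bfloat16_bits(255) == 0x437F  # 255.0 is exactly representable
assert uint32_to_bfloat16_bits(257) == 0x4380  # tie rounds to even -> 256.0
assert uint32_to_bfloat16_bits(259) == 0x4382  # tie rounds to even -> 260.0
```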
As a second step, I can provide support for the following:
- FP32 -> UINT32/UINT16
- UINT16/UINT32 -> FP32
The FP16 format is not supported in the ttnn stack, so you can't use it.
Let me know if you have any questions.
Milos
Here is Moreh's priority:
Number 1 is highest.
Thank you :)
Item 2 should already be available if you're OK with just simple truncation (no rounding). Setting up the input tensor as FP32 and the output tensor as BFLOAT16 in the simple copy op should do the work.
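The truncation mentioned above amounts to dropping the low 16 bits of the float32 encoding, since bfloat16 is the upper half of a float32. A small Python model (hypothetical sketch, not the op itself):

```python
import struct

def fp32_to_bf16_truncate(f):
    # bfloat16 is the top 16 bits of the float32 encoding, so simple
    # truncation just shifts away the low 16 bits (no rounding;
    # the result is always rounded toward zero in magnitude).
    bits = struct.unpack("<I", struct.pack("<f", f))[0]
    return bits >> 16

assert fp32_to_bf16_truncate(1.0) == 0x3F80

# The float32 just below 2.0 (bit pattern 0x3FFFFFFF) truncates down to
# 1.9921875 (0x3FFF), whereas round-to-nearest would give 2.0 (0x4000):
just_below_two = struct.unpack("<f", struct.pack("<I", 0x3FFFFFFF))[0]
assert fp32_to_bf16_truncate(just_below_two) == 0x3FFF
```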
milos
@ttmtrajkovic , can you provide the status for the support for the following?
- UINT32 -> BFLOAT16
- UINT16 -> BFLOAT16
- BFLOAT16 -> UINT32
- BFLOAT16 -> UINT16
update from speaking to @ttmtrajkovic:
no new status. will check in in a week.
Synced with @razorback3, these are outstanding:
- FP32 -> UINT32/UINT16
- UINT16/UINT32 -> FP32
Hi @razorback3, @davorchap,
I haven't been able to work on these in the past few weeks, however, those are now reassigned to @rdjogoTT so some progress is being made. Stay tuned for updates
Great, thank you!
Progress update:
UINT16 should be taken care of by end of Tuesday, with work on UINT(32/16) -> FP32 being greatly accelerated thanks to progress here
Please see an update on #8540 https://github.com/tenstorrent/tt-metal/issues/8540#issuecomment-2143229766
CC @davorchap @rdjogoTT @eyonland
@rdjogoTT how is the remaining work going? any updates?
FP32 -> UINT16 is nearing completion; the LLK is functional and tested. I just need to add params to the op now to be able to choose which typecast LLK to use depending on the desired dtypes.
@rdjogoTT completed the typecast for BFLOAT16 --> UINT16
@razorback3 before I start work on UINT32 -> BFLOAT16, would it be sufficient to support INT32 -> BFLOAT16 or do we specifically need UINT32? There is a faster path for implementing INT32.
For now, int32 should be affordable. I think there is no use-case that uses full bits of uint32 right now.
@ayerofieiev-tt raised the concern that if we implement INT32->BFLOAT16, the BFLOAT16->UINT32->BFLOAT16 loop would not be possible. Is this loop a requirement or can we just move forward with INT32 for now?
Progress update:
UINT16 -> BFLOAT16 implemented and merged into main. INT32 -> BFLOAT16 kernel writing underway, should be completed Monday. Opting for INT32 rather than UINT32 due to faster implementation path and no use case that needs full uint32 bits for now.
@rdjogoTT, I thought that if INT32 -> BFLOAT16 is supported, then BFLOAT16 -> INT32 will also be available. So the loop can be BFLOAT16 -> INT32 -> BFLOAT16. Would that be possible?
@rdjogoTT, @ayerofieiev-tt, can anyone summarize the currently supported conversions? I mean what is supported and what is not supported.
Ok, I will make sure the loop is supported.
Currently supported conversions:
Not yet supported (next step):
To make it clear, FP32 <-> BF16 is also currently working with bit truncation, right? And does WH support conversion between them with rounding?
The answer is yes to both questions. For the second question: this would require unpack-to-dest support for FP32 to be added, as well as a new sfpu kernel.
I see. Then, I think this will be the remaining request from Moreh to unblock LLM training:
- INT32 <-> BFLOAT16
- FP32 <-> BFLOAT16
Thanks for your support :)
(cc. @dongjin-na , @namhyeong-kim )
Could you just please specify if you need FP32 -> BFLOAT16 with rounding or not?
Yeah. I meant with rounding.
Progress update:
Can we get support for UINT16 -> UINT32 too? https://github.com/tenstorrent/tt-metal/issues/9441 I see you have UINT16 -> BFLOAT16 and BFLOAT16 -> UINT32 working.
Background on this: we are asking for UINT16->UINT32 because the top-k op outputs UINT16 only but the embedding op accepts UINT32 only and LLMs need to chain these together.
An alternative is UINT32 top-k output or UINT16 embedding input support. But this is a strange thing to be missing either way.
@yieldthought, UINT32/INT32 for the indices of top-k can be done, but it will be slow as the HW has a limitation of not being able to efficiently transpose INT32 numbers that are in tiles. UINT16 should be OK for the embedding table sizes we have (the exception is Llama 3, which recently came out), so it was OK to proceed with the UINT16 implementation.
I think support for the typecast @sraizada-tt requested should be easy to add, for now, until we plan for int32/uint32 top-k indices.
Milos
In addition to Llama 3, Grok-1 also uses a 128k embedding size (which I am bringing up this week). I would not be surprised, especially given Llama 3's success, if larger embedding sizes become the norm. UINT16 is de facto not OK for embedding table sizes.
What is our plan to support top-k for modern model architectures?
if by modern model architectures you mean embeddings larger than 64k then we will have to implement uint32 as indices and do some work on the existing top-k implementation. it's doable, but it will take some time that we need to scope
Progress update:
- FP32 <-> BFLOAT16: merged
- FP32 <-> UINT16: next up
@rdjogoTT Thanks for your update. Do you have an expected date for FP32 <-> UINT16?
I should be able to get both directions tested and in by the end of tomorrow.
Progress update:
- FP32 <-> UINT16: merged
- FP32 <-> INT32: by June 27
@razorback3 With the latest merge I have added support for all typecasts requested except UInt32 cases, for which we said Int32 should be good enough for now. I will continue to implement the remaining few typecasts, but can you confirm that this is no longer blocking other work?
Yup, no more blocker for right now. Great work!
@razorback3 All requested typecast variations are now merged, I think this issue can be closed.
Parent Issue: #9106
Current status:
---- BFLOAT16 MNIST & GPT training unblocked -----
---- FP32 MNIST & GPT training unblocked -----