Open muthutt opened 5 months ago
Some limits for typecast of int/uint -> double:

| Type     | MIN                  | MAX                  |
| -------- | -------------------- | -------------------- |
| UINT8_T  | 0                    | 255                  |
| UINT16_T | 0                    | 65535                |
| UINT32_T | 0                    | 4294967295           |
| UINT64_T | 0                    | 18446744073709551615 |
| INT8_T   | -128                 | 127                  |
| INT16_T  | -32768               | 32767                |
| INT32_T  | -2147483648          | 2147483647           |
| INT64_T  | -9223372036854775808 | 9223372036854775807  |
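As a side note on the int/uint -> double direction: a double has a 53-bit significand, so every integer up to 2**53 converts exactly, but the upper ranges of INT64/UINT64 do not. A quick Python sketch of the limits above and of where exactness breaks down:

```python
# Integer limits follow directly from the bit width:
# unsigned: [0, 2**n - 1]; signed two's complement: [-2**(n-1), 2**(n-1) - 1]
def int_limits(bits, signed):
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1

assert int_limits(8, signed=False) == (0, 255)
assert int_limits(32, signed=True) == (-2147483648, 2147483647)
assert int_limits(64, signed=False) == (0, 18446744073709551615)

# A float64 has a 53-bit significand, so above 2**53 not every
# integer is representable and the typecast starts rounding:
assert float(2**53) == 2**53             # exact
assert float(2**53 + 1) == float(2**53)  # rounded: exactness lost
```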
float -> int has landed.
PR reviewed here - https://github.com/tenstorrent-metal/tt-metal/pull/5203
Backward here - #3899
Recommendation: can we limit the scope of this issue to the specific conversions between uint32/int16 <-> float and have @ttmtrajkovic deliver on this? For further generality support, let's make a separate request.
- UINT32 -> BFLOAT16: OK
- UINT32 -> BFP8: OK
- UINT32 -> FP32: Not supported
- UINT32 -> FP16: Not supported
- BFP8 -> UINT32: OK
- BFLOAT16 -> UINT32: OK
- FP32 -> UINT32: Not supported
- FP16 -> UINT32: Not supported
Is my understanding correct?
@razorback3,
I am not sure if any of the conversions work for you yet; they might work using the staircase (slow) method. Efficient implementations are not available at the moment. I can work on the following conversions:
- UINT32 -> BFLOAT16
- UINT16 -> BFLOAT16
- BFLOAT16 -> UINT32
- BFLOAT16 -> UINT16
UINT16 conversions will be super fast, using the built-in instruction, while UINT32 conversions will need some custom sfpi code that does exponent extraction, shifting, and rounding. Do you need the UINT16 conversions?
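For reference, here is a software model of the numerics such a UINT32 -> BFLOAT16 conversion has to perform (a hypothetical Python sketch, not the actual sfpi kernel, and ignoring rare double-rounding edge cases): normalize through float32, then round the encoding to its top 16 bits, ties to even.

```python
import struct

def uint32_to_bfloat16_bits(x):
    # Step 1: value -> float32 bit pattern (Python ints < 2**53 convert to
    # float64 exactly; struct's "<f" pack then rounds to nearest float32).
    f32 = struct.unpack("<I", struct.pack("<f", float(x)))[0]
    # Step 2: keep the top 16 bits (bfloat16 is the upper half of float32),
    # rounding to nearest, ties to even.
    lower, upper = f32 & 0xFFFF, f32 >> 16
    if lower > 0x8000 or (lower == 0x8000 and (upper & 1)):
        upper += 1
    return upper & 0xFFFF

assert uint32_to_bfloat16_bits(0) == 0x0000
assert uint32_to_bfloat16_bits(255) == 0x437F  # 255.0 is exactly representable
assert uint32_to_bfloat16_bits(257) == 0x4380  # tie rounds to even -> 256.0
assert uint32_to_bfloat16_bits(259) == 0x4382  # tie rounds to even -> 260.0
```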
As a second step, I can provide support for the following:
- FP32 -> UINT32/UINT16
- UINT16/UINT32 -> FP32
The FP16 format is not supported in the ttnn stack, so you can't use it.
Let me know if you have any questions.
Milos
Here is Moreh's priority:
Number 1 is highest.
Thank you :)
Item 2 should already be available if you're OK with just simple truncation (no rounding). Setting up the input tensor as FP32 and the output tensor as BFLOAT16 in the simple copy op should do the work.
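The truncation mentioned above amounts to dropping the low 16 bits of the float32 encoding, since bfloat16 is the upper half of a float32. A small Python model (hypothetical sketch, not the op itself):

```python
import struct

def fp32_to_bf16_truncate(f):
    # bfloat16 is the top 16 bits of the float32 encoding, so simple
    # truncation just shifts away the low 16 bits (no rounding;
    # the result is always rounded toward zero in magnitude).
    bits = struct.unpack("<I", struct.pack("<f", f))[0]
    return bits >> 16

assert fp32_to_bf16_truncate(1.0) == 0x3F80

# The float32 just below 2.0 (bit pattern 0x3FFFFFFF) truncates down to
# 1.9921875 (0x3FFF), whereas round-to-nearest would give 2.0 (0x4000):
just_below_two = struct.unpack("<f", struct.pack("<I", 0x3FFFFFFF))[0]
assert fp32_to_bf16_truncate(just_below_two) == 0x3FFF
```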
milos
@ttmtrajkovic , can you provide the status for the support for the following?
- UINT32 -> BFLOAT16
- UINT16 -> BFLOAT16
- BFLOAT16 -> UINT32
- BFLOAT16 -> UINT16
update from speaking to @ttmtrajkovic:
no new status. will check in in a week.
Synced with @razorback3, these are outstanding:
- FP32 -> UINT32/UINT16
- UINT16/UINT32 -> FP32
Hi @razorback3, @davorchap,
I haven't been able to work on these in the past few weeks, however, those are now reassigned to @rdjogoTT so some progress is being made. Stay tuned for updates
Great, thank you!
Progress update:
UINT16 should be taken care of by end of Tuesday, with work on UINT(32/16) -> FP32 being greatly accelerated thanks to progress here
Please see an update on #8540 https://github.com/tenstorrent/tt-metal/issues/8540#issuecomment-2143229766
CC @davorchap @rdjogoTT @eyonland
@rdjogoTT how is the remaining work going? any updates?
FP32 -> UINT16 is nearing completion; the LLK is functional and tested. I just need to add params to the op now to be able to choose which typecast LLK to use depending on the desired dtypes.
@rdjogoTT completed the typecast for BFLOAT16 --> UINT16
@razorback3 before I start work on UINT32 -> BFLOAT16, would it be sufficient to support INT32 -> BFLOAT16 or do we specifically need UINT32? There is a faster path for implementing INT32.
For now, int32 should be affordable. I think there is no use-case that uses full bits of uint32 right now.
@ayerofieiev-tt raised the concern that if we implement INT32->BFLOAT16, the BFLOAT16->UINT32->BFLOAT16 loop would not be possible. Is this loop a requirement or can we just move forward with INT32 for now?
Progress update:
UINT16 -> BFLOAT16 implemented and merged into main. INT32 -> BFLOAT16 kernel writing underway, should be completed Monday. Opting for INT32 rather than UINT32 due to faster implementation path and no use case that needs full uint32 bits for now.
@rdjogoTT, I thought that if INT32 -> BFLOAT16 is supported, then BFLOAT16 -> INT32 will also be available. So the loop can be BFLOAT16 -> INT32 -> BFLOAT16. Would that be possible?
@rdjogoTT, @ayerofieiev-tt, can anyone summarize the currently supported conversions? I mean what is supported and what is not supported.
Ok, I will make sure the loop is supported.
Currently supported conversions:
Not yet supported (next step):
To make it clear, FP32 <-> BF16 is also currently working with bit truncation, right? And does WH support conversion between them with rounding?
The answer is yes to both questions. For the second question: this would require unpack-to-dest support for FP32 to be added, as well as a new sfpu kernel.
I see. Then, I think this will be the remaining request from Moreh to unblock LLM training:
- INT32 <-> BFLOAT16
- FP32 <-> BFLOAT16
Thanks for your support :)
(cc. @dongjin-na , @namhyeong-kim )
Could you just please specify if you need FP32 -> BFLOAT16 with rounding or not?
Yeah. I meant with rounding.
Progress update:
Can we get support for UINT16 -> UINT32 too? https://github.com/tenstorrent/tt-metal/issues/9441 I see you have UINT16 -> BFLOAT16 and BFLOAT16 -> UINT32 working.
Background on this: we are asking for UINT16->UINT32 because the top-k op outputs UINT16 only but the embedding op accepts UINT32 only and LLMs need to chain these together.
An alternative is UINT32 top-k output or UINT16 embedding input support. But this is a strange thing to be missing either way.
@yieldthought, UINT32/INT32 for the indices of top-k can be done, but it will be slow as the HW has a limitation of not being able to efficiently transpose INT32 numbers that are in tiles. UINT16 should be OK for the embedding table sizes we have (the exception is Llama 3, which recently came out), so it was OK to proceed with the UINT16 implementation.
I think support for the typecast @sraizada-tt requested should be easy to add, for now, until we plan for int32/uint32 top-k indices.
Milos
In addition to Llama 3, Grok-1 also uses a 128k embedding size (which I am bringing up this week). I would not be surprised, especially given Llama 3's success, if larger embedding sizes become the norm. UINT16 is de facto not OK for embedding table sizes.
What is our plan to support top-k for modern model architectures?
if by modern model architectures you mean embeddings larger than 64k then we will have to implement uint32 as indices and do some work on the existing top-k implementation. it's doable, but it will take some time that we need to scope
Progress update:
- FP32 <-> BFLOAT16: merged
- FP32 <-> UINT16: next up
@rdjogoTT Thanks for your update. Do you have an expected date for FP32 <-> UINT16?
I should be able to get both directions tested and in by the end of tomorrow.
Progress update:
- FP32 <-> UINT16: merged
- FP32 <-> INT32: by June 27
@razorback3 With the latest merge I have added support for all typecasts requested except UInt32 cases, for which we said Int32 should be good enough for now. I will continue to implement the remaining few typecasts, but can you confirm that this is no longer blocking other work?
Yup, no more blocker for right now. Great work!
@razorback3 All requested typecast variations are now merged, I think this issue can be closed.
Parent Issue: #9106
Current status:
---- BFLOAT16 MNIST & GPT training unblocked -----
---- FP32 MNIST & GPT training unblocked -----