tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Add support for vector-mode SFPU (WH/GS) #3102

Open kkwong10 opened 1 year ago

kkwong10 commented 1 year ago

Motivation

Reciprocal is one of the slowest SFPU ops and is used very often (in Softmax and Layernorm), so it adds a non-trivial amount of time: in Softmax, a staggering 18% of the time is spent doing reciprocal on 32x32 tiles, even though we only need the reciprocal of the left column (32 values out of 32x32). In the main use cases, reciprocal follows a reduce, and the result of the reduce is stored as a full 32x32 tile even though only 32 of its values are relevant.

Proposal

By utilizing the existing LLK vector mode (which computes on a 32x1 column vector, Dim::C, or a 1x32 row vector, Dim::R), we can get up to 8x-16x improvement on reciprocal, or on any SFPU op whose meaningful input is a vector.

LLK Support (Already there)

There's already support for vector mode in the LLK, which runs the SFPU for a smaller number of iterations to cut cycles. For a column vector the speedup is 2x; for a row vector it is 8x on GS and 16x on WH. Our SIMD footprint per SFPU iteration is 16x2 datums on WH and 16x4 on GS, and tiles are stored as 16x16 faces. An example: for a 1x32 row, the row is split across the two top 16x16 faces, and one 16x2 iteration per face covers it, so we crunch 16x2 * 2 (one iteration for each of the two top faces). That's a 16x speedup on WH, since we need 2 iterations instead of 32. If we could additionally use the reader to reorganize the data into a packed format, we could get to 32x, since we'd call the SIMD once instead of 32 times.
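To make the arithmetic above concrete, here is a back-of-the-envelope sketch (plain C++, not LLK code) of the iteration counts, assuming a 32x32 tile stored as four 16x16 faces and a SIMD footprint of 16x2 datums per SFPU iteration on WH (16x4 on GS); all names are illustrative.

#include <cstdio>

// Datums covered per SFPU iteration under the assumed SIMD footprints.
constexpr int kSimdWH = 16 * 2;  // 32 datums/iteration on WH
constexpr int kSimdGS = 16 * 4;  // 64 datums/iteration on GS

int main() {
    constexpr int kTileDatums = 32 * 32;  // full 32x32 tile

    // Full tile: every datum must be touched.
    int full_wh = kTileDatums / kSimdWH;  // 32 iterations
    int full_gs = kTileDatums / kSimdGS;  // 16 iterations

    // Row vector (1x32): lives in the two top 16x16 faces; one iteration
    // per face covers its 1x16 slice, so 2 iterations on either arch.
    int row_iters = 2;

    // Col vector (32x1): spans all 16 rows of the two left faces; each
    // iteration only advances 2 rows on WH (4 on GS).
    int col_wh = (16 / 2) * 2;  // 16 iterations
    int col_gs = (16 / 4) * 2;  // 8 iterations

    printf("WH speedup: row %dx, col %dx\n", full_wh / row_iters, full_wh / col_wh);  // 16x, 2x
    printf("GS speedup: row %dx, col %dx\n", full_gs / row_iters, full_gs / col_gs);  // 8x, 2x
    return 0;
}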

Work needed

muthutt commented 1 year ago

I'd like to follow this task as well; we think some unrolling can help performance, and I'm exploring it in the ma/recip branch to improve the throughput of the recip op.

kkwong10 commented 1 year ago

FYI, there isn't much to be done here. When I went through, all the LLK kernels already support vector mode, even the new ones from the looks of it. All that remains is adding the hookup logic, the ops-library support that uses it, and pytests.

kkwong10 commented 1 year ago

@muthutt Are you planning on taking this task on, or are you just doing the unrolling investigation?

kkwong10 commented 1 year ago

@davorchap @pgkeller @muthutt How do you envision this looking for end users of the compute API? I wanted your thoughts on this.

The current compute-API convention is to put _cols and _rows in the function name. For bcast, for example:

mul_tiles_bcast_cols ...
mul_tiles_bcast_rows ...

// Initial proposal following naming scheme
void recip_tile(uint32_t idst) {...}
void recip_row(uint32_t idst) {...}
void recip_col(uint32_t idst) {...}
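If we go with the suffix naming, the _row/_col variants could be thin wrappers over one shared implementation that forwards a vector-mode flag down to the existing LLK call. A minimal sketch, assuming a hypothetical recip_tile_impl hook (all names below are illustrative, not the actual LLK signatures):

#include <cstdint>

// Hypothetical vector-mode enum mirroring the LLK Dim::R / Dim::C modes.
enum class VectorMode { R, C, RC };

// Illustrative shared hook; the real thing would forward `mode` to the
// existing LLK SFPU reciprocal so it runs the reduced iteration count.
inline void recip_tile_impl(uint32_t idst, VectorMode mode) {
    (void)idst; (void)mode;  // ... call into the LLK here ...
}

// Thin wrappers following the existing _rows/_cols naming convention.
inline void recip_tile(uint32_t idst) { recip_tile_impl(idst, VectorMode::RC); }
inline void recip_row(uint32_t idst)  { recip_tile_impl(idst, VectorMode::R);  }
inline void recip_col(uint32_t idst)  { recip_tile_impl(idst, VectorMode::C);  }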

I personally prefer the approach of using a Dim parameter, as follows, but that might be too different from the existing convention, so I'm leaning towards the initial proposal.

void recip(uint32_t idst, Dim dim) {...} // Dim can be Dim::R, Dim::C, or Dim::RC
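For comparison, the two options at a call site (purely illustrative; the dst index and surrounding kernel boilerplate are omitted):

// Suffix-style API: the variant is chosen by name.
recip_col(0);        // only the 32x1 reduce result is valid

// Dim-parameter API: the variant is chosen by argument.
recip(0, Dim::C);    // same computation, single entry point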

Let me know if you have a preference.