Open kkwong10 opened 1 year ago
I'd like to follow this task as well; we think some unrolling can help performance, and I'm exploring it in the ma/recip branch to improve the throughput of the recip op.
FYI, there isn't much to be done here. When I went through the LLK kernels, they all already support vector mode, including the new ones, it looks like. So all that remains is adding the hookup logic, the ops-library support that uses it, and pytests for this.
@muthutt Are you planning to take on this task, or are you just doing the unrolling investigation?
@davorchap @pgkeller @muthutt How do you guys envision this for end users of the compute API? Wanted your thoughts on this.
The current compute paradigm is to put _cols and _rows in the name of the compute API. Examples for bcast:

mul_tiles_bcast_cols ...
mul_tiles_bcast_rows ...
// Initial proposal following naming scheme
void recip_tile(uint32_t idst) {...}
void recip_row(uint32_t idst) {...}
void recip_col(uint32_t idst) {...}
I personally prefer the approach of using a Dim parameter, as follows, but that might be too different, so I'm leaning towards following the initial proposal.
void recip(uint32_t idst, Dim dim) {...} // Dim can be Dim::R, Dim::C, or Dim::RC
Let me know if you guys have a preference.
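For concreteness, here is a minimal sketch of how the two options could sit side by side. The Dim enum, the llk_recip_stub placeholder, and the wrapper bodies are assumptions for illustration only, not the existing compute API:

#include <cstdint>

// Hedged sketch: llk_recip_stub stands in for the underlying SFPU LLK call,
// which reportedly already accepts a vector-mode argument.
enum class Dim : uint8_t { R, C, RC };
void llk_recip_stub(uint32_t idst, Dim dim) { /* dispatch to the SFPU LLK */ }

// Option A: one entry point per vector mode, following the bcast naming scheme.
void recip_tile(uint32_t idst) { llk_recip_stub(idst, Dim::RC); }
void recip_row(uint32_t idst)  { llk_recip_stub(idst, Dim::R); }
void recip_col(uint32_t idst)  { llk_recip_stub(idst, Dim::C); }

// Option B: a single entry point with a Dim parameter; defaulting to Dim::RC
// keeps existing full-tile call sites unchanged.
void recip(uint32_t idst, Dim dim = Dim::RC) { llk_recip_stub(idst, dim); }

One practical note: with Option B, a Dim::RC default would make recip(idst) behave exactly like today's recip_tile(idst), so callers only opt into the vector modes explicitly.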
Motivation
Reciprocal is one of the slowest SFPU ops and is used very often (in Softmax and Layernorm), and it adds a non-trivial amount of time: in Softmax, a staggering 18% of the time is spent doing reciprocal on 32x32 tiles, even though we need the reciprocal only for the left column (32 values out of the 32x32 tile). In the main use cases, the reciprocal follows a reduce, and the result of the reduce is stored as a 32x32 tile even though only 32 of its values are relevant.
Proposal
By utilizing the existing LLK vector mode (which computes on a 32x1 column vector, Dim::C, or a 1x32 row vector, Dim::R), we can get up to 8x-16x improvement on reciprocal, or on any SFPU op where the meaningful data is effectively a vector.

LLK Support (Already there)
Work needed
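To make the hookup concrete, here is a hedged sketch of what a softmax-style call site could look like once a vector-mode variant exists. recip_col follows the initial naming proposal above, while recip_col_init, the circular-buffer names (cb_reduce_out, cb_recip_out), and the tile count are assumptions; DST acquire/release and CB wait/push bookkeeping are elided:

// Hedged sketch, not existing code: reciprocal of a reduce result where only
// the first column (32 of the 1024 values) of each 32x32 tile is meaningful.
recip_col_init();  // assumed init call, mirroring recip_tile_init()
for (uint32_t t = 0; t < num_result_tiles; ++t) {
    copy_tile(cb_reduce_out, t, /*idst=*/0);  // bring the reduce output tile into DST
    recip_col(/*idst=*/0);                    // column-vector mode: ~32 values instead of 1024
    pack_tile(/*idst=*/0, cb_recip_out);      // pack the result back out
}

The remaining work would then be the compute-API hookup itself, the ops-library plumbing that selects the vector mode, and pytests that compare the column/row variants against the full-tile recip_tile on the relevant slice.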