Open marty1885 opened 7 months ago
@davorchap, @jvasilje. This user came from the community. This seems architectural in nature; any feedback?
assigning to myself while triaging.
looping in @ttmtrajkovic
> mul -> reduce chain.
> matmul in GS can efficiently handle b=4 (4 rows in a tile)

Is there an API to get GS to do batch=4? It seems much better than what I can get from the matmul API (1/32 utilization) or the mul -> reduce chain (also very low utilization).
- This request is related to an existing case, [Feature Request] mul_tiles between a CB and a DST #6916, which asks for eltwise mul to work on an operand from DST.
Support for loading from DST into the SRC registers could be added, which would save the time spent spilling to and reloading from an intermediate buffer. However, the move from DST to SRC would have to happen per tile or, on Grayskull, per sub-tile, since the SRC registers are not as large as DST. This API doesn't exist yet in tt-metal, and adding it is not currently a priority.
- GEMV: using a native matmul on the FPU is likely faster (even at lower utilization) than a mul -> reduce chain. matmul on GS can efficiently handle b=4 (4 rows in a tile); matmul on WH can efficiently handle b=8 (8 rows in a tile). Milos can provide more insight.
Using the SFPU for any kind of matrix operation is slow; for GEMV it would be more efficient to just pad the vector to a 32x32 tile and use the `matmul_tiles` API. GS/WH can handle less than 32x32 per cycle (GS multiplies 4x16 by 16x16, WH multiplies 8x16 by 16x16), but the bottleneck in that case is moving data to the FPU: there is no data reuse, so efficiency will be low. It would still be better than any other matrix-vector multiply we could build.
There's currently no support or plan to have operands at less than 32x32.
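The padding workaround described above can be sketched host-side. This is plain C++ modeling the tile math only, not tt-metal kernel code; the 32x32 tile size matches the thread, but the function names are illustrative. Placing the vector in row 0 of an otherwise zero tile means only 1 of 32 output rows carries useful data, which is where the 1/32 utilization figure comes from.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t TILE = 32;
using Tile = std::array<std::array<float, TILE>, TILE>;

// Place a length-32 vector in row 0 of an otherwise zero tile.
Tile pad_vector_to_tile(const std::array<float, TILE>& v) {
    Tile t{};  // zero-initialized
    for (std::size_t j = 0; j < TILE; ++j) t[0][j] = v[j];
    return t;
}

// Plain 32x32 tile matmul: out = a * b. With the padded vector as `a`,
// row 0 of `out` is the GEMV result x^T * B; rows 1..31 are all zero.
Tile matmul_tile(const Tile& a, const Tile& b) {
    Tile out{};
    for (std::size_t i = 0; i < TILE; ++i)
        for (std::size_t k = 0; k < TILE; ++k)
            for (std::size_t j = 0; j < TILE; ++j)
                out[i][j] += a[i][k] * b[k][j];
    return out;
}
```

On device the same shape would be a single `matmul_tiles` call per tile pair, with 31 of the 32 result rows discarded.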
Based on the comments so far, I'll assign this P3 and can bump it if the discussion progresses.
Is your feature request related to a problem? Please describe.
I'm working on a GEMV implementation for LLM inference. Since the SFPU does not support GEMV natively, I had to build my own from tile multiplication and reduction: an eltwise `mul_tiles` followed by a reduce.
I suppose this is not efficient, as the data has to travel between the SFPU and L1 memory. The same applies to adding tiles: this would be helpful for chaining a bias add after a matrix multiplication.
Describe the solution you'd like
I'd like an API that allows me to reduce directly from the output of `mul_tiles`.
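The fused semantics being requested might look like the following. The name `mul_reduce_tiles` and its signature are hypothetical, not an existing tt-metal call, and this is host-side C++ showing the math only: the point is that the elementwise product is accumulated in one pass, so the intermediate tile never touches a buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t TILE = 32;
using Tile = std::array<std::array<float, TILE>, TILE>;

// Hypothetical fused op: eltwise-multiply two tiles and row-reduce the
// product in a single pass; out[i] = sum_j a[i][j] * b[i][j].
std::array<float, TILE> mul_reduce_tiles(const Tile& a, const Tile& b) {
    std::array<float, TILE> out{};
    for (std::size_t i = 0; i < TILE; ++i)
        for (std::size_t j = 0; j < TILE; ++j)
            out[i] += a[i][j] * b[i][j];
    return out;
}
```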