tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

concat op improvement to accept sharded data as input #10808

Closed. dvartaniansTT closed this issue 1 day ago.

dvartaniansTT commented 1 month ago

Describe the bug: The concat op is currently limiting the performance of all CNNs because it only accepts interleaved input data. This forces a sharded-to-interleaved conversion before and an interleaved-to-sharded conversion after every concat, which is very costly.

See the attached example perf sheet from the yolov4 model (a very high-priority customer model): GS_yolov4_19_07_2024.xlsx

Please note: although this is not a conv op, it is limiting CNN performance, so I'm adding it to the CNN board.

Expected behavior: The concat op takes sharded data as input, eliminating the need for sharded-to-interleaved and interleaved-to-sharded conversions before and after the concat op.
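
For context, the round-trip being eliminated looks roughly like the sketch below. This is a minimal illustration, not code from the model: the tensor names, the L1 interleaved memory config, and the concat dimension are assumptions for the sake of the example.

```python
import ttnn

def concat_with_roundtrip(x_sharded, y_sharded, out_sharded_memory_config):
    # Today: concat only accepts interleaved inputs, so each sharded
    # activation is first converted to an interleaved layout...
    x_i = ttnn.sharded_to_interleaved(x_sharded, ttnn.L1_MEMORY_CONFIG)
    y_i = ttnn.sharded_to_interleaved(y_sharded, ttnn.L1_MEMORY_CONFIG)
    # ...concatenated (here along the last dim, as in the yolov4 channel concats)...
    out_i = ttnn.concat([x_i, y_i], dim=3)
    # ...and then re-sharded for the next op. Both conversions are pure data
    # movement and account for the cost called out in the perf sheet.
    return ttnn.interleaved_to_sharded(out_i, out_sharded_memory_config)

# Desired: out = ttnn.concat([x_sharded, y_sharded], dim=3, memory_config=out_sharded_memory_config)
```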


davorchap commented 1 month ago

@dvartaniansTT can you post the specific concat config that doesn't support sharding?

There is a concat in UNet and SD, and I believe those are sharded.

dvartaniansTT commented 1 month ago

@davorchap sure, we will create unit tests and comment here with the branch and the pytest command to run them. cc: @punithsekar

tarafdarTT commented 3 weeks ago

@dvartaniansTT following up: do you have unit tests for this?

punithsekar commented 3 weeks ago

@tarafdarTT, I was not able to create a unit test for those since I was facing an issue creating the sharded input. I will check and share the concat info soon.

punithsekar commented 3 weeks ago

@tarafdarTT, I have created a branch for the pipeline where concat is used in yolov4. Check out the branch punith/yolov4_concat.

I am not sure if I am passing the correct shard_shape and core_grid to ttnn.create_sharded_memory_config for output_sharded_memory_config. Can you explain how we should specify the shard_shape?

It would be helpful if you could help make the pipeline pass with sharded concat, so I can understand it and use it in the other sub-modules of this model where concat is used.

To run the downsample1.py test file, use the command pytest /tests/ttnn/integration_tests/yolov4/test_ttnn_downsample1.py.

Device: WH-n150

tarafdarTT commented 2 weeks ago

Thanks @punithsekar, I'll have a look at this today.

sjameelTT commented 1 week ago

https://github.com/tenstorrent/tt-metal/pull/12182

@punithsekar Height sharding splits the collapsed outer 3 dimensions across a grid of cores that you pass in (the grid doesn't have to be the full device grid). Since you're concatenating along width, the shard height is unchanged from the input; only the width dimension changes, so the shard width increases. The core grid in that API isn't the physical device core grid, it's just the set of cores that store the shards of the output tensor, which should be the same as the input tensors' shard grids.
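
To make that concrete, here is a rough sketch of building the output memory config for a width (dim=3) concat of two height-sharded tensors. The 8x8 core grid, the 1x160x160 shape, and the 32-channel inputs are made-up example values rather than the actual yolov4 configuration; use whatever grid and shapes your input tensors are actually sharded with.

```python
import ttnn

core_grid = ttnn.CoreGrid(y=8, x=8)      # the cores that already hold the input shards
num_cores = core_grid.y * core_grid.x

# Height sharding splits the collapsed outer dims (N*H*W) across the cores,
# so the output shard height is the same as the inputs' shard height.
n, h, w = 1, 160, 160
shard_height = (n * h * w) // num_cores

# Concatenating on width only widens each shard: the output shard width is
# the sum of the input shard widths (e.g. 32 + 32 channels).
shard_width = 32 + 32

out_sharded_memory_config = ttnn.create_sharded_memory_config(
    (shard_height, shard_width),
    core_grid=core_grid,
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    use_height_and_width_as_shard_shape=True,
)

# out = ttnn.concat([x_sharded, y_sharded], dim=3, memory_config=out_sharded_memory_config)
```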

For user friendliness we probably want to allow a memory_config=None option, since all of this info can be derived from the input tensors @tarafdarTT. Especially because the actual coordinates are hard to derive as a user after you've put the input tensors through a lot of ops.

punithsekar commented 1 week ago

Thanks @sjameelTT, I will try it for the remaining sub-modules and let you know if I encounter any issues.

punithsekar commented 1 week ago

fyi @saichandax

mbahnasTT commented 1 day ago

Concat with sharded data is now implemented in Yolov4. Thanks!