tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[MCW] Performance optimisation of YoloV4 #12087

Open saichandax opened 3 weeks ago

saichandax commented 3 weeks ago

Executive summary: The model is implemented on n150 with the new conv API. Input dimensions: (1, 3, 320, 320). 3 maxpools still use torch (need to test the fix given in #7746). #11998 is resolved with Sankar's fix.

Initial FPS:

Module    FPS (MatMul/Conv Ops only)    FPS (Other Device Ops)    FPS (All Ops)
DS2       1841.329                      307.526                   277.922
DS3       1256.896                      477.782                   376.302
DS4       416.134                       278.574                   181.068
DS5       580.546                       622.847                   357.998
Head      460.31                        1835.496                  400.402
Neck      587.667                       622.847                   357.998

After enabling proper sharding and, where needed, reshard_if_not_optimal, we got:

Module    FPS (MatMul/Conv Ops only)    FPS (Other Device Ops)    FPS (All Ops)
DS2       1839.27                       306.443                   277.028
DS3       1256.856                      476.952                   375.764
DS4       416.179                       277.593                   1180.655
DS5       643.117                       976.24                    1451.847
Head      583.368                       1977.965                  501.543
Neck      571.545                       17.237                    16.855

After changing the datatype to bfloat8, the results are:

Module    FPS (MatMul/Conv Ops only)    FPS (Other Device Ops)    FPS (All Ops)
DS2       2119.003                      317.142                   290.356
DS3       1292.711                      487.04                    384.33
DS4       428.12                        282.164                   1184.093
DS5       649.266                       998.355                   1457.761
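
To quantify the effect of the datatype change, the MatMul/Conv-only FPS before and after the switch can be compared as a relative change. The sketch below is illustrative only: `percent_change` and `conv_fps` are hypothetical names, and the numbers are copied from the resharded and bfloat8 tables in this comment.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before * 100.0


# MatMul/Conv-only FPS copied from this thread: (after resharding, after bfloat8).
conv_fps = {
    "DS2": (1839.27, 2119.003),
    "DS3": (1256.856, 1292.711),
    "DS4": (416.179, 428.12),
    "DS5": (643.117, 649.266),
}

# Percent change per module when moving to bfloat8.
changes = {name: percent_change(b, a) for name, (b, a) in conv_fps.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

On these numbers DS2 gains about 15% in MatMul/Conv-only FPS, while DS3, DS4 and DS5 each gain under 3%.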

Note: For DS3, DS4 and DS5, PCC dropped when I changed the datatype to bfloat8.

After changing the memory config of Concat from interleaved to sharded, as mentioned in this issue, we got:

Module    FPS (MatMul/Conv Ops only)    FPS (Other Device Ops)    FPS (All Ops)
DS2       1839.27                       306.443                   277.028
DS3       1257.139                      483.318                   379.743
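
For the PCC drop noted for bfloat8, here is a minimal sketch of the check, assuming PCC means the Pearson correlation coefficient between the reference (torch) output and the device output. The `pcc` helper is a hypothetical stand-in rather than the test utility used in the repo, and the rounding step is only a crude proxy for reduced precision, not real bfloat8 behaviour.

```python
import math
import random


def pcc(reference, candidate):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(reference) != len(candidate) or len(reference) < 2:
        raise ValueError("inputs must have the same length (>= 2)")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_c = sum(candidate) / n
    cov = sum((r - mean_r) * (c - mean_c) for r, c in zip(reference, candidate))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_c = sum((c - mean_c) ** 2 for c in candidate)
    if var_r == 0 or var_c == 0:
        raise ValueError("PCC is undefined for constant input")
    return cov / math.sqrt(var_r * var_c)


random.seed(0)
# Stand-in for a flattened reference output.
reference = [random.gauss(0.0, 1.0) for _ in range(10_000)]
# Crude precision loss: snap every value to a grid of step 1/8.
candidate = [round(x * 8) / 8 for x in reference]

score = pcc(reference, candidate)
print(f"PCC after rounding: {score:.5f}")
```

Comparing this value against a chosen threshold for each module's output shows which of DS3, DS4 and DS5 lose the most accuracy after a dtype change.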

On changing the memory config of Concat from interleaved to sharded, as mentioned in this issue, we hit this issue in DS4, DS5, Head and Neck.

DS2: ds2_b8.csv, ds2_concat.csv, ds2_initial.csv, ds2_reshard.csv

DS3: initial_ds3.csv, reshard_Ds3.csv, dtype_ds3.csv, concat_ds3.csv

DS4: ds4_initial.csv, reshard_ds4.csv, ds4_dtype_bf8.csv

DS5: ds5_shard.csv, ds5_initial.csv, ds5_dtype_bf8.csv

Neck: neck_initial.csv, neck_reshard.csv

Head: head_initial.csv, head_shard.csv

saichandax commented 2 weeks ago

Need to test: #12182

saichandax commented 1 week ago

@keerthana-r-mcw, did we test #12182?