tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Kernel API clean up #5420

Open acejkov opened 8 months ago

acejkov commented 8 months ago

Based on the kernel review, here are a few items for cleanup:

tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_tile_matmul_partials_init_short_with_dt(uint32_t cbid) {
tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_tile_matmul_partials_init_short_with_dt(uint32_t old_cbid, uint32_t new_cbid) {
tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_block_matmul_partials(uint32_t icb, uint32_t start_itile, uint32_t start_idst, uint32_t ntiles

a) Remove short/long inits (we don't need to reprogram dest registers) and have only a single init per op.
b) Replace init_once with hw configs that need to be inserted once per kernel run, at the start of the kernel. This can be clearly stated in the programming guideline.
c) Review the different versions of inits (_dt, block, etc.) and merge them into a single init where doable.
d) Remove init calls which flip between row-major and col-major face layout.

davorchap commented 8 months ago

This all looks great, let's do it.

rtawfik01 commented 8 months ago

I have the changes for item 1 here: https://github.com/tenstorrent-metal/tt-metal/pull/5457

It does cause some device performance regressions to fail:

2024-02-19 06:25:17.245 | ERROR    | models.perf.device_perf_utils:check_device_perf_results:131 - bert11_BERT_LARGE-batch_12-BFLOAT8_B-SHARDED AVG DEVICE KERNEL SAMPLES/S is too slow with 396.8593, min expected 397.7.
 2024-02-19 07:26:41.263 | ERROR    | models.perf.device_perf_utils:check_device_perf_results:131 - resnet50_batch_size20_HiFi2-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_20 AVG DEVICE KERNEL SAMPLES/S is too slow with 5559.3092, min expected 5567.8.

I decreased the lower bounds of the above models to pass the pipelines.

I also reviewed the performance of optimized convs that were using colmajor matmuls to compare with rowmajor matmuls:

[image: performance comparison of optimized convs, colmajor vs. rowmajor matmuls]

There is performance degradation for Bfp8 LoFi convs that take ~10k ns. @davorchap @TT-BrianLiu, let me know if this performance degradation is acceptable and whether the change can be pushed into the pipelines.

TT-BrianLiu commented 8 months ago

@rtawfik01 seems reasonable to me

davorchap commented 8 months ago

What's the new device perf for BERT-L and RN50 lofi/bfp8_b?

rtawfik01 commented 8 months ago

> What's the new device perf for BERT-L and RN50 lofi/bfp8_b?

This is what I see for lofi/bfp8_b bert from the device performance regression:

 2024-02-23 01:38:10.347 | INFO     | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'bert11', 'Setting': 'BERT_LARGE-batch_12-BFLOAT8_B-SHARDED', 'Batch': '12', 'AVG DEVICE FW SAMPLES/S': '395.6074', 'MIN DEVICE FW SAMPLES/S': '395.5490', 'MAX DEVICE FW SAMPLES/S': '395.6615', 'AVG DEVICE KERNEL SAMPLES/S': '396.8989', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '378.3000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '401.7000', 'MIN DEVICE KERNEL SAMPLES/S': '396.8392', 'MAX DEVICE KERNEL SAMPLES/S': '396.9539', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '539.6978', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '539.5799', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '539.7987'}

and this is what I see for resnet:


2024-02-23 01:38:10.347 | INFO     | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'resnet50_batch_size16', 'Setting': 'LoFi-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_16', 'Batch': '16', 'AVG DEVICE FW SAMPLES/S': '6090.6388', 'MIN DEVICE FW SAMPLES/S': '6068.7644', 'MAX DEVICE FW SAMPLES/S': '6130.3128', 'AVG DEVICE KERNEL SAMPLES/S': '6167.9524', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '5975.2000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '6344.8000', 'MIN DEVICE KERNEL SAMPLES/S': '6145.5331', 'MAX DEVICE KERNEL SAMPLES/S': '6208.6976', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '6373.3454', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '6349.9976', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '6416.8970'}

2024-02-23 01:38:10.348 | INFO     | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'resnet50_batch_size20', 'Setting': 'LoFi-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_20', 'Batch': '20', 'AVG DEVICE FW SAMPLES/S': '6623.5572', 'MIN DEVICE FW SAMPLES/S': '6598.3691', 'MAX DEVICE FW SAMPLES/S': '6644.0216', 'AVG DEVICE KERNEL SAMPLES/S': '6701.5628', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '6469.9000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '6870.1000', 'MIN DEVICE KERNEL SAMPLES/S': '6675.8482', 'MAX DEVICE KERNEL SAMPLES/S': '6722.6755', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '6945.7269', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '6917.7846', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '6968.0341'}
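
The lower and upper thresholds in the three log lines above are numerically consistent with a band of ±3% around a single expected value per model (390, 6160 and 6670 samples/s). Those expected values and the 3% margin are inferred here from the logged thresholds; they are assumptions, not values quoted from the repository. The sketch below shows that arithmetic. The helper names `threshold_band` and `check_perf` are hypothetical and are not the actual `check_device_perf_results` implementation.

```python
def threshold_band(expected: float, margin: float = 0.03) -> tuple[float, float]:
    """Return (lower, upper) bounds at expected * (1 - margin) and expected * (1 + margin).

    Hypothetical helper: the 3% default is inferred from the logged thresholds.
    """
    return expected * (1 - margin), expected * (1 + margin)


def check_perf(measured: float, expected: float, margin: float = 0.03) -> str:
    """Classify a measured throughput (samples/s) against the band (illustrative only)."""
    lower, upper = threshold_band(expected, margin)
    if measured < lower:
        return "below lower threshold"
    if measured > upper:
        return "above upper threshold"
    return "within band"


# Expected values below are assumptions back-computed from the logs.
bert_lower, bert_upper = threshold_band(390.0)
rn50_b16_lower, rn50_b16_upper = threshold_band(6160.0)
rn50_b20_lower, rn50_b20_upper = threshold_band(6670.0)

# AVG DEVICE KERNEL SAMPLES/S values taken from the log lines above.
bert_status = check_perf(396.8989, 390.0)
rn50_b16_status = check_perf(6167.9524, 6160.0)
rn50_b20_status = check_perf(6701.5628, 6670.0)
```

Under these assumptions, all three measured averages fall inside their bands, which matches the log lines being reported at INFO rather than ERROR level.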

rtawfik01 commented 8 months ago

Hi @davorchap, after some more cleanup, I now get resnet to run 6.7-6.8k fps with lofi/bfp8 batch 20 with my changes and it no longer fails in the device perf regressions, so I no longer need to reduce the lower bounds for resnet.

For bert I still need to reduce the lower bound, since I get 398 samples/s for batch 12 bfp8 lofi, while without my changes it's at around 407 samples/s.
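
For scale, the two bert figures quoted above can be turned into a relative end-to-end slowdown. This is only a sketch of the arithmetic: `relative_slowdown` is a hypothetical helper, and both inputs are the approximate numbers from the comment, not fresh measurements.

```python
def relative_slowdown(baseline: float, measured: float) -> float:
    """Fraction of throughput lost relative to the baseline (hypothetical helper)."""
    return (baseline - measured) / baseline


# ~407 samples/s without the changes, ~398 samples/s with them (from the comment).
slowdown = relative_slowdown(407.0, 398.0)
print(f"{slowdown:.1%}")  # prints "2.2%"
```

This is a whole-model number, so it is expected to be smaller than any per-op change in the individual matmuls.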

rtawfik01 commented 8 months ago

Merged #5457. GS performance for bert will be revisited at a future date to debug why the performance of some matmuls decreased by 4-7%.

ttmtrajkovic commented 8 months ago

Hey @rtawfik01,

Could you please update the issue on what's done from this list and what's left?

rtawfik01 commented 8 months ago

The first task is done, and the second task is halfway done in this change: https://github.com/tenstorrent-metal/tt-metal/pull/5457