acejkov opened 8 months ago
This all looks great, let's do it.
I have the changes for item 1 here: https://github.com/tenstorrent-metal/tt-metal/pull/5457
It does cause some device performance regression tests to fail:
2024-02-19 06:25:17.245 | ERROR | models.perf.device_perf_utils:check_device_perf_results:131 - bert11_BERT_LARGE-batch_12-BFLOAT8_B-SHARDED AVG DEVICE KERNEL SAMPLES/S is too slow with 396.8593, min expected 397.7.
2024-02-19 07:26:41.263 | ERROR | models.perf.device_perf_utils:check_device_perf_results:131 - resnet50_batch_size20_HiFi2-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_20 AVG DEVICE KERNEL SAMPLES/S is too slow with 5559.3092, min expected 5567.8.
I decreased the lower bounds of the above models to pass the pipelines.
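For context, the device perf gate is a simple lower-bound comparison: the measured average device kernel samples/s must stay above a per-model minimum. A minimal sketch of that comparison using the two failing entries above (illustrative only; the real check is the Python `check_device_perf_results` in `models/perf/device_perf_utils.py`, and the function name below is invented):

```cpp
#include <cstdio>

// Illustrative only -- mirrors the pass/fail logic implied by the log lines
// above; the real check lives in models/perf/device_perf_utils.py.
static bool passes_device_perf(double avg_kernel_samples_per_s, double min_expected) {
    return avg_kernel_samples_per_s >= min_expected;
}

int main() {
    // The two failing entries from the logs above:
    std::printf("bert11:   %s\n", passes_device_perf(396.8593, 397.7)  ? "pass" : "FAIL");
    std::printf("resnet50: %s\n", passes_device_perf(5559.3092, 5567.8) ? "pass" : "FAIL");
}
```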
I also reviewed the performance of the optimized convs that were using col-major matmuls and compared it against row-major matmuls: there is a performance degradation for Bfp8 LoFi convs that take ~10k ns. So @davorchap @TT-BrianLiu, let me know if this performance degradation is acceptable and can be pushed into the pipelines.
@rtawfik01 seems reasonable to me
What's the new device perf for BERT-L and RN50 lofi/bfp8_b?
This is what I see for lofi/bfp8_b bert from the device performance regression:
2024-02-23 01:38:10.347 | INFO | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'bert11', 'Setting': 'BERT_LARGE-batch_12-BFLOAT8_B-SHARDED', 'Batch': '12', 'AVG DEVICE FW SAMPLES/S': '395.6074', 'MIN DEVICE FW SAMPLES/S': '395.5490', 'MAX DEVICE FW SAMPLES/S': '395.6615', 'AVG DEVICE KERNEL SAMPLES/S': '396.8989', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '378.3000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '401.7000', 'MIN DEVICE KERNEL SAMPLES/S': '396.8392', 'MAX DEVICE KERNEL SAMPLES/S': '396.9539', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '539.6978', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '539.5799', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '539.7987'}
and this is what I see for resnet:
2024-02-23 01:38:10.347 | INFO | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'resnet50_batch_size16', 'Setting': 'LoFi-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_16', 'Batch': '16', 'AVG DEVICE FW SAMPLES/S': '6090.6388', 'MIN DEVICE FW SAMPLES/S': '6068.7644', 'MAX DEVICE FW SAMPLES/S': '6130.3128', 'AVG DEVICE KERNEL SAMPLES/S': '6167.9524', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '5975.2000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '6344.8000', 'MIN DEVICE KERNEL SAMPLES/S': '6145.5331', 'MAX DEVICE KERNEL SAMPLES/S': '6208.6976', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '6373.3454', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '6349.9976', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '6416.8970'}
2024-02-23 01:38:10.348 | INFO | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'resnet50_batch_size20', 'Setting': 'LoFi-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_20', 'Batch': '20', 'AVG DEVICE FW SAMPLES/S': '6623.5572', 'MIN DEVICE FW SAMPLES/S': '6598.3691', 'MAX DEVICE FW SAMPLES/S': '6644.0216', 'AVG DEVICE KERNEL SAMPLES/S': '6701.5628', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '6469.9000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '6870.1000', 'MIN DEVICE KERNEL SAMPLES/S': '6675.8482', 'MAX DEVICE KERNEL SAMPLES/S': '6722.6755', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '6945.7269', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '6917.7846', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '6968.0341'}
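Note that the lower/upper thresholds in these dumps are consistent with a symmetric ±3% band around an expected samples/s value: for BERT-L batch 12, 390 × 0.97 = 378.3 and 390 × 1.03 = 401.7, matching the logged thresholds exactly, and likewise for both resnet entries. A short sketch of that inferred derivation (the struct and function names are invented for illustration, not the actual perf-utils API):

```cpp
#include <cstdio>

// Assumption inferred from the dumps above: thresholds = expected * (1 +/- margin),
// with margin = 0.03 matching all three logged entries exactly.
struct Thresholds {
    double lower;
    double upper;
};

static Thresholds make_thresholds(double expected_samples_per_s, double margin = 0.03) {
    return {expected_samples_per_s * (1.0 - margin),
            expected_samples_per_s * (1.0 + margin)};
}

int main() {
    const Thresholds bert = make_thresholds(390.0);   // -> 378.3 / 401.7
    const Thresholds rn50 = make_thresholds(6670.0);  // -> 6469.9 / 6870.1 (batch 20)
    std::printf("bert: %.1f %.1f\n", bert.lower, bert.upper);
    std::printf("rn50: %.1f %.1f\n", rn50.lower, rn50.upper);
}
```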
Hi @davorchap, after some more cleanup, I now get resnet to run at 6.7-6.8k fps with lofi/bfp8 batch 20 with my changes, and it no longer fails the device perf regressions, so I no longer need to reduce the lower bounds for resnet.
For bert I still need to reduce the lower bound, since I get 398 samples/s for batch 12 bfp8 lofi, while without my changes it's at around 407 samples/s.
Merged #5457. GS performance for bert will be revisited at a future date to debug why some matmuls decreased in performance by 4-7%.
Hey @rtawfik01,
Could you please update the issue on what's done from this list and what's left?
The first task is done; the second task is halfway done in this change: https://github.com/tenstorrent-metal/tt-metal/pull/5457
Based on kernel review, here are a few items for cleanup:
[x] Remove face layout from the HLK API layer and don't expose the ability to switch between col- and row-major layout. WH_B0 only supports row-major layout for all ops. Grayskull uses col-major layout only for the matmul op, to gain ~3% extra math utilization in the worst-case scenario (LoFi, bfp8), but this introduces overhead and complexity, since we need to re-program packer dest offset registers as we switch between matmul and other ops. Merged here: https://github.com/tenstorrent-metal/tt-metal/pull/5457
[x] Without col-major layout we don't need additional API calls to load partial results. The following can be removed and replaced with a data copy (see the sketch after this list):
```
tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_tile_matmul_partials_init_short_with_dt(uint32_t cbid) {
tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_tile_matmul_partials_init_short_with_dt(uint32_t old_cbid, uint32_t new_cbid) {
tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_block_matmul_partials(uint32_t icb, uint32_t start_itile, uint32_t start_idst, uint32_t ntiles
```
a) Remove short/long inits, as we don't need to reprogram dest registers and can have a single init per op.
b) Replace init_once with hw configs that need to be inserted once per kernel run, at the start of the kernel. This can be clearly stated in the programming guideline.
c) Review the different versions of inits (_dt, block, etc.) and merge them into a single init where doable.
d) Remove init calls which flip between row- and col-major face layout.
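To make the second checkbox concrete: with the col-major path gone, reloading matmul partials from a CB into dest should go through the generic data-copy path declared in the same tile_move_copy.h header. A hedged sketch, assuming the standard copy_tile/copy_tile_init calls; the CB index is a hypothetical intermediate buffer, and this is illustrative, not the committed change:

```cpp
// Compute-kernel fragment (assumes the tt-metal compute-kernel environment;
// not standalone-runnable, and dst acquire/release plus CB wait/pop calls
// are omitted for brevity).
#include "compute_kernel_api/tile_move_copy.h"

namespace NAMESPACE {
void MAIN {
    constexpr uint32_t partials_cb = 24;  // hypothetical intermediate CB holding partials

    // Before (col-major era): a dedicated init that re-programmed packer
    // dest offset registers for the matmul-partials face layout:
    //   copy_tile_matmul_partials_init_short_with_dt(partials_cb);

    // After (row-major only): the generic data-copy init suffices, since no
    // dest-offset re-programming is needed between matmul and other ops.
    copy_tile_init();
    copy_tile(partials_cb, /*in_tile_index=*/0, /*dst_tile_index=*/0);
}
}  // namespace NAMESPACE
```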