acejkov opened 8 months ago
This all looks great, let's do it.
I have the changes for item 1 here: https://github.com/tenstorrent-metal/tt-metal/pull/5457
It does cause some device performance regression tests to fail:
2024-02-19 06:25:17.245 | ERROR | models.perf.device_perf_utils:check_device_perf_results:131 - bert11_BERT_LARGE-batch_12-BFLOAT8_B-SHARDED AVG DEVICE KERNEL SAMPLES/S is too slow with 396.8593, min expected 397.7.
2024-02-19 07:26:41.263 | ERROR | models.perf.device_perf_utils:check_device_perf_results:131 - resnet50_batch_size20_HiFi2-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_20 AVG DEVICE KERNEL SAMPLES/S is too slow with 5559.3092, min expected 5567.8.
I decreased the lower bounds of the above models to pass the pipelines.
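For context, the device perf gate is a simple lower-bound comparison: the measured average device kernel samples/s must stay above a per-model minimum. A minimal sketch of that comparison using the two failing entries above (illustrative only; the real check is the Python `check_device_perf_results` in `models/perf/device_perf_utils.py`, and the function name below is invented):

```cpp
#include <cstdio>

// Illustrative only -- mirrors the pass/fail logic implied by the log lines
// above; the real check lives in models/perf/device_perf_utils.py.
static bool passes_device_perf(double avg_kernel_samples_per_s, double min_expected) {
    return avg_kernel_samples_per_s >= min_expected;
}

int main() {
    // The two failing entries from the logs above:
    std::printf("bert11:   %s\n", passes_device_perf(396.8593, 397.7)  ? "pass" : "FAIL");
    std::printf("resnet50: %s\n", passes_device_perf(5559.3092, 5567.8) ? "pass" : "FAIL");
}
```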
I also reviewed the performance of the optimized convs that were using col-major matmuls and compared it against row-major matmuls: there is a performance degradation for Bfp8 LoFi convs that take ~10k ns. So @davorchap @TT-BrianLiu, let me know if this performance degradation is acceptable and can be pushed into the pipelines.
@rtawfik01 seems reasonable to me
What's the new device perf for BERT-L and RN50 lofi/bfp8_b?
This is what I see for lofi/bfp8_b bert from the device performance regression:
2024-02-23 01:38:10.347 | INFO | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'bert11', 'Setting': 'BERT_LARGE-batch_12-BFLOAT8_B-SHARDED', 'Batch': '12', 'AVG DEVICE FW SAMPLES/S': '395.6074', 'MIN DEVICE FW SAMPLES/S': '395.5490', 'MAX DEVICE FW SAMPLES/S': '395.6615', 'AVG DEVICE KERNEL SAMPLES/S': '396.8989', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '378.3000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '401.7000', 'MIN DEVICE KERNEL SAMPLES/S': '396.8392', 'MAX DEVICE KERNEL SAMPLES/S': '396.9539', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '539.6978', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '539.5799', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '539.7987'}
and this is what I see for resnet:
2024-02-23 01:38:10.347 | INFO | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'resnet50_batch_size16', 'Setting': 'LoFi-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_16', 'Batch': '16', 'AVG DEVICE FW SAMPLES/S': '6090.6388', 'MIN DEVICE FW SAMPLES/S': '6068.7644', 'MAX DEVICE FW SAMPLES/S': '6130.3128', 'AVG DEVICE KERNEL SAMPLES/S': '6167.9524', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '5975.2000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '6344.8000', 'MIN DEVICE KERNEL SAMPLES/S': '6145.5331', 'MAX DEVICE KERNEL SAMPLES/S': '6208.6976', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '6373.3454', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '6349.9976', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '6416.8970'}
2024-02-23 01:38:10.348 | INFO | models.perf.device_perf_utils:check_device_perf_results:117 - {'Model': 'resnet50_batch_size20', 'Setting': 'LoFi-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_20', 'Batch': '20', 'AVG DEVICE FW SAMPLES/S': '6623.5572', 'MIN DEVICE FW SAMPLES/S': '6598.3691', 'MAX DEVICE FW SAMPLES/S': '6644.0216', 'AVG DEVICE KERNEL SAMPLES/S': '6701.5628', 'Lower Threshold AVG DEVICE KERNEL SAMPLES/S': '6469.9000', 'Upper Threshold AVG DEVICE KERNEL SAMPLES/S': '6870.1000', 'MIN DEVICE KERNEL SAMPLES/S': '6675.8482', 'MAX DEVICE KERNEL SAMPLES/S': '6722.6755', 'AVG DEVICE BRISC KERNEL SAMPLES/S': '6945.7269', 'MIN DEVICE BRISC KERNEL SAMPLES/S': '6917.7846', 'MAX DEVICE BRISC KERNEL SAMPLES/S': '6968.0341'}
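Note that the lower/upper thresholds in these dumps are consistent with a symmetric ±3% band around an expected samples/s value: for BERT-L batch 12, 390 × 0.97 = 378.3 and 390 × 1.03 = 401.7, matching the logged thresholds exactly, and likewise for both resnet entries. A short sketch of that inferred derivation (the struct and function names are invented for illustration, not the actual perf-utils API):

```cpp
#include <cstdio>

// Assumption inferred from the dumps above: thresholds = expected * (1 +/- margin),
// with margin = 0.03 matching all three logged entries exactly.
struct Thresholds {
    double lower;
    double upper;
};

static Thresholds make_thresholds(double expected_samples_per_s, double margin = 0.03) {
    return {expected_samples_per_s * (1.0 - margin),
            expected_samples_per_s * (1.0 + margin)};
}

int main() {
    const Thresholds bert = make_thresholds(390.0);   // -> 378.3 / 401.7
    const Thresholds rn50 = make_thresholds(6670.0);  // -> 6469.9 / 6870.1 (batch 20)
    std::printf("bert: %.1f %.1f\n", bert.lower, bert.upper);
    std::printf("rn50: %.1f %.1f\n", rn50.lower, rn50.upper);
}
```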
Hi @davorchap, after some more cleanup, I now get resnet to run at 6.7-6.8k fps with lofi/bfp8 batch 20 with my changes, and it no longer fails the device perf regressions, so I no longer need to reduce the lower bounds for resnet.
For bert I still need to reduce the lower bound, since I get 398 samples/s for batch 12 bfp8 lofi, while without my changes it's at around 407 samples/s.
Merged #5457. GS performance for bert will be revisited at a future date to debug why some matmuls decreased in performance by 4-7%.
Hey @rtawfik01,
Could you please update the issue on what's done from this list and what's left?
The first task is done; the second task is halfway done in this change: https://github.com/tenstorrent-metal/tt-metal/pull/5457
Based on kernel review, here are a few items for cleanup:
[x] Remove face layout from the HLK API layer and don't expose the ability to switch between col- and row-major layout. WH_B0 only supports row-major layout for all ops. Grayskull uses col-major layout only for the matmul op, to gain ~3% extra math utilization in the worst-case scenario (LoFi, bfp8), but this introduces overhead and complexity, since we need to re-program packer dest offset registers as we switch between matmul and other ops. Merged here: https://github.com/tenstorrent-metal/tt-metal/pull/5457
[x] Without col-major layout we don't need additional API calls to load partial results. The following can be removed and replaced with a data copy (see the sketch after this list):
```
tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_tile_matmul_partials_init_short_with_dt(uint32_t cbid) {
tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_tile_matmul_partials_init_short_with_dt(uint32_t old_cbid, uint32_t new_cbid) {
tt_metal/include/compute_kernel_api/tile_move_copy.h:ALWI void copy_block_matmul_partials(uint32_t icb, uint32_t start_itile, uint32_t start_idst, uint32_t ntiles
```
a) Remove short/long inits, as we don't need to reprogram dest registers and can have a single init per op.
b) Replace init_once with hw configs that need to be inserted once per kernel run, at the start of the kernel. This can be clearly stated in the programming guideline.
c) Review the different versions of inits (_dt, block, etc.) and merge them into a single init where doable.
d) Remove init calls which flip between row- and col-major face layout.
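To make the second checkbox concrete: with the col-major path gone, reloading matmul partials from a CB into dest should go through the generic data-copy path declared in the same tile_move_copy.h header. A hedged sketch, assuming the standard copy_tile/copy_tile_init calls; the CB index is a hypothetical intermediate buffer, and this is illustrative, not the committed change:

```cpp
// Compute-kernel fragment (assumes the tt-metal compute-kernel environment;
// not standalone-runnable, and dst acquire/release plus CB wait/pop calls
// are omitted for brevity).
#include "compute_kernel_api/tile_move_copy.h"

namespace NAMESPACE {
void MAIN {
    constexpr uint32_t partials_cb = 24;  // hypothetical intermediate CB holding partials

    // Before (col-major era): a dedicated init that re-programmed packer
    // dest offset registers for the matmul-partials face layout:
    //   copy_tile_matmul_partials_init_short_with_dt(partials_cb);

    // After (row-major only): the generic data-copy init suffices, since no
    // dest-offset re-programming is needed between matmul and other ops.
    copy_tile_init();
    copy_tile(partials_cb, /*in_tile_index=*/0, /*dst_tile_index=*/0);
}
}  // namespace NAMESPACE
```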