tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.

[LLM] Optimize BW-bound matmuls #6689

Open. cglagovich opened this issue 6 months ago

cglagovich commented 6 months ago

Top level issue for LLM matmul optimizations

Llama2

Repro

branch: cglagovich/6689

./tt_metal/tools/profiler/profile_this.py -c "pytest -svv models/demos/llama2_70b/tests/perf/test_llama_matmul_perf.py::test_decode_ff1"
./tt_metal/tools/profiler/profile_this.py -c "pytest -svv models/demos/llama2_70b/tests/perf/test_llama_matmul_perf.py::test_decode_ff2"

The FFs in Llama2 decode are memory-bandwidth bound. We measure that on main at 800 MHz, these matmuls achieve 8192 * 4096 * 1 B / 0.242429 ms ≈ 138 GB/s, short of the peak of 200 GB/s @ 1 GHz, or 160 GB/s @ 800 MHz (assuming DRAM BW is derated at lower clocks; please check).
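For reference, a quick back-of-envelope script for the numbers above (a sketch: the 1 B/element figure assumes the bfp8 weights dominate DRAM traffic, and the linear clock derating is an assumption, not a measured value):

```python
# Achieved DRAM BW for the decode FF matmul, using the figures quoted above.
K, N = 8192, 4096            # weight dims for the measured matmul
weight_bytes = K * N * 1     # ~1 byte/element for bfp8 weights (assumption)
duration_s = 0.242429e-3     # measured op duration on main @ 800 MHz

print(f"achieved: {weight_bytes / duration_s / 1e9:.0f} GB/s")  # ~138 GB/s

# Assumption: peak DRAM BW derates linearly with AICLK.
peak_1ghz = 200.0
print(f"assumed peak @ 800 MHz: {peak_1ghz * 800 / 1000:.0f} GB/s")  # 160 GB/s
```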

I'm adding unit tests to main which use the exact input shapes, shard configs, and program configs that we use in the Llama demo.

cglagovich commented 6 months ago

I generated profiles with these commands on branch cglagovich/6689.

./tt_metal/tools/profiler/profile_this.py -c "pytest -svv models/demos/llama2_70b/tests/perf/test_llama_matmul_perf.py::test_decode_ff1"

ops_perf_results_2024_03_22_13_32_20.csv

./tt_metal/tools/profiler/profile_this.py -c "pytest -svv models/demos/llama2_70b/tests/perf/test_llama_matmul_perf.py::test_decode_ff2"

ops_perf_results_2024_03_22_13_33_10.csv

yieldthought commented 6 months ago

Mistral matmul test: https://github.com/tenstorrent-metal/tt-metal/blob/mistral-wh/models/demos/mistral7b/tests/test_mistral_mlp.py

Shapes: 32x4096x14336 and 32x14336x4096 (M x K x N).
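These skinny decode shapes make the BW-bound behavior easy to see from arithmetic intensity. A minimal sketch (the weights-dominate-traffic assumption and the 1 B/element bfp8 size are mine, not from the test):

```python
# Arithmetic intensity of a skinny decode matmul: with M = 32, each weight
# element is reused only 32 times, so intensity is just 2*M FLOP per weight byte.
def intensity(M, K, N, bytes_per_weight=1.0):  # ~1 B/elem for bfp8 (assumption)
    flops = 2 * M * K * N                   # one multiply-accumulate = 2 FLOPs
    dram_bytes = K * N * bytes_per_weight   # activations are tiny by comparison
    return flops / dram_bytes

for M, K, N in [(32, 4096, 14336), (32, 14336, 4096)]:
    print(f"{M}x{K}x{N}: {intensity(M, K, N):.0f} FLOP/byte")  # 64 for both
```

At 64 FLOP/byte, these matmuls sit well below the point where compute, rather than DRAM bandwidth, becomes the limit.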

cglagovich commented 6 months ago

Adding prefill MLP for 2k seqlen. Inputs are DRAM interleaved; for FF1 and FF3 we output block sharded. Perf is poor: 30% to 50% math utilization at HiFi2.

./tt_metal/tools/profiler/profile_this.py -c "pytest -svv models/demos/llama2_70b/tests/perf/test_llama_matmul_perf.py::test_prefill_2k"

2k_mlp_perf.csv
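For triaging rows like these, here is a minimal sketch that estimates per-op utilization from the profiler CSV, assuming utilization ≈ PM IDEAL / DEVICE KERNEL DURATION (column names match the CSV header; PM IDEAL's fidelity assumptions may not match HiFi2 exactly, so treat the ratio as relative, not absolute):

```python
import csv

# Rough per-op utilization from an ops_perf CSV such as 2k_mlp_perf.csv.
with open("2k_mlp_perf.csv") as f:
    for row in csv.DictReader(f):
        ideal = float(row["PM IDEAL [ns]"])
        measured = float(row["DEVICE KERNEL DURATION [ns]"])
        print(f"{row['OP CODE']}: {100 * ideal / measured:.1f}%")
```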

cglagovich commented 5 months ago

I did some experiments and found that FF2, the last row of the CSV below, runs as fast as the first two matmuls if I remove the in0 reads from reader_bmm_tile_layout_in0_sender_padding.cpp. This suggests the op is limited by DRAM bandwidth on in0.

OP CODE,OP TYPE,GLOBAL CALL COUNT,DEVICE ID,ATTRIBUTES,MATH FIDELITY,CORE COUNT,PARALLELIZATION STRATEGY,HOST START TS,HOST END TS,HOST DURATION [ns],DEVICE FW START CYCLE,DEVICE FW END CYCLE,DEVICE FW DURATION [ns],DEVICE KERNEL DURATION [ns],DEVICE BRISC KERNEL DURATION [ns],DEVICE NCRISC KERNEL DURATION [ns],DEVICE TRISC0 KERNEL DURATION [ns],DEVICE TRISC1 KERNEL DURATION [ns],DEVICE TRISC2 KERNEL DURATION [ns],DEVICE ERISC KERNEL DURATION [ns],DEVICE COMPUTE CB WAIT FRONT [ns],DEVICE COMPUTE CB RESERVE BACK [ns],INPUT_0_DTYPE,INPUT_0_LAYOUT,INPUT_0_SHAPE,INPUT_0_STORAGE_TYPE,INPUT_1_DTYPE,INPUT_1_LAYOUT,INPUT_1_SHAPE,INPUT_1_STORAGE_TYPE,OUTPUT_0_DTYPE,OUTPUT_0_LAYOUT,OUTPUT_0_SHAPE,OUTPUT_0_STORAGE_TYPE,COMPUTE KERNEL PATH,COMPUTE KERNEL HASH,DATA MOVEMENT KERNEL PATH,DATA MOVEMENT KERNEL HASH,PM IDEAL [ns],PM COMPUTE [ns],PM BANDWIDTH [ns],PM REQ I BW,PM REQ O BW
tt::operations::primary::Matmul,tt_dnn_device,0,0,{'compute_kernel_config': 'tt::tt_metal::WormholeComputeKernelConfig(math_fidelity=HiFi2;math_approx_mode=1;fp32_dest_acc_en=1;packer_l1_acc=1)'; 'output_dtype': 'DataType::BFLOAT16'; 'output_mem_config': 'tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED;buffer_type=BufferType::L1;shard_spec=tt::tt_metal::ShardSpec(shard_grid={[(x=0;y=0) - (x=7;y=7)]}; shard_shape={256; 512}; shard_orientation=ShardOrientation::ROW_MAJOR; halo=false))'; 'program_config': 'tt::operations::primary::MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8;y=8);in0_block_w=4;out_subblock_h=1;out_subblock_w=4;per_core_M=8;per_core_N=16;transpose_mcast=0;fused_activation=tt::tt_metal::UnaryWithParam(op_type=UnaryOpType::SILU;param=std::nullopt))'; 'untilize_out': 'false'},HiFi2,64,MatmulParallelizationStrategy::MULTI_CORE_REUSE_MCAST_2D_OPTIMIZED,33976734393,35651437007,1674702614,199285923995,199287424374,1876671,1876291,1875320,1517864,1865296,1875925,1875474,,0,0,BFLOAT16,TILE,{'W': 1; 'X': 8192; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT8_B,TILE,{'W': 1; 'X': 4096; 'Y': 8192; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},['tt_eager/tt_dnn/op_library/bmm/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp'],['bmm_large_block_zm_fused_bias_activation/17719509150936609826/'],['tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'],['reader_bmm_tile_layout_in0_sender_padding/493453508316899467/'; 'reader_bmm_tile_layout_in1_sender_writer_padding/17590841348518414052/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/8284160571900406338/'; 'reader_bmm_tile_layout_in0_receiver/7410396546790405071/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/6047920837611155654/'; 'reader_bmm_tile_layout_in0_receiver/7735611613369976534/'],1048576,1048576,121124,[32.0; 32.0],[16.0]
tt::operations::primary::Matmul,tt_dnn_device,1,0,{'compute_kernel_config': 'tt::tt_metal::WormholeComputeKernelConfig(math_fidelity=HiFi2;math_approx_mode=1;fp32_dest_acc_en=1;packer_l1_acc=1)'; 'output_dtype': 'DataType::BFLOAT16'; 'output_mem_config': 'tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED;buffer_type=BufferType::L1;shard_spec=tt::tt_metal::ShardSpec(shard_grid={[(x=0;y=0) - (x=7;y=7)]}; shard_shape={256; 512}; shard_orientation=ShardOrientation::ROW_MAJOR; halo=false))'; 'program_config': 'tt::operations::primary::MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8;y=8);in0_block_w=4;out_subblock_h=1;out_subblock_w=4;per_core_M=8;per_core_N=16;transpose_mcast=0;fused_activation=std::nullopt)'; 'untilize_out': 'false'},HiFi2,64,MatmulParallelizationStrategy::MULTI_CORE_REUSE_MCAST_2D_OPTIMIZED,35651564377,38634950915,2983386538,201622405705,201623663186,1573065,1572698,1571695,1518088,1571100,1571940,1571851,,0,0,BFLOAT16,TILE,{'W': 1; 'X': 8192; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT8_B,TILE,{'W': 1; 'X': 4096; 'Y': 8192; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},['tt_eager/tt_dnn/op_library/bmm/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp'],['bmm_large_block_zm_fused_bias_activation/1856932064427074894/'],['tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'],['reader_bmm_tile_layout_in0_sender_padding/493453508316899467/'; 'reader_bmm_tile_layout_in1_sender_writer_padding/17590841348518414052/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/8284160571900406338/'; 'reader_bmm_tile_layout_in0_receiver/7410396546790405071/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/6047920837611155654/'; 'reader_bmm_tile_layout_in0_receiver/7735611613369976534/'],1048576,1048576,121124,[32.0; 32.0],[16.0]
tt::tt_metal::EltwiseBinary,tt_dnn_device,2,0,{'fused_activations': 'std::nullopt'; 'op_type': 'BinaryOpType::MUL'; 'output_dtype': 'DataType::BFLOAT16'; 'output_mem_config': 'tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED;buffer_type=BufferType::L1;shard_spec=tt::tt_metal::ShardSpec(shard_grid={[(x=0;y=0) - (x=7;y=7)]}; shard_shape={256; 512}; shard_orientation=ShardOrientation::ROW_MAJOR; halo=false))'},HiFi4,64,BinaryOpParallelizationStrategy::MULTI_CORE,38635062305,43706565486,5071503181,205596694126,205596706216,16226,15896,15140,411,14602,15060,15112,,0,0,BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},['tt_metal/kernels/compute/eltwise_binary.cpp'],['eltwise_binary/15461839133017137503/'],['tt_eager/tt_dnn/op_library/eltwise_binary/kernels/dataflow/reader_binary_interleaved_start_id.cpp'; 'tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp'],['reader_binary_interleaved_start_id/14113562297975412996/'; 'writer_unary_interleaved_start_id/17700749181845438146/'],3883,3883,1,[4320.68408203125; 4320.68408203125],[4320.68408203125]
tt::operations::primary::Matmul,tt_dnn_device,3,0,{'compute_kernel_config': 'tt::tt_metal::WormholeComputeKernelConfig(math_fidelity=HiFi2;math_approx_mode=1;fp32_dest_acc_en=1;packer_l1_acc=1)'; 'output_dtype': 'DataType::BFLOAT16'; 'output_mem_config': 'tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED;buffer_type=BufferType::L1;shard_spec=tt::tt_metal::ShardSpec(shard_grid={[(x=0;y=0) - (x=7;y=7)]}; shard_shape={256; 128}; shard_orientation=ShardOrientation::ROW_MAJOR; halo=false))'; 'program_config': 'tt::operations::primary::MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8;y=8);in0_block_w=4;out_subblock_h=1;out_subblock_w=4;per_core_M=8;per_core_N=4;transpose_mcast=0;fused_activation=std::nullopt)'; 'untilize_out': 'false'},HiFi2,64,MatmulParallelizationStrategy::MULTI_CORE_REUSE_MCAST_2D_OPTIMIZED,43706806075,48890803661,5183997586,209655932592,209658234997,2879204,2878822,2877835,2867638,2877250,2878064,2878006,,0,0,BFLOAT16,TILE,{'W': 1; 'X': 32768; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT8_B,TILE,{'W': 1; 'X': 1024; 'Y': 32768; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT16,TILE,{'W': 1; 'X': 1024; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},['tt_eager/tt_dnn/op_library/bmm/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp'],['bmm_large_block_zm_fused_bias_activation/6468923197366004279/'],['tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'],['reader_bmm_tile_layout_in0_sender_padding/14282349652643983886/'; 'reader_bmm_tile_layout_in1_sender_writer_padding/3343544542602451325/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/6008783467348593219/'; 'reader_bmm_tile_layout_in0_receiver/16825977614607082641/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/708227015693400764/'; 'reader_bmm_tile_layout_in0_receiver/6196945722367889341/'],1048576,1048576,484496,[128.0; 32.0],[4.0]
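A rough traffic estimate consistent with that finding, using the shapes from the FF2 row above (the read-once-per-operand assumption under mcast and the ~1.0625 B/element bfp8 size are my assumptions):

```python
# Estimated DRAM read traffic for FF2, assuming mcast lets each operand be
# read from DRAM roughly once. in0 is BFLOAT16 activations; in1 is bfp8
# weights (1 B mantissa per element plus a shared exponent, ~1.0625 B/elem).
MiB = 1 << 20
in0_bytes = 2048 * 32768 * 2        # activations, BFLOAT16
in1_bytes = 32768 * 1024 * 1.0625   # weights, BFLOAT8_B
print(f"in0: {in0_bytes / MiB:.0f} MiB, in1: {in1_bytes / MiB:.0f} MiB")
# in0: 128 MiB, in1: 34 MiB -> in0 is ~4x the weight traffic, so removing the
# in0 reads plausibly shifts the op off the DRAM BW limit.
```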
cglagovichTT commented 5 months ago

How is this progressing? Are there any perf improvements the models teams can pick up?

tt-rkim commented 4 months ago

Note that a new acceptance criterion is that this change must pass without non-deterministic (ND) hangs.

A good signal for this is pytest tests/ttnn/integration_tests/bert/test_performance.py::test_performance[bert=models.demos.bert.tt.ttnn_optimized_sharded_bert-sequence_size=384-model_name=phiyodr/bert-large-finetuned-squad2] run in a loop on Grayskull.

Please test locally whenever possible. Otherwise this will block other devs' machines, requiring host reboots.
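A minimal sketch for looping that test locally (the iteration count and timeout are arbitrary choices, not project policy):

```python
import subprocess

# Run the BERT performance signal test repeatedly; stop on the first failure
# or on a timeout that likely indicates a hang.
TEST = (
    "tests/ttnn/integration_tests/bert/test_performance.py::test_performance"
    "[bert=models.demos.bert.tt.ttnn_optimized_sharded_bert"
    "-sequence_size=384-model_name=phiyodr/bert-large-finetuned-squad2]"
)

for i in range(10):
    print(f"iteration {i}")
    subprocess.run(["pytest", "-svv", TEST], check=True, timeout=1800)
```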

cc: @jliangTT @TT-billteng @tt-aho @mo-tenstorrent

tt-rkim commented 4 months ago

With the first set of changes for this issue, which we later reverted, we expected to see the hang on a Grayskull (GS) cloud bare-metal machine within 3 iterations of the test.

It required a reboot to reset the card.

yugaoTT commented 3 months ago

@cglagovich the matmul with DRAM-sharded weights has merged; there are examples in tests/ttnn/unit_tests/operations/test_experimental.py and tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py. For Llama3 with bfp8 weights, we should divide in0_block_w by 4 for qkv, ff1, and ff3, and by 2 for ff2. A sketch follows below.
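A hypothetical sketch of that in0_block_w adjustment (the config class name follows the DRAM-sharded matmul path; base_in0_block_w, per_core_M, and per_core_N here are illustrative values, so copy the real ones from test_matmul_dram_sharded.py):

```python
import ttnn

# Illustrative only: quarter/halve in0_block_w relative to a non-sharded config
# when switching to DRAM-sharded bfp8 weights, per the guidance above.
base_in0_block_w = 32  # hypothetical value from an existing config

ff1_ff3_qkv_config = ttnn.MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig(
    in0_block_w=base_in0_block_w // 4,  # qkv/ff1/ff3 with bfp8 weights: divide by 4
    per_core_M=1,                       # decode: M = 32 -> one tile row
    per_core_N=4,                       # hypothetical output split
    fused_activation=None,
)

ff2_config = ttnn.MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig(
    in0_block_w=base_in0_block_w // 2,  # ff2 with bfp8 weights: divide by 2
    per_core_M=1,
    per_core_N=2,
    fused_activation=None,
)
```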