cglagovich opened this issue 6 months ago
I generated profiles with these commands on branch cglagovich/6689.
./tt_metal/tools/profiler/profile_this.py -c "pytest -svv models/demos/llama2_70b/tests/perf/test_llama_matmul_perf.py::test_decode_ff1"
ops_perf_results_2024_03_22_13_32_20.csv
./tt_metal/tools/profiler/profile_this.py -c "pytest -svv models/demos/llama2_70b/tests/perf/test_llama_matmul_perf.py::test_decode_ff2"
Mistral matmul test: https://github.com/tenstorrent-metal/tt-metal/blob/mistral-wh/models/demos/mistral7b/tests/test_mistral_mlp.py
32x4096x14336 and 32x14336x4096
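As a sanity check on whether decode matmuls of these shapes are weight-streaming bound, a rough lower bound on op time can be computed from the weight footprint alone. This is a sketch, assuming ~1 B/element for bfp8 and the 160 GB/s derated-peak guess that appears later in this thread; neither number is a measured value:

```python
# Lower bound on decode matmul time if streaming the weight matrix from
# DRAM is the bottleneck. Uses the FF1-style weight shape from the test
# shapes above (K=4096, N=14336), ~1 B/element for bfp8 (approximation),
# and an assumed 160 GB/s derated peak DRAM bandwidth.
K, N = 4096, 14336
weight_bytes = K * N * 1          # ~58.7 MB of weights per matmul
peak_bw_bytes_per_s = 160e9       # assumed derated peak at 800 MHz
floor_ms = weight_bytes / peak_bw_bytes_per_s * 1e3
print(f"weight-stream floor: {floor_ms:.3f} ms")  # ~0.367 ms
```

Any measured time well above this floor points at something other than raw weight bandwidth; times near it confirm the matmul is memory bound.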
Adding prefill MLP for 2k seqlen. Inputs are interleaved. For FF1 and FF3, the outputs are block sharded. Perf is poor: 30% to 50% math util at HiFi2.
./tt_metal/tools/profiler/profile_this.py -c "pytest -svv models/demos/llama2_70b/tests/perf/test_llama_matmul_perf.py::test_prefill_2k"
I did some experiments and found that my FF2, the last row of this CSV, runs as fast as the first two matmuls if I remove in0 reading from reader_bmm_tile_layout_in0_sender_padding.cpp. This suggests it is a DRAM bandwidth problem on in0.
OP CODE,OP TYPE,GLOBAL CALL COUNT,DEVICE ID,ATTRIBUTES,MATH FIDELITY,CORE COUNT,PARALLELIZATION STRATEGY,HOST START TS,HOST END TS,HOST DURATION [ns],DEVICE FW START CYCLE,DEVICE FW END CYCLE,DEVICE FW DURATION [ns],DEVICE KERNEL DURATION [ns],DEVICE BRISC KERNEL DURATION [ns],DEVICE NCRISC KERNEL DURATION [ns],DEVICE TRISC0 KERNEL DURATION [ns],DEVICE TRISC1 KERNEL DURATION [ns],DEVICE TRISC2 KERNEL DURATION [ns],DEVICE ERISC KERNEL DURATION [ns],DEVICE COMPUTE CB WAIT FRONT [ns],DEVICE COMPUTE CB RESERVE BACK [ns],INPUT_0_DTYPE,INPUT_0_LAYOUT,INPUT_0_SHAPE,INPUT_0_STORAGE_TYPE,INPUT_1_DTYPE,INPUT_1_LAYOUT,INPUT_1_SHAPE,INPUT_1_STORAGE_TYPE,OUTPUT_0_DTYPE,OUTPUT_0_LAYOUT,OUTPUT_0_SHAPE,OUTPUT_0_STORAGE_TYPE,COMPUTE KERNEL PATH,COMPUTE KERNEL HASH,DATA MOVEMENT KERNEL PATH,DATA MOVEMENT KERNEL HASH,PM IDEAL [ns],PM COMPUTE [ns],PM BANDWIDTH [ns],PM REQ I BW,PM REQ O BW
tt::operations::primary::Matmul,tt_dnn_device,0,0,{'compute_kernel_config': 'tt::tt_metal::WormholeComputeKernelConfig(math_fidelity=HiFi2;math_approx_mode=1;fp32_dest_acc_en=1;packer_l1_acc=1)'; 'output_dtype': 'DataType::BFLOAT16'; 'output_mem_config': 'tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED;buffer_type=BufferType::L1;shard_spec=tt::tt_metal::ShardSpec(shard_grid={[(x=0;y=0) - (x=7;y=7)]}; shard_shape={256; 512}; shard_orientation=ShardOrientation::ROW_MAJOR; halo=false))'; 'program_config': 'tt::operations::primary::MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8;y=8);in0_block_w=4;out_subblock_h=1;out_subblock_w=4;per_core_M=8;per_core_N=16;transpose_mcast=0;fused_activation=tt::tt_metal::UnaryWithParam(op_type=UnaryOpType::SILU;param=std::nullopt))'; 'untilize_out': 'false'},HiFi2,64,MatmulParallelizationStrategy::MULTI_CORE_REUSE_MCAST_2D_OPTIMIZED,33976734393,339767343931674702614,1674702614,199285923995,199287424374,1876671,1876291,1875320,1517864,1865296,1875925,1875474,,0,0,BFLOAT16,TILE,{'W': 1; 'X': 8192; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT8_B,TILE,{'W': 1; 'X': 4096; 'Y': 8192; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},['tt_eager/tt_dnn/op_library/bmm/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp'],['bmm_large_block_zm_fused_bias_activation/17719509150936609826/'],['tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 
'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'],['reader_bmm_tile_layout_in0_sender_padding/493453508316899467/'; 'reader_bmm_tile_layout_in1_sender_writer_padding/17590841348518414052/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/8284160571900406338/'; 'reader_bmm_tile_layout_in0_receiver/7410396546790405071/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/6047920837611155654/'; 'reader_bmm_tile_layout_in0_receiver/7735611613369976534/'],1048576,1048576,121124,[32.0; 32.0],[16.0]
tt::operations::primary::Matmul,tt_dnn_device,1,0,{'compute_kernel_config': 'tt::tt_metal::WormholeComputeKernelConfig(math_fidelity=HiFi2;math_approx_mode=1;fp32_dest_acc_en=1;packer_l1_acc=1)'; 'output_dtype': 'DataType::BFLOAT16'; 'output_mem_config': 'tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED;buffer_type=BufferType::L1;shard_spec=tt::tt_metal::ShardSpec(shard_grid={[(x=0;y=0) - (x=7;y=7)]}; shard_shape={256; 512}; shard_orientation=ShardOrientation::ROW_MAJOR; halo=false))'; 'program_config': 'tt::operations::primary::MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8;y=8);in0_block_w=4;out_subblock_h=1;out_subblock_w=4;per_core_M=8;per_core_N=16;transpose_mcast=0;fused_activation=std::nullopt)'; 'untilize_out': 'false'},HiFi2,64,MatmulParallelizationStrategy::MULTI_CORE_REUSE_MCAST_2D_OPTIMIZED,35651564377,356515643772983386538,2983386538,201622405705,201623663186,1573065,1572698,1571695,1518088,1571100,1571940,1571851,,0,0,BFLOAT16,TILE,{'W': 1; 'X': 8192; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT8_B,TILE,{'W': 1; 'X': 4096; 'Y': 8192; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},['tt_eager/tt_dnn/op_library/bmm/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp'],['bmm_large_block_zm_fused_bias_activation/1856932064427074894/'],['tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'; 
'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'],['reader_bmm_tile_layout_in0_sender_padding/493453508316899467/'; 'reader_bmm_tile_layout_in1_sender_writer_padding/17590841348518414052/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/8284160571900406338/'; 'reader_bmm_tile_layout_in0_receiver/7410396546790405071/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/6047920837611155654/'; 'reader_bmm_tile_layout_in0_receiver/7735611613369976534/'],1048576,1048576,121124,[32.0; 32.0],[16.0]
tt::tt_metal::EltwiseBinary,tt_dnn_device,2,0,{'fused_activations': 'std::nullopt'; 'op_type': 'BinaryOpType::MUL'; 'output_dtype': 'DataType::BFLOAT16'; 'output_mem_config': 'tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED;buffer_type=BufferType::L1;shard_spec=tt::tt_metal::ShardSpec(shard_grid={[(x=0;y=0) - (x=7;y=7)]}; shard_shape={256; 512}; shard_orientation=ShardOrientation::ROW_MAJOR; halo=false))'},HiFi4,64,BinaryOpParallelizationStrategy::MULTI_CORE,38635062305,386350623055071503181,5071503181,205596694126,205596706216,16226,15896,15140,411,14602,15060,15112,,0,0,BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},BFLOAT16,TILE,{'W': 1; 'X': 4096; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},['tt_metal/kernels/compute/eltwise_binary.cpp'],['eltwise_binary/15461839133017137503/'],['tt_eager/tt_dnn/op_library/eltwise_binary/kernels/dataflow/reader_binary_interleaved_start_id.cpp'; 'tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp'],['reader_binary_interleaved_start_id/14113562297975412996/'; 'writer_unary_interleaved_start_id/17700749181845438146/'],3883,3883,1,[4320.68408203125; 4320.68408203125],[4320.68408203125]
tt::operations::primary::Matmul,tt_dnn_device,3,0,{'compute_kernel_config': 'tt::tt_metal::WormholeComputeKernelConfig(math_fidelity=HiFi2;math_approx_mode=1;fp32_dest_acc_en=1;packer_l1_acc=1)'; 'output_dtype': 'DataType::BFLOAT16'; 'output_mem_config': 'tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::BLOCK_SHARDED;buffer_type=BufferType::L1;shard_spec=tt::tt_metal::ShardSpec(shard_grid={[(x=0;y=0) - (x=7;y=7)]}; shard_shape={256; 128}; shard_orientation=ShardOrientation::ROW_MAJOR; halo=false))'; 'program_config': 'tt::operations::primary::MatmulMultiCoreReuseMultiCastProgramConfig(compute_with_storage_grid_size=(x=8;y=8);in0_block_w=4;out_subblock_h=1;out_subblock_w=4;per_core_M=8;per_core_N=4;transpose_mcast=0;fused_activation=std::nullopt)'; 'untilize_out': 'false'},HiFi2,64,MatmulParallelizationStrategy::MULTI_CORE_REUSE_MCAST_2D_OPTIMIZED,43706806075,437068060755183997586,5183997586,209655932592,209658234997,2879204,2878822,2877835,2867638,2877250,2878064,2878006,,0,0,BFLOAT16,TILE,{'W': 1; 'X': 32768; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT8_B,TILE,{'W': 1; 'X': 1024; 'Y': 32768; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'DRAM'; 'memory_layout': 'INTERLEAVED'}},BFLOAT16,TILE,{'W': 1; 'X': 1024; 'Y': 2048; 'Z': 1},{'device_id': 0; 'memory_config': {'buffer_type': 'L1'; 'memory_layout': 'BLOCK_SHARDED'}},['tt_eager/tt_dnn/op_library/bmm/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp'],['bmm_large_block_zm_fused_bias_activation/6468923197366004279/'],['tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'; 
'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp'; 'tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_receiver.cpp'],['reader_bmm_tile_layout_in0_sender_padding/14282349652643983886/'; 'reader_bmm_tile_layout_in1_sender_writer_padding/3343544542602451325/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/6008783467348593219/'; 'reader_bmm_tile_layout_in0_receiver/16825977614607082641/'; 'reader_bmm_tile_layout_in1_receiver_writer_padding/708227015693400764/'; 'reader_bmm_tile_layout_in0_receiver/6196945722367889341/'],1048576,1048576,484496,[128.0; 32.0],[4.0]
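To make the CSV rows above easier to compare, one rough utilization figure is `PM IDEAL [ns]` over the measured `DEVICE KERNEL DURATION [ns]`. A minimal sketch with the three matmul rows condensed by hand; the FF1/FF3/FF2 labels are inferred from this thread (FF1 has the fused SILU, FF2 is the last matmul row), not taken from the CSV itself:

```python
# (PM IDEAL [ns], DEVICE KERNEL DURATION [ns]) copied from the rows above.
rows = {
    "FF1": (1048576, 1876291),
    "FF3": (1048576, 1572698),
    "FF2": (1048576, 2878822),
}
util = {op: ideal / measured for op, (ideal, measured) in rows.items()}
for op, u in util.items():
    print(f"{op}: {u:.0%}")
# FF1: 56%
# FF3: 67%
# FF2: 36%
```

FF2's much lower figure is consistent with the in0 DRAM bandwidth hypothesis above.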
How is this progressing? Are there any perf improvements the models teams can pick up?
Note that a new acceptance criterion is that this test must pass without ND hangs. A good signal for this is pytest tests/ttnn/integration_tests/bert/test_performance.py::test_performance[bert=models.demos.bert.tt.ttnn_optimized_sharded_bert-sequence_size=384-model_name=phiyodr/bert-large-finetuned-squad2]
run in a loop on Grayskull.
Please test locally whenever possible. Otherwise this will block other devs' machines, requiring host reboots.
cc: @jliangTT @TT-billteng @tt-aho @mo-tenstorrent
With the first set of changes for this issue, which we later reverted, we expected to see the hang on a GS cloud BM within 3 iterations of the test. It required a host reboot to reset the card.
@cglagovich the matmul with DRAM-sharded weights has merged; there are examples in tests/ttnn/unit_tests/operations/test_experimental.py
and tests/tt_eager/python_api_testing/unit_testing/misc/test_matmul_dram_sharded.py.
For Llama3 with bfp8 weights, we should divide in0_block_w by 4 for QKV, FF1, and FF3, and by 2 for FF2.
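The guidance above can be sketched as a small helper. This is a hypothetical illustration, not an API from the repo; the baseline values and the `op` names are assumptions for the example:

```python
# Hypothetical helper: shrink in0_block_w for the DRAM-sharded matmul
# with bfp8 weights, following the per-op divisors suggested above
# (QKV/FF1/FF3 divide by 4, FF2 divides by 2).
DIVISORS = {"qkv": 4, "ff1": 4, "ff3": 4, "ff2": 2}

def adjusted_in0_block_w(baseline: int, op: str) -> int:
    d = DIVISORS[op]
    assert baseline % d == 0, "in0_block_w must remain an integer"
    return baseline // d

print(adjusted_in0_block_w(16, "ff1"))  # 4
print(adjusted_in0_block_w(16, "ff2"))  # 8
```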
Top level issue for LLM matmul optimizations
Llama2
Repro
branch: cglagovich/6689
The FFs in Llama2 decode are memory bandwidth bound. We measure that on main at 800 MHz, these matmuls are getting
8192 * 4096 * 1 B / 0.242429 ms = 138 GB/s,
short of peak, which is 200 GB/s @ 1 GHz, or 160 GB/s @ 800 MHz (guessing that DRAM BW gets derated at lower clocks; please check).
I'm adding unit tests to main which use the exact input shapes, shard configs, and program configs that we use in the Llama demo.
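Written out, the bandwidth arithmetic above is (the 800 MHz derating is the same guess as in the text, assuming linear scaling with clock):

```python
# Achieved DRAM read bandwidth for one 8192 x 4096 weight matrix at
# ~1 B/element (bfp8) over the measured 0.242429 ms matmul duration.
weight_bytes = 8192 * 4096 * 1
duration_s = 0.242429e-3
achieved_gb_s = weight_bytes / duration_s / 1e9
print(f"achieved: {achieved_gb_s:.1f} GB/s")  # ~138.4 GB/s

# Guessed linear derating of the 200 GB/s @ 1 GHz peak down to 800 MHz.
derated_peak_gb_s = 200 * 800 / 1000
print(f"derated peak: {derated_peak_gb_s:.0f} GB/s")  # 160 GB/s
```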