rtawfik01 opened this issue 3 weeks ago
Adding another example of a failing test here:
tests/tt_eager/python_api_testing/unit_testing/misc/test_optimized_conv_v2.py::test_optimized_conv_v2
It fails with a grid size assertion:
python3: ../ttnn/cpp/ttnn/operations/conv/conv2d/device/multi_core_optimized_conv_sharded/optimized_conv_op_sharded_v2.cpp:456: operation::ProgramWithCallbacks ttnn::operations::conv::conv2d::multi_core_optimized_conv_sharded_v2_impl(tt_metal::Program &, const Tensor &, const Tensor &, const Shape &, std::optional<const Tensor>, const std::optional<const Tensor>, vector<int>, uint32_t, bool, bool, bool, const OptimizedConvParallelizationConfig &, const OptimizedConvBlockConfig &, uint32_t, bool, bool, Tensor &, DeviceComputeKernelConfig, bool, bool, bool): Assertion `num_cores_x < 13' failed.
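The `num_cores_x < 13` assertion hardcodes a Wormhole-era bound, which a wider Blackhole grid trips. A hedged sketch of validating against the device's actual grid instead of a constant (the function name and grid tuples are illustrative assumptions, not the tt-metal API):

```python
def validate_conv_grid(num_cores_x: int, num_cores_y: int,
                       compute_grid: tuple[int, int]) -> None:
    # Replace a hardcoded bound like `num_cores_x < 13` with one derived
    # from the device's compute grid, given here as (cols, rows).
    grid_x, grid_y = compute_grid
    if num_cores_x > grid_x or num_cores_y > grid_y:
        raise ValueError(
            f"requested {num_cores_x}x{num_cores_y} cores exceeds "
            f"device grid {grid_x}x{grid_y}"
        )
```

With a device-derived bound the same op code can parallelize over whatever grid the device reports, rather than asserting on one architecture's dimensions.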
More failures related to BH grid/alignment:
def copy_output_from_device(self, output_d: ttnn.Tensor):
interleaved_mem_config = ttnn.MemoryConfig(ttnn.TensorMemoryLayout.INTERLEAVED, ttnn.BufferType.DRAM)
> output_d = ttl.tensor.sharded_to_interleaved(output_d, interleaved_mem_config)
E RuntimeError: TT_FATAL @ ../ttnn/cpp/ttnn/deprecated/tt_dnn/op_library/sharded/sharded_op.cpp:33: (*input_tensor.memory_config().shard_spec).shape[1] * input_tensor.element_size() % (this->output_mem_config.buffer_type == BufferType::DRAM ? DRAM_ALIGNMENT : L1_ALIGNMENT) == 0
E info:
E Shard page size must be aligned to output buffer type alignment
E backtrace:
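The failing TT_FATAL multiplies the shard width by the element size and checks divisibility by the output buffer's alignment (64 B for DRAM on BH and 32 B on WH, per later comments in this thread). The same arithmetic as a standalone sketch:

```python
def shard_page_is_aligned(shard_width_elems: int, element_size_bytes: int,
                          alignment_bytes: int) -> bool:
    # Mirrors the TT_FATAL condition: the shard page size in bytes must be
    # a multiple of the output buffer type's alignment.
    return (shard_width_elems * element_size_bytes) % alignment_bytes == 0

# A depth-16 BFLOAT16 stick is 16 * 2 = 32 bytes:
assert shard_page_is_aligned(16, 2, 32)      # fine at WH's 32 B alignment
assert not shard_page_is_aligned(16, 2, 64)  # trips at BH's 64 B DRAM alignment
```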
Test:
pytest -svv tests/ttnn/unit_tests/operations/test_max_pool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-nblocks=1-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[8, 16, 528, 80]-device_params={'l1_small_size': 24576}]
2024-08-23 15:02:23.778 | DEBUG | ttnn:manage_config:90 - Set ttnn.CONFIG.report_name to tests/ttnn/unit_tests/operations/test_max_pool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-nblocks=1-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[8, 16, 528, 80]-device_params={'l1_small_size': 24576}]
@mywoodstock fyi
With the latest main, when I try the above maxpool test, I get a different error:
Always | FATAL | Generated prefetcher command of size 131136 B exceeds max command size 131072 B
FAILED
Any idea about this @abhullar-tt ? I don't know much about prefetcher commands.
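For reference, the overflow in that FATAL is small: the generated command is only 64 bytes over the 128 KiB limit (both sizes taken from the log line above):

```python
MAX_PREFETCH_CMD_SIZE = 131072  # 128 KiB max command size, from the FATAL message
generated_cmd_size = 131136     # generated prefetcher command size, from the same message

overflow = generated_cmd_size - MAX_PREFETCH_CMD_SIZE
assert overflow == 64  # the command exceeds the limit by exactly 64 bytes
```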
Fix is on https://github.com/tenstorrent/tt-metal/pull/11983
Update on maxpool side:
All unit tests pass except the following 6 (they have channel depth 16) which get bad PCC:
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(0, 0)-kernel_size=(2, 2)-act_shape=[4, 16, 1056, 160]-device_params={'l1_small_size': 24576}] - AssertionError: 0.0002867184106536119
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(0, 0)-kernel_size=(2, 2)-act_shape=[8, 16, 528, 80]-device_params={'l1_small_size': 24576}] - AssertionError: 0.05038920860128919
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(0, 0)-kernel_size=(2, 2)-act_shape=[16, 16, 528, 80]-device_params={'l1_small_size': 24576}] - AssertionError: -0.0006334264974595234
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[4, 16, 1056, 160]-device_params={'l1_small_size': 24576}] - AssertionError: 0.0007946853972681697
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[8, 16, 528, 80]-device_params={'l1_small_size': 24576}] - AssertionError: 0.035399027289953706
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[16, 16, 528, 80]-device_params={'l1_small_size': 24576}] - AssertionError: -0.00020153143760114889
Channel depth of 16 works when the input is BFLOAT8_B, but fails with BFLOAT16, which is very suspicious -- I will try to see what's going on.
Still trying to debug the maxpool PCC for these cases. The exact same code on the same number of cores on WH is all good, but BH gives almost completely mismatching results. The output of halo itself is not correct, so the maxpool kernel is probably fine. But since halo is a TM, I'm not sure what could be going wrong -- perhaps some address calculation... Will continue debugging, but @rtawfik01 if you have any potential suspects, lmk.
@mywoodstock can you post the compile time/runtime args that are passed to the compute kernels themselves? There might be a codepath that was not exercised?
Here is the latest. The following happens only with BFLOAT16 input data with channel depth of 16:
The output of halo op looked wrong, but then when i tried to print the input pages themselves, I get wrong data from the reader kernel itself. This is strange because printing the input tensor before the halo op shows everything is in order. The exact same code/kernel on WH works just fine, and the input data is all good there.
@abhullar-tt @rtawfik01 maybe you have an idea of something that could go wrong on BH. Below are the steps to repro the observation through a test case:
Branch:
asarje/debug-halo-bh
Test command:
TT_METAL_DPRINT_RISCVS=NC TT_METAL_DPRINT_CORES=0,0 TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": false, "enable_logging": true}' pytest "tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[1, 16, 1056, 160]-device_params={'l1_small_size': 24576}]" > hehe.in
Followed by:
grep -r "^[0-9]*: " hehe.in
will then show what the reader kernel DPRINTed as the first set of input sticks (the first 32 sticks, of width 16):
0: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
5: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
7: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
9: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10: 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
11: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12: 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
13: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
15: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16: 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
17: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18: 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
19: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20: 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
21: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
22: 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
23: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
24: 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
25: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
27: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
28: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
29: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
31: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Note that alternate sticks are skipped.
Executing the exact same test on WH gives the correct input printout:
0: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
2: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
3: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
4: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
5: 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
6: 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
7: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
8: 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
9: 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
10: 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
11: 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
12: 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
13: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
14: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
15: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
16: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
17: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
18: 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
19: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
20: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
21: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
22: 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
23: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
24: 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
25: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
26: 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
27: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
28: 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
29: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
30: 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
31: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
The kernel on NCRISC that is printing the above is:
ttnn/cpp/ttnn/operations/data_movement/untilize_with_halo_v2/device/kernels/dataflow/halo_gather.cpp
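The zero-interleaved BH dump is consistent with each 32 B stick landing at a 64 B stride: BH row 2k holds what WH prints at row k, and odd BH rows are zero. A small host-side sketch checking that relationship against the printed values (illustrative only, not tt-metal code):

```python
def wh_stick(i: int, width: int = 16) -> list[int]:
    # Expected stick i on WH, per the dump above: values i+1 .. i+width
    return list(range(i + 1, i + 1 + width))

def bh_stick(i: int, width: int = 16) -> list[int]:
    # Observed stick i on BH: data only on even rows, at half the index;
    # odd rows are zeros -- i.e. 32 B sticks laid out at a 64 B stride.
    return wh_stick(i // 2, width) if i % 2 == 0 else [0] * width

# Every even BH row 2k matches WH row k; every odd BH row is zero padding.
assert all(bh_stick(2 * k) == wh_stick(k) for k in range(16))
assert all(bh_stick(2 * k + 1) == [0] * 16 for k in range(16))
```

If that holds, the writer is placing sticks at 64 B-aligned offsets while the reader still assumes a packed 32 B pitch (or vice versa), which would also explain why BFLOAT8_B at the same depth is unaffected.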
@mywoodstock do you know who the author of the halo reader kernel is? Maybe we should add them to this debug and see if any assumptions were made that need to be fixed for Blackhole.
Yeah I'm the author 🙂
Ah, I apologize. Alright, are there any NOC functions in this halo_gather that have not been tested in the other passing ttnn sweeps? Also, is the failure above deterministically showing the same result? I see this halo_gather does both a NOC read and a NOC write in this kernel; if you try using constants in the kernel (not read from another location), do you see the same pattern with interleaved zeroes? Essentially we need to isolate whether this is a timing issue, an index miscalculation issue, or a NOC memory address miscalculation issue.
@ncvetkovicTT @nvelickovicTT fyi for remaining BH maxpool bugs
Are there any passing cases with BFLOAT16? I'm not familiar with what channel depth 16 would mean for the op or kernel, but can you try running with watcher to see if the reader is hitting a potential NOC alignment issue?
@abhullar-tt mentioned BH has a 64-byte DRAM alignment requirement. With depth 16, the stick size is currently 32 bytes, so it needs to be aligned to 64 bytes.
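That padding requirement, sketched out (alignment values per this thread; the helper is illustrative, not the tt-metal API):

```python
def align_up(size_bytes: int, alignment_bytes: int) -> int:
    # Round size_bytes up to the next multiple of alignment_bytes.
    return ((size_bytes + alignment_bytes - 1) // alignment_bytes) * alignment_bytes

# Depth-16 BFLOAT16 stick: 16 channels * 2 bytes = 32 bytes
stick = 16 * 2
assert align_up(stick, 32) == 32  # already aligned at WH's 32 B requirement
assert align_up(stick, 64) == 64  # needs 32 bytes of padding at BH's 64 B requirement
```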
OK, isolated the issue to interleaved_to_sharded and created this: https://github.com/tenstorrent/tt-metal/issues/12184 for @tarafdarTT
Some tests need to be updated to use Blackhole grids, and to ensure the correct kernels are being used. For example, one test generates 2 different kernels for different devices.
For wormhole_b0 it generates a kernel with per_core_block_cnt = 1 and per_core_block_tile_cnt = 1, while for Blackhole it generates one with per_core_block_cnt = 1 and per_core_block_tile_cnt = 32.
The per_core_block_tile_cnt = 32 is incorrect, since this test is only a single 32x32 tile, yet it is generated that way for the Blackhole device. Seems like the potential issue is from here:
All tests that use the above kernels should be updated for Blackhole.
@tarafdarTT @ttmtrajkovic fyi