tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

[Blackhole Bringup] Update tests to blackhole grid/block sizes #11684

Open rtawfik01 opened 3 weeks ago

rtawfik01 commented 3 weeks ago

Some tests need to be updated to use Blackhole grid sizes, and we need to ensure the correct kernels are being used. For example:

TT_METAL_DPRINT_CORES=0,0 pytest -svv tests/ttnn/unit_tests/test_to_layout.py::test_to_layout_2D[start_with_padding=False-to_layout=Layout.ROW_MAJOR-from_layout=Layout.TILE-on_device=True-width=32-height=32]

This test generates two different kernels depending on the device.

For wormhole_b0 it generates:

ttnn/cpp/ttnn/operations/data_movement/untilize/device/kernels/compute/pack_untilize.cpp

with per_core_block_cnt=1 and per_core_block_tile_cnt=1,

while for Blackhole it generates:

ttnn/cpp/ttnn/operations/data_movement/untilize/device/kernels/compute/untilize.cpp

with per_core_block_cnt=1 and per_core_block_tile_cnt=32.

The per_core_block_tile_cnt=32 is incorrect, since this test is only a single 32x32 tile, yet that value is generated for the Blackhole device. The issue seems to originate here:

    std::string compute_kernel("ttnn/cpp/ttnn/operations/data_movement/untilize/device/kernels/compute/pack_untilize.cpp");
    if (ntiles_per_block > MAX_PACK_UNTILIZE_WIDTH || !use_pack_untilize) {
        log_debug(tt::LogOp, "Using slow untilize.");
        compute_kernel = std::string("ttnn/cpp/ttnn/operations/data_movement/untilize/device/kernels/compute/untilize.cpp");
    } else {
        log_debug(tt::LogOp, "Using fast pack untilize.");
    }

All tests that use the above kernels should be updated for Blackhole (a quick sanity check on the expected arguments is sketched below).
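A 32x32 input is exactly one 32x32 tile, so per_core_block_tile_cnt should be 1 on any device. A minimal Python sketch of that arithmetic (the helper below is illustrative, not repo code):

    TILE_HW = 32  # tile height/width in elements

    def total_tiles(height: int, width: int) -> int:
        """Number of 32x32 tiles covering a height x width tensor."""
        return (height // TILE_HW) * (width // TILE_HW)

    # The test tensor is a single tile, so a per-core tile count of 32
    # cannot match the workload, regardless of the device grid.
    assert total_tiles(32, 32) == 1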

@tarafdarTT @ttmtrajkovic fyi

rtawfik01 commented 3 weeks ago

Adding another example test here:

tests/tt_eager/python_api_testing/unit_testing/misc/test_optimized_conv_v2.py::test_optimized_conv_v2

It fails with a grid-size assertion:

python3: ../ttnn/cpp/ttnn/operations/conv/conv2d/device/multi_core_optimized_conv_sharded/optimized_conv_op_sharded_v2.cpp:456: operation::ProgramWithCallbacks ttnn::operations::conv::conv2d::multi_core_optimized_conv_sharded_v2_impl(tt_metal::Program &, const Tensor &, const Tensor &, const Shape &, std::optional<const Tensor>, const std::optional<const Tensor>, vector<int>, uint32_t, bool, bool, bool, const OptimizedConvParallelizationConfig &, const OptimizedConvBlockConfig &, uint32_t, bool, bool, Tensor &, DeviceComputeKernelConfig, bool, bool, bool): Assertion `num_cores_x < 13' failed.
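For context, a minimal sketch of this failure mode (grid widths are illustrative, not authoritative): the op hard-codes an upper bound on grid width, and a wider Blackhole compute grid violates it when the full grid is used.

    MAX_CONV_CORES_X = 13  # bound asserted in optimized_conv_op_sharded_v2.cpp

    def check_conv_grid(num_cores_x: int) -> None:
        """Mirror the op's hard-coded grid-width assertion."""
        assert num_cores_x < MAX_CONV_CORES_X, f"num_cores_x < 13 failed ({num_cores_x})"

    check_conv_grid(8)     # passes on a Wormhole-sized 8x8 grid
    # check_conv_grid(13)  # raises, mirroring the Blackhole failure above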
rtawfik01 commented 3 weeks ago

More BH grid-size failures:

    def copy_output_from_device(self, output_d: ttnn.Tensor):
        interleaved_mem_config = ttnn.MemoryConfig(ttnn.TensorMemoryLayout.INTERLEAVED, ttnn.BufferType.DRAM)
>       output_d = ttl.tensor.sharded_to_interleaved(output_d, interleaved_mem_config)
E       RuntimeError: TT_FATAL @ ../ttnn/cpp/ttnn/deprecated/tt_dnn/op_library/sharded/sharded_op.cpp:33: (*input_tensor.memory_config().shard_spec).shape[1] * input_tensor.element_size() % (this->output_mem_config.buffer_type == BufferType::DRAM ? DRAM_ALIGNMENT : L1_ALIGNMENT) == 0
E       info:
E       Shard page size must be aligned to output buffer type alignment
E       backtrace:

Test:

pytest -svv tests/ttnn/unit_tests/operations/test_max_pool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-nblocks=1-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[8, 16, 528, 80]-device_params={'l1_small_size': 24576}]

2024-08-23 15:02:23.778 | DEBUG    | ttnn:manage_config:90 - Set ttnn.CONFIG.report_name to tests/ttnn/unit_tests/operations/test_max_pool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-nblocks=1-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[8, 16, 528, 80]-device_params={'l1_small_size': 24576}]

@mywoodstock fyi
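Plugging this test's numbers into the TT_FATAL condition above shows why the check fires. The 64 B DRAM alignment figure comes from a later comment in this thread, and the shard width is inferred from the channel depth of 16, so treat both as illustrative:

    DRAM_ALIGNMENT = 64                # Blackhole DRAM alignment (per later comment)
    shard_width_elems = 16             # channel depth from act_shape=[8, 16, 528, 80]
    element_size = 2                   # bytes per BFLOAT16 element
    page_size = shard_width_elems * element_size   # 32 B
    print(page_size % DRAM_ALIGNMENT == 0)         # False -> TT_FATAL fires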

mywoodstock commented 2 weeks ago

With the latest main, when I try the above maxpool test, I get a different error:

                 Always | FATAL    | Generated prefetcher command of size 131136 B exceeds max command size 131072 B
FAILED

Any idea about this @abhullar-tt? I don't know much about prefetcher commands.
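For what it's worth, the overflow is small; the generated command exceeds the 131072 B (128 KiB) limit by exactly 64 B:

    print(131136 - 131072)  # 64 B over the 128 KiB max command size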

abhullar-tt commented 2 weeks ago

> With the latest main, when I try the above maxpool test, I get a different error:
>
>                  Always | FATAL    | Generated prefetcher command of size 131136 B exceeds max command size 131072 B
> FAILED
>
> Any idea about this @abhullar-tt? I don't know much about prefetcher commands.

Fix is on https://github.com/tenstorrent/tt-metal/pull/11983

mywoodstock commented 2 weeks ago

Update on maxpool side:

All unit tests pass except the following six (all with channel depth 16), which get bad PCC:

FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(0, 0)-kernel_size=(2, 2)-act_shape=[4, 16, 1056, 160]-device_params={'l1_small_size': 24576}] - AssertionError: 0.0002867184106536119
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(0, 0)-kernel_size=(2, 2)-act_shape=[8, 16, 528, 80]-device_params={'l1_small_size': 24576}] - AssertionError: 0.05038920860128919
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(0, 0)-kernel_size=(2, 2)-act_shape=[16, 16, 528, 80]-device_params={'l1_small_size': 24576}] - AssertionError: -0.0006334264974595234
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[4, 16, 1056, 160]-device_params={'l1_small_size': 24576}] - AssertionError: 0.0007946853972681697
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[8, 16, 528, 80]-device_params={'l1_small_size': 24576}] - AssertionError: 0.035399027289953706
FAILED tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[16, 16, 528, 80]-device_params={'l1_small_size': 24576}] - AssertionError: -0.00020153143760114889

A channel depth of 16 works when the input is BFLOAT8_B but fails with BFLOAT16, which is very suspicious -- I will try to see what's going on.
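For reference, the numbers in the AssertionErrors above are PCC values. A generic sketch of the check (plain Pearson correlation; not necessarily the repo's comparison helper):

    import numpy as np

    def pcc(expected: np.ndarray, actual: np.ndarray) -> float:
        """Pearson correlation coefficient between flattened tensors."""
        return float(np.corrcoef(expected.ravel(), actual.ravel())[0, 1])

    # Tests assert the PCC is above a threshold close to 1.0; values near
    # zero (e.g. 0.0003 above) mean the output is essentially uncorrelated
    # with the reference, not merely off by a small numerical error.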

mywoodstock commented 2 weeks ago

Still trying to debug the maxpool PCC for these cases. The exact same code on the same number of cores is all good on WH, but BH gives almost completely mismatched results. The output of halo itself is not correct, so the maxpool kernel is probably fine. But since halo is a TM, I'm not sure what could be going wrong -- perhaps some address calculation. I'll continue debugging, but @rtawfik01, if you have any potential suspects, lmk.

rtawfik01 commented 2 weeks ago

@mywoodstock can you post the compile-time/runtime args that are passed to the compute kernels themselves? There might be a codepath that was not exercised.

mywoodstock commented 2 weeks ago

Here is the latest. The following happens only with BFLOAT16 input data with channel depth of 16:

The output of the halo op looked wrong, but when I tried to print the input pages themselves, I got wrong data from the reader kernel itself. This is strange, because printing the input tensor before the halo op shows everything is in order. The exact same code/kernel works just fine on WH, and the input data is all good there.

@abhullar-tt @rtawfik01 maybe you have an idea of something that could go wrong on BH. Below are the steps to repro the observation through a test case:

Branch:

asarje/debug-halo-bh

Test command:

TT_METAL_DPRINT_RISCVS=NC TT_METAL_DPRINT_CORES=0,0 TTNN_CONFIG_OVERRIDES='{"enable_fast_runtime_mode": false, "enable_logging": true}' pytest "tests/ttnn/unit_tests/operations/test_maxpool2d.py::test_run_max_pool[dtype=DataType.BFLOAT16-dilation=(1, 1)-stride=(2, 2)-padding=(1, 1)-kernel_size=(3, 3)-act_shape=[1, 16, 1056, 160]-device_params={'l1_small_size': 24576}]" > hehe.in

Followed by:

grep -r "^[0-9]*: " hehe.in

This will then show you what the reader kernel DPRINTed as the first set of input sticks (the first 32 sticks, each of width 16):

0: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
5: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
7: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
9: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10: 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
11: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12: 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
13: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
15: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16: 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
17: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18: 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
19: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20: 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
21: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
22: 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
23: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
24: 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
25: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
27: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
28: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
29: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
31: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Note that alternate sticks are skipped: every other row comes back as zeros.

Executing the exact same test on WH gives the correct input printout:

0: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
2: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
3: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
4: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
5: 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
6: 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
7: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
8: 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
9: 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
10: 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
11: 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
12: 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
13: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
14: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
15: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
16: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
17: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
18: 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
19: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
20: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
21: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
22: 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
23: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
24: 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
25: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
26: 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
27: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
28: 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
29: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
30: 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
31: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

The kernel on NCRISC that is printing the above is:

ttnn/cpp/ttnn/operations/data_movement/untilize_with_halo_v2/device/kernels/dataflow/halo_gather.cpp
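One layout bug consistent with the printout above: each 32 B stick (16 bf16 values) is written at a 64 B DRAM-aligned pitch, while the reader walks memory at the unpadded 32 B pitch, so every other row it reads is zero padding. A self-contained sketch of that hypothesis (sizes illustrative, assuming the 64 B alignment discussed below):

    import numpy as np

    STICK_BYTES, PITCH_BYTES = 32, 64   # 16 bf16 values; 64 B aligned pitch
    sticks = [np.arange(k + 1, k + 17, dtype=np.uint16) for k in range(4)]

    # Producer: write each 32 B stick at a 64 B pitch (zero padding between).
    buf = np.zeros(len(sticks) * PITCH_BYTES, dtype=np.uint8)
    for k, s in enumerate(sticks):
        buf[k * PITCH_BYTES : k * PITCH_BYTES + STICK_BYTES] = s.view(np.uint8)

    # Reader: walk the buffer at the unpadded 32 B pitch.
    for row in range(2 * len(sticks)):
        chunk = buf[row * STICK_BYTES : (row + 1) * STICK_BYTES].view(np.uint16)
        print(f"{row}:", *chunk)  # alternating rows of data and zeros, as above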
rtawfik01 commented 1 week ago

@mywoodstock do you know who the author of the halo reader kernel is? Maybe we should add them to this debug and see whether any assumptions were made that need to be fixed for Blackhole.

mywoodstock commented 1 week ago

Yeah I'm the author 🙂

rtawfik01 commented 1 week ago

Ah, I apologize. Alright, are there any NOC functions in this halo_gather that have not been tested in the other passing ttnn sweeps? Also, does the failure above deterministically show the same result? I see this halo_gather does both a NOC read and write in this kernel; if you use constants in the kernel (rather than data read from another location), do you see the same pattern of interleaved zeroes? Essentially, we need to isolate whether this is a timing issue, an index miscalculation, or a NOC memory address miscalculation.

rtawfik01 commented 1 week ago

@ncvetkovicTT @nvelickovicTT fyi for remaining BH maxpool bugs

abhullar-tt commented 1 week ago

Are there any passing cases with BFLOAT16? I'm not familiar with what channel depth 16 would mean for the op or kernel, but can you try running with watcher to see if the reader is hitting a potential NOC alignment issue?

mywoodstock commented 1 week ago

@abhullar-tt mentioned BH has a 64-byte DRAM alignment requirement. With depth 16, the stick size is currently 32 B, so it needs to be aligned to 64 B.
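A minimal sketch of the required padding (the 64 B value is taken from the comment above):

    def align_up(size_bytes: int, alignment: int) -> int:
        """Round size_bytes up to the next multiple of alignment."""
        return ((size_bytes + alignment - 1) // alignment) * alignment

    # Depth 16 in BFLOAT16: 16 * 2 = 32 B per stick, which must be padded
    # up to 64 B for Blackhole DRAM.
    assert align_up(16 * 2, 64) == 64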

mywoodstock commented 1 week ago

OK, I isolated the issue to interleaved_to_sharded and created https://github.com/tenstorrent/tt-metal/issues/12184 for @tarafdarTT.