tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Watcher catches bad access in RN50 on both GS and WH #6273

Closed: mywoodstock closed this 3 months ago

mywoodstock commented 8 months ago

Branch: asarje/alex/metal/bmm_large_block_untilize_out
Test: TT_METAL_WATCHER=20 pytest tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet.py::test_resnet_50

Looks like the preceding matmul is clobbering config tensors for utwh.

acejkov commented 8 months ago

A few updates based on experiments run with untilize enabled on the conv1x1 matmul:

  1. TTNN version of the resnet passes (branch asarje/ttnn-rn50-wh): pytest -ssv tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet.py::test_resnet_50[batch_size=8-act_dtype=DataType.BFLOAT16-weight_dtype=DataType.BFLOAT16-math_fidelity=MathFidelity.HiFi4]

  2. TTLib version of resnet hangs on Grayskull: pytest models/demos/resnet/tests/test_metal_resnet50.py::test_run_resnet50_inference[LoFi-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_20]

It can be reproduced on main with the following patch:

diff --git a/models/demos/resnet/tt/metalResnetBlock50.py b/models/demos/resnet/tt/metalResnetBlock50.py
index 8fdc44805..0b0d9da03 100644
--- a/models/demos/resnet/tt/metalResnetBlock50.py
+++ b/models/demos/resnet/tt/metalResnetBlock50.py
@@ -1123,8 +1123,10 @@ class Bottleneck:
             fuse_relu=True,
             output_mem_config=self.sharded_memory_config,
             weights_dtype=model_config["WEIGHTS_DTYPE"],
-            output_dtype=model_config["ACTIVATIONS_DTYPE"],
+            #output_dtype=model_config["ACTIVATIONS_DTYPE"],
+            output_dtype=tt_lib.tensor.DataType.BFLOAT16,
             compute_kernel_config=compute_kernel_config,
+            untilize_out=True
         )

         self.conv2_params = [width, width, 3, 3, stride, stride, 1, 1, dilation, groups]
diff --git a/tt_eager/tt_lib/fused_ops/conv.py b/tt_eager/tt_lib/fused_ops/conv.py
index 67925a2de..018906908 100644
--- a/tt_eager/tt_lib/fused_ops/conv.py
+++ b/tt_eager/tt_lib/fused_ops/conv.py
@@ -89,6 +89,7 @@ def resnet50_1x1_conv_as_matmul(
     weights_dtype=None,
     output_dtype=None,
     compute_kernel_config=None,
+    untilize_out=False,
 ):
     """
     Returns a function that performs a Convolution. Bias is fused with matmul.
@@ -143,6 +144,7 @@ def resnet50_1x1_conv_as_matmul(
             output_mem_config=activation.memory_config() if output_mem_config is None else output_mem_config,
             output_dtype=output_dtype,
             compute_kernel_config=compute_kernel_config,
+            untilize_out=untilize_out,
         )

         return output

If the writes to L1 are removed in the halo_gather.cpp kernel, the test fails with a PCC error but there is no hang:

diff --git a/tt_eager/tt_dnn/op_library/untilize/kernels/dataflow/halo_gather.cpp b/tt_eager/tt_dnn/op_library/untilize/kernels/dataflow/halo_gather.cpp
index 107a6f7c8..336029397 100644
--- a/tt_eager/tt_dnn/op_library/untilize/kernels/dataflow/halo_gather.cpp
+++ b/tt_eager/tt_dnn/op_library/untilize/kernels/dataflow/halo_gather.cpp
@@ -5,6 +5,7 @@
 #include <stdint.h>

 #include "dataflow_api.h"
+#include "debug/dprint.h"

 // Fill an L1 buffer with the given val
 inline bool fill_with_val(uint32_t begin_addr, uint32_t n, uint16_t val) {
@@ -40,7 +41,7 @@ void copy_sticks_async(
             uint64_t dst_addr = dst_base_addr + dst_local_idx * stick_nbytes;
             uint32_t src_addr = in_base_l1_addr + src_local_idx * stick_nbytes;
             uint32_t size = nsticks * stick_nbytes;
-            noc_async_write(src_addr, dst_addr, size);
+            //noc_async_write(src_addr, dst_addr, size);
         }

         i += length;

Removing the packer writes to L1 doesn't help either, which indicates that compute is probably not corrupting the cb_config data in L1.
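
As an aside, when chasing this kind of clobbering it can help to sanity-check whether any of the reported L1 regions actually overlap. A minimal host-side sketch (names, addresses, and sizes below are placeholders, not values from this run):

```python
def find_overlaps(regions):
    # regions: (name, base_addr, size_bytes) tuples as reported for CBs and
    # config tensors; returns adjacent pairs whose L1 address ranges intersect.
    spans = sorted((base, base + size, name) for name, base, size in regions)
    clashes = []
    for (s0, e0, n0), (s1, e1, n1) in zip(spans, spans[1:]):
        if s1 < e0:
            clashes.append((n0, n1, e0 - s1))  # overlap size in bytes
    return clashes

# placeholder numbers only
print(find_overlaps([("cb_out", 0x10000, 10240), ("utwh_config", 0x12700, 2048)]))
# -> [('cb_out', 'utwh_config', 256)]
```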

mywoodstock commented 8 months ago

There are two separate nasty hang issues:

  1. In the TTNN RN50 model, a pre-processing padding step was missing (it is present in the TTLIB version). This caused the inferred tensor size to be larger than the actual CB sizes, hence spilling over onto the subsequent config tensors in L1. This was fixed by adding the padding step.
  2. The second one is in the matmul op, when untilize_out is set to true. There are two sub-issues:
    • In this line, using the partials CB as the output is gated by the additional condition in1_num_subblocks > 1, but the kernel does not know about this and assumes the additional CB space exists even when in1_num_subblocks == 1. Not fully sure whether this condition is needed; removing it works fine (in addition to the next sub-issue below).
    • When constructing the output tensor (when untilize_out is True, i.e. ROW_MAJOR), in this particular case the shards are 160 rows high (shard shape [160, 32], 5 tiles), but the tensor shape is not snapped up to the shard height (Shape([1, 1, 1568, 256])). Hence the actual memory allocated is 64 bytes smaller than the CB size, causing the kernels to spill into the adjacent tensors (see the sketch after the diff below):
      CB 24 :: 10240 bytes
      CB 16 :: 10240 bytes

      As a "fix", I turned on the pad_to_same_tile_size option in create_sharded_device_tensor when untilize_out is true. This fixed the hang, but have some small PCC regression -- so need some proper fix :)

diff --git a/tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp b/tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp
index bb45f9358..d86fed0af 100644
--- a/tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp
+++ b/tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp
@@ -1054,7 +1054,7 @@ std::vector<Tensor> Matmul::create_output_tensors(const std::vector<Tensor>& inp
                     ShardSpec shard_spec = ShardSpec{all_cores, {per_core_M * TILE_HEIGHT, per_core_N * TILE_WIDTH}, shard_orientation};
                     auto mem_config = this->output_mem_config;
                     mem_config.shard_spec = shard_spec;
-                    return {create_sharded_device_tensor(this->compute_output_shapes(input_tensors).at(0), this->output_dtype, output_layout, input_tensor_a.device(), mem_config)};
+                    return {create_sharded_device_tensor(this->compute_output_shapes(input_tensors).at(0), this->output_dtype, output_layout, input_tensor_a.device(), mem_config, untilize_out)};
                 } else if constexpr (
                     std::is_same_v<ProgramConfigType, MatmulMultiCoreReuseProgramConfig>
                 ) {
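
To make the size mismatch in the second sub-issue concrete, here is a minimal sketch of the arithmetic (values taken from the shard spec and shapes quoted above; the exact per-core shortfall depends on how the allocator splits the un-snapped height):

```python
TILE_HEIGHT, TILE_WIDTH = 32, 32
BFP16_BYTES = 2

shard_h, shard_w = 160, 32                      # shard shape [160, 32], 5x1 tiles
cb_bytes = shard_h * shard_w * BFP16_BYTES      # 10240, matches "CB 24 :: 10240 bytes"

tensor_h = 1568                                 # from Shape([1, 1, 1568, 256])
snapped_h = -(-tensor_h // shard_h) * shard_h   # 1600: full shards along the height
missing_rows = snapped_h - tensor_h             # 32 rows have no backing allocation

print(cb_bytes, snapped_h, missing_rows)        # 10240 1600 32
```

So the CBs are sized for full 160-row shards while the allocated tensor is not, which is the spill the watcher flags.
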
mywoodstock commented 8 months ago

The debug process above was quite cumbersome and time-consuming. A few things that helped (cc: @davorchap):

  1. The new ttnn.print_l1_buffers was crucial in giving an indication of which config tensors were getting corrupted.
  2. A way to inspect the actual contents of L1 is needed. What I did was simply take the L1 address of the config tensors and print that section using DPRINT. This was crucial in root-causing the second hang. A debug tool that could inspect L1 memory would be super useful.
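
For lack of such a tool, a small host-side formatter at least makes raw dumps easy to compare against expected config values. This is only a sketch; how the bytes are pulled out (DPRINT from a kernel, in this case) is not shown:

```python
def hexdump(data: bytes, base_addr: int, width: int = 16) -> str:
    # Format raw L1 bytes as "address: hex bytes" lines so a corrupted
    # config-tensor region is easy to eyeball.
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off : off + width]
        lines.append(f"0x{base_addr + off:08x}: " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)
```
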
mywoodstock commented 8 months ago

https://github.com/tenstorrent-metal/tt-metal/pull/6291 has the fix for the first hang. We are deprioritizing the second one since we don't really need to switch on untilize_out as it does not have much perf gain.
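
For orientation only: the first hang's fix is the missing pre-processing padding described in point 1 above. Purely as an illustration of that kind of step (the padded dimension, names, and assumed [1, 1, N*H*W, C] layout here are assumptions, not the PR's contents):

```python
import torch

TILE_HEIGHT = 32  # assumed tile height for this sketch

def pad_activation_height(act, target_h=None):
    # act: flattened activation of shape [1, 1, N*H*W, C]. Pad dim -2 up to
    # target_h (default: the next TILE_HEIGHT multiple) so the tensor size the
    # downstream ops infer matches the CBs that were actually allocated.
    h = act.shape[-2]
    if target_h is None:
        target_h = -(-h // TILE_HEIGHT) * TILE_HEIGHT
    return torch.nn.functional.pad(act, (0, 0, 0, target_h - h))
```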

mywoodstock commented 8 months ago

@TT-BrianLiu tagging you as the issue is in matmul; it seems like you addressed a similar issue with the pad_to_same_shard_size option, but it doesn't fully resolve this one (bad PCC).

TT-BrianLiu commented 8 months ago

> @TT-BrianLiu tagging you as the issue is in matmul; it seems like you addressed a similar issue with the pad_to_same_shard_size option, but it doesn't fully resolve this one (bad PCC).

@tarafdarTT Can you take a look and see if the PCC issue is the same uneven-shards issue we were seeing before with Falcon40B? pad_to_same_shard_size was a workaround for supporting uneven shards. It basically keeps the same behaviour (at least it should) as before, when all shards were the same size. The proper solution would be to update all of the sharding infra to be aware of uneven shards. In our testing, pad_to_same_shard_size seemed to work (PCC-wise) for all tests, but maybe we missed something.
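
To spell out what "uneven shards" means with the numbers from this case: 1568 rows split into 160-row shards leaves the last shard short, and pad_to_same_shard_size (as described above) works around that by keeping the old all-shards-equal behaviour. A minimal sketch:

```python
def shard_row_counts(total_rows, shard_rows):
    # Rows actually owned by each shard when total_rows does not divide evenly:
    # every shard but the last is full, the last one comes up short.
    full, rem = divmod(total_rows, shard_rows)
    return [shard_rows] * full + ([rem] if rem else [])

print(shard_row_counts(1568, 160))  # [160, 160, ..., 160, 128] -- uneven last shard
```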

mywoodstock commented 8 months ago

btw, this is low priority!

mywoodstock commented 3 months ago

No longer relevant. Closing.