A few updates based on the experiments that were run with matmul conv1x1 untilize enabled:
The TTNN version of ResNet passes (branch asarje/ttnn-rn50-wh): pytest -ssv tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet.py::test_resnet_50[batch_size=8-act_dtype=DataType.BFLOAT16-weight_dtype=DataType.BFLOAT16-math_fidelity=MathFidelity.HiFi4]
The TTLib version of ResNet hangs on Grayskull: pytest models/demos/resnet/tests/test_metal_resnet50.py::test_run_resnet50_inference[LoFi-activations_BFLOAT8_B-weights_BFLOAT8_B-batch_20]
Can be reproduced on main with the following patch:
diff --git a/models/demos/resnet/tt/metalResnetBlock50.py b/models/demos/resnet/tt/metalResnetBlock50.py
index 8fdc44805..0b0d9da03 100644
--- a/models/demos/resnet/tt/metalResnetBlock50.py
+++ b/models/demos/resnet/tt/metalResnetBlock50.py
@@ -1123,8 +1123,10 @@ class Bottleneck:
fuse_relu=True,
output_mem_config=self.sharded_memory_config,
weights_dtype=model_config["WEIGHTS_DTYPE"],
- output_dtype=model_config["ACTIVATIONS_DTYPE"],
+ #output_dtype=model_config["ACTIVATIONS_DTYPE"],
+ output_dtype=tt_lib.tensor.DataType.BFLOAT16,
compute_kernel_config=compute_kernel_config,
+ untilize_out=True
)
self.conv2_params = [width, width, 3, 3, stride, stride, 1, 1, dilation, groups]
diff --git a/tt_eager/tt_lib/fused_ops/conv.py b/tt_eager/tt_lib/fused_ops/conv.py
index 67925a2de..018906908 100644
--- a/tt_eager/tt_lib/fused_ops/conv.py
+++ b/tt_eager/tt_lib/fused_ops/conv.py
@@ -89,6 +89,7 @@ def resnet50_1x1_conv_as_matmul(
weights_dtype=None,
output_dtype=None,
compute_kernel_config=None,
+ untilize_out=False,
):
"""
Returns a function that performs a Convolution. Bias is fused with matmul.
@@ -143,6 +144,7 @@ def resnet50_1x1_conv_as_matmul(
output_mem_config=activation.memory_config() if output_mem_config is None else output_mem_config,
output_dtype=output_dtype,
compute_kernel_config=compute_kernel_config,
+ untilize_out=untilize_out,
)
return output
If writes to L1 are removed in the halo_gather.cpp kernel, the test fails with a PCC error, but there is no hang:
diff --git a/tt_eager/tt_dnn/op_library/untilize/kernels/dataflow/halo_gather.cpp b/tt_eager/tt_dnn/op_library/untilize/kernels/dataflow/halo_gather.cpp
index 107a6f7c8..336029397 100644
--- a/tt_eager/tt_dnn/op_library/untilize/kernels/dataflow/halo_gather.cpp
+++ b/tt_eager/tt_dnn/op_library/untilize/kernels/dataflow/halo_gather.cpp
@@ -5,6 +5,7 @@
#include <stdint.h>
#include "dataflow_api.h"
+#include "debug/dprint.h"
// Fill an L1 buffer with the given val
inline bool fill_with_val(uint32_t begin_addr, uint32_t n, uint16_t val) {
@@ -40,7 +41,7 @@ void copy_sticks_async(
uint64_t dst_addr = dst_base_addr + dst_local_idx * stick_nbytes;
uint32_t src_addr = in_base_l1_addr + src_local_idx * stick_nbytes;
uint32_t size = nsticks * stick_nbytes;
- noc_async_write(src_addr, dst_addr, size);
+ //noc_async_write(src_addr, dst_addr, size);
}
i += length;
Removing packer writes to L1 doesn't help either, which indicates that compute is probably not corrupting cb_config data in L1.
There are two separate nasty hang issues in the matmul op when untilize_out is set to true:
1. The additional CB space is allocated only when in1_num_subblocks > 1, but the kernel does not know about this and assumes the additional CB space exists even when in1_num_subblocks == 1. Not fully sure if this condition is needed? Removing it works fine (in addition to the fix for the next subissue).
2. The shard shape is [160, 32] (i.e. 5 tiles), but the tensor shape is not snapped to TILE_HEIGHT (Shape([1, 1, 1568, 256])). Hence, the actual memory allocated is 64 bytes smaller than the CB size, causing the kernels to spill into the adjacent tensors (see the size arithmetic sketched after this list):
CB 24 :: 10240 bytes
CB 16 :: 10240 bytes
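For reference, the size arithmetic behind those CB numbers (a quick sketch, assuming BFLOAT16 output, i.e. 2 bytes per element; the values are taken from the shard shape and CB sizes quoted above):

```python
# Size arithmetic for the shard/CB mismatch described above (BFLOAT16 assumed).
TILE_H = TILE_W = 32
BF16_BYTES = 2

shard_rows, shard_cols = 160, 32                 # shard shape [160, 32] = 5 x 1 tiles
cb_bytes = shard_rows * shard_cols * BF16_BYTES
assert cb_bytes == 10240                         # matches CB 16 / CB 24 above

# 64 bytes is exactly one 32-element BFLOAT16 stick (one row-major row),
# i.e. the allocated output shard comes up one stick short of the CB size.
stick_bytes = shard_cols * BF16_BYTES
assert stick_bytes == 64
```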
As a "fix", I turned on the pad_to_same_tile_size
option in create_sharded_device_tensor
when untilize_out
is true
. This fixed the hang, but have some small PCC regression -- so need some proper fix :)
diff --git a/tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp b/tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp
index bb45f9358..d86fed0af 100644
--- a/tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp
+++ b/tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp
@@ -1054,7 +1054,7 @@ std::vector<Tensor> Matmul::create_output_tensors(const std::vector<Tensor>& inp
ShardSpec shard_spec = ShardSpec{all_cores, {per_core_M * TILE_HEIGHT, per_core_N * TILE_WIDTH}, shard_orientation};
auto mem_config = this->output_mem_config;
mem_config.shard_spec = shard_spec;
- return {create_sharded_device_tensor(this->compute_output_shapes(input_tensors).at(0), this->output_dtype, output_layout, input_tensor_a.device(), mem_config)};
+ return {create_sharded_device_tensor(this->compute_output_shapes(input_tensors).at(0), this->output_dtype, output_layout, input_tensor_a.device(), mem_config, untilize_out)};
} else if constexpr (
std::is_same_v<ProgramConfigType, MatmulMultiCoreReuseProgramConfig>
) {
The debug process above was quite cumbersome and time consuming. A few things that helped (cc: @davorchap):
- ttnn.print_l1_buffers was crucial in giving an indication of which config tensors were getting corrupted.

https://github.com/tenstorrent-metal/tt-metal/pull/6291 has the fix for the first hang. We are deprioritizing the second one since we don't really need to switch on untilize_out, as it does not have much perf gain.
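For anyone retracing this debug flow, here is a minimal usage sketch of the ttnn.print_l1_buffers call mentioned above. The exact signature (and whether it takes an output path) is an assumption on my part; check the current ttnn source for the real API.

```python
import ttnn

# Hypothetical usage sketch -- the exact signature of ttnn.print_l1_buffers
# may differ from this; treat it as an illustration, not the authoritative API.
device = ttnn.open_device(device_id=0)
try:
    # ... run the suspect op(s) here, e.g. the sharded matmul with
    # untilize_out=True, so their buffers are live in L1 ...
    ttnn.print_l1_buffers()  # dump the L1 buffer layout to spot clobbered config tensors
finally:
    ttnn.close_device(device)
```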
@TT-BrianLiu tagging you as the issue is in matmul, and it seems like you addressed a similar issue with the pad_to_same_shard_size option, but it doesn't fully resolve this issue (bad PCC).
@tarafdarTT Can you take a look and see if the PCC problem is the same uneven-shards issue we were seeing before with Falcon40B? pad_to_same_shard_size was a workaround for supporting uneven shards. It basically keeps the same behaviour (at least it should) as before when all shards are the same size. The proper solution would be to update all of the sharding infra to be aware of uneven shards. In our testing, pad_to_same_shard_size seemed to work (PCC-wise) for all tests, but maybe we missed something.
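For concreteness, here is what "uneven shards" means for the shapes in this issue (a sketch built from the shard shape [160, 32] and Shape([1, 1, 1568, 256]) quoted above; the padding computation only illustrates what pad_to_same_shard_size / pad_to_same_tile_size conceptually do, not the real tt-metal implementation):

```python
# Uneven-shard arithmetic for the shapes discussed above (illustrative only).
SHARD_ROWS = 160      # shard shape [160, 32] -> 5 tiles high
TENSOR_ROWS = 1568    # flattened height of Shape([1, 1, 1568, 256])

full_shards, remainder = divmod(TENSOR_ROWS, SHARD_ROWS)
print(full_shards, remainder)    # 9 full 160-row shards plus one 128-row shard

# Padding every shard up to the full shard height makes each core's allocation
# the same size, which is what the kernels assume when sizing their CBs.
num_shards = full_shards + (1 if remainder else 0)
padded_rows = num_shards * SHARD_ROWS
print(padded_rows)               # 1600 rows instead of 1568
```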
btw, this is low priority!
no longer relevant. closing
Branch: asarje/alex/metal/bmm_large_block_untilize_out
Test: TT_METAL_WATCHER=20 pytest tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet.py::test_resnet_50
Looks like the preceding matmul is clobbering the config tensors for UTWH (untilize_with_halo).