tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

tt_lib.operations.primary.matmul with fp32 acc enabled gives subblock size error for interleaved inputs #6556

Closed mikevin920 closed 6 months ago

mikevin920 commented 6 months ago

**Describe the bug**
With in0 and in1 both interleaved, calling tt_lib.operations.primary.matmul with fp32_dest_acc_en enabled gives this error: "Total number of tiles in a subblock must be less than 4 when in fp32_dest_acc mode".

**To Reproduce**
Steps to reproduce the behavior:

  1. Create two tensors, both interleaved in DRAM, with the shapes ttnn.Shape([1, 32, 128, 8192]) and ttnn.Shape([1, 1, 8192, 4096]) (a hedged setup sketch follows the config below).
  2. Run tt_lib.operations.primary.matmul with this code:
    
    tt_lib.operations.primary.matmul(
        x[i],
        self.w1_list[i],
        compute_kernel_config=self.model_config["COMPUTE_KERNEL_CONFIG"],
    )

Config: "COMPUTE_KERNEL_CONFIG": ttl.tensor.WormholeComputeKernelConfig(

math_fidelity=ttl.tensor.MathFidelity.LoFi,

        math_fidelity=ttl.tensor.MathFidelity.HiFi2,
        math_approx_mode=True,
        fp32_dest_acc_en=True,
        packer_l1_acc=True,
    ),
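
For completeness, here is a minimal sketch of step 1 above (creating the two interleaved DRAM tensors). It is not taken from the issue; it assumes the ttnn host-conversion path (ttnn.from_torch) and a device opened via ttnn.open_device:

    import torch
    import ttnn

    # Assumption: a single device is available and opened here.
    device = ttnn.open_device(device_id=0)

    # Shapes from the report: activations [1, 32, 128, 8192], weights [1, 1, 8192, 4096].
    x_host = torch.randn(1, 32, 128, 8192, dtype=torch.bfloat16)
    w1_host = torch.randn(1, 1, 8192, 4096, dtype=torch.bfloat16)

    # Tilize and place both tensors interleaved in DRAM, matching the
    # memory_config() pdb output shown in the error message below.
    x = ttnn.from_torch(x_host, layout=ttnn.TILE_LAYOUT, device=device,
                        memory_config=ttnn.DRAM_MEMORY_CONFIG)
    w1 = ttnn.from_torch(w1_host, layout=ttnn.TILE_LAYOUT, device=device,
                         memory_config=ttnn.DRAM_MEMORY_CONFIG)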

**Expected behavior**
**Error Msg:**

    def prefill_forward(self, x: tt_lib.tensor.Tensor) -> tt_lib.tensor.Tensor:
        hidden_states = []
        w1_outs = []
        w3_outs = []
        for i in range(len(x)):
            """
            (Pdb) x[i].memory_config()
            tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
            (Pdb) self.w1_list[i].memory_config()
            tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
            (Pdb) x[i].shape
            ttnn.Shape([1, 32, 128, 8192])
            (Pdb) self.w1_list[i].shape
            ttnn.Shape([1, 1, 8192, 4096])
            """
            w1_outs.append(
                tt_lib.operations.primary.matmul(
                    x[i],
                    self.w1_list[i],
                    compute_kernel_config=self.model_config["COMPUTE_KERNEL_CONFIG"],
                )
            )

    E RuntimeError: TT_FATAL @ tt_eager/tt_dnn/op_library/bmm/multi_core_reuse_mcast_1d_optimized/bmm_op_multi_core_reuse_mcast_1d_optimized.cpp:1227: out_subblock_h * out_subblock_w <= 4 && "Total number of tiles in a subblock must be less than 4 when in fp32_dest_acc mode"

mikevin920 commented 6 months ago

@caixunshiren

jliangTT commented 6 months ago

@TT-BrianLiu, is this an LLK thing with the fp32 accumulation?

jliangTT commented 6 months ago

@cglagovich, are you working on this? Your name is assigned; if not, please unassign so we can track the proper owner.

TT-BrianLiu commented 6 months ago

If you are accumulating with fp32 or in L1, the maximum subblock size is 4 tiles. Issues like this really shouldn't be filed; the error message is not so much a bug as a description of the feature. At best this should be a feature request.
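
For reference, here is a small sketch mirroring the constraint the TT_FATAL above enforces. The limit of 4 with fp32 dest accumulation comes from the error message; the limit of 8 for the default case is my assumption, not something stated in this thread:

    # Hypothetical helper, not part of tt-metal: mirrors the subblock check
    # from the TT_FATAL in the error message above.
    def check_subblock(out_subblock_h: int, out_subblock_w: int, fp32_dest_acc_en: bool) -> None:
        # With fp32 dest accumulation the dest register holds fewer tiles,
        # so the subblock is capped at 4 tiles; 8 is assumed for the default case.
        max_tiles = 4 if fp32_dest_acc_en else 8
        assert out_subblock_h * out_subblock_w <= max_tiles, (
            f"subblock {out_subblock_h}x{out_subblock_w} exceeds {max_tiles} tiles"
        )

    check_subblock(1, 4, fp32_dest_acc_en=True)   # OK: 4 tiles
    check_subblock(2, 4, fp32_dest_acc_en=True)   # raises AssertionError: 8 tiles > 4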

mikevin920 commented 6 months ago

@TT-BrianLiu we did not specify the subblock size for this fallback matmul, so the automatically inferred one is wrong? Shouldn't this be a bug?

TT-BrianLiu commented 6 months ago

Ooh, I see. I don't think we had use cases where the program config was not manually specified while the compute config was. The logic that automatically infers matmul params assumes the default compute config.
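
To make the failure mode concrete, here is a toy sketch (not the actual tt-metal inference code) of how a subblock picker that does not see the compute config can choose an out-of-range subblock:

    # Toy illustration only; the real parameter inference lives inside the matmul op.
    def pick_subblock(per_core_M: int, per_core_N: int, max_tiles: int = 8):
        """Greedily pick the largest subblock (in tiles) that divides the per-core block."""
        best = (1, 1)
        for h in range(1, per_core_M + 1):
            for w in range(1, per_core_N + 1):
                if per_core_M % h == 0 and per_core_N % w == 0 and h * w <= max_tiles:
                    if h * w > best[0] * best[1]:
                        best = (h, w)
        return best

    # A picker that assumes the default dest capacity (8 tiles, an assumption)
    # can return a 2x4 subblock, which then trips the <= 4 check once
    # fp32_dest_acc_en=True is applied via the compute kernel config.
    print(pick_subblock(per_core_M=4, per_core_N=4))               # (2, 4) -> 8 tiles
    print(pick_subblock(per_core_M=4, per_core_N=4, max_tiles=4))  # (1, 4) -> 4 tiles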

TT-BrianLiu commented 6 months ago

It should be fairly straightforward to uplift this to support that case, but I would recommend just passing in your own program config if you want to configure other things. I am hesitant to keep patching our more automatic matmuls until we review whether other settings could also affect program configs.
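
For anyone hitting the same error, here is a sketch of what "pass in your own program config" could look like for this shape. The config class and field names follow the 1D mcast matmul program config as I understand it, and every numeric value below is an illustrative placeholder that would need tuning for the real grid and shapes; none of it comes from this issue:

    # Sketch only: values are placeholders, chosen so that
    # out_subblock_h * out_subblock_w <= 4, as required with fp32_dest_acc_en=True.
    program_config = ttl.operations.primary.MatmulMultiCoreReuseMultiCast1DProgramConfig(
        compute_with_storage_grid_size=(8, 8),  # placeholder grid
        in0_block_w=4,                          # placeholder K blocking (in tiles)
        out_subblock_h=2,
        out_subblock_w=2,                       # 2 * 2 = 4 tiles, within the fp32 limit
        per_core_M=128,                         # placeholder (M tiles per core)
        per_core_N=2,                           # placeholder (N tiles per core)
        fuse_batch=True,
        fused_activation=None,
        mcast_in0=True,
    )

    w1_out = tt_lib.operations.primary.matmul(
        x[i],
        self.w1_list[i],
        program_config=program_config,
        compute_kernel_config=self.model_config["COMPUTE_KERNEL_CONFIG"],
    )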
