tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Bug Report] Wrong results in binary operators when broadcasting from a tensor whose final two dimensions are 1 #15359

Open marty1885 opened 11 hours ago

marty1885 commented 11 hours ago

Describe the bug With the recent tilization support of FP32 enabled and BFP16 no longer needing an even number of elements in the last dimension. I am able to enable a lot more tests in my GGML backend. However, I find unitests reporting significant numeric errors for adding and multipling tensors of shape [1, 1280, 1, 1] and [1, 1280, 16, 16] (nr is scaling factor, GGML stores shapes in reverse order).

F32 in GGML is currently emulated with BFP16 due to unrelated reasons.
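For reference, this is how one of the failing cases in the test log below maps onto the TTNN shapes used in the reproducer. This is an illustration only; the variable names are hypothetical and not GGML or TTNN API. GGML lists dimensions innermost-first, so the TTNN shape is the reverse of ne, and nr gives the per-dimension repeat counts for the larger operand.

#include <array>
#include <cstdint>
#include <cstdio>

int main()
{
    std::array<uint32_t, 4> ne = {1, 1, 1280, 1}; // GGML shape (innermost dimension first)
    std::array<uint32_t, 4> nr = {16, 16, 1, 1};  // GGML per-dimension repeat factors

    std::array<uint32_t, 4> small_shape{}; // operand whose last two TTNN dims are 1
    std::array<uint32_t, 4> large_shape{}; // operand it is broadcast against
    for (int i = 0; i < 4; ++i) {
        small_shape[i] = ne[3 - i];             // -> {1, 1280, 1, 1}
        large_shape[i] = ne[3 - i] * nr[3 - i]; // -> {1, 1280, 16, 16}
    }

    std::printf("small: [%u, %u, %u, %u]\n", small_shape[0], small_shape[1], small_shape[2], small_shape[3]);
    std::printf("large: [%u, %u, %u, %u]\n", large_shape[0], large_shape[1], large_shape[2], large_shape[3]);
}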

...
  DIV(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): [ADD] NMSE = 0.945152196 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): [MUL] NMSE = 0.005841628 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  MUL(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): [ADD] NMSE = 0.959706242 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): [MUL] NMSE = 0.006204662 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): [ADD] NMSE = 0.975900516 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): [MUL] NMSE = 0.005488787 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): [ADD] NMSE = 0.946392264 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): [MUL] NMSE = 0.006580242 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): [ADD] NMSE = 0.999944993 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): [MUL] NMSE = 0.006642376 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): [ADD] NMSE = 0.886648226 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): [MUL] NMSE = 0.005970012 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]): OK
  MUL(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]): OK
...

To Reproduce The following program is a minimal reproducible example in TTNN/C++ (it corresponds to one of the test cases in the GGML log above).

#include <cstddef>
#include <ttnn/core.hpp>
#include <ttnn/distributed/api.hpp>
#include <ttnn/operations/creation.hpp>
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/operations/data_movement/tilize_with_val_padding/tilize_with_val_padding.hpp>

#include <vector>
#include <iostream>

int main()
{
    auto device = &ttnn::open_device(0);

    // Broadcast source: all ones, shape [1, 1280, 1, 1] (last two dimensions are 1)
    auto a = ttnn::ones(ttnn::SimpleShape({1, 1280, 1, 1}), tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR).to(device);
    a = ttnn::tilize_with_zero_padding(a);

    // Broadcast target: all ones, shape [1, 1280, 16, 16]
    auto b = ttnn::ones(ttnn::SimpleShape({1, 1280, 16, 16}), tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR).to(device);
    b = ttnn::tilize_with_zero_padding(b);

    // Element-wise add that implicitly broadcasts a over the last two dimensions
    auto res = ttnn::add(a, b);
    res = res.cpu();

    std::cerr << res.write_to_string() << std::endl;
}

Adding two tensors of all 1s should produce a tensor of all 2s (plus 0s from the zero padding). Instead it produces a tensor of 2s and 1s.

                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 4
ttnn.Tensor([[[[ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               [ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]],

              [[ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               [ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]],

              ...,

              [[ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               [ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]],

              [[ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               [ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]]]], shape=Shape([1, 1280, 16[32], 16[32]]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers

Remarks:
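A possible workaround until the broadcast path is fixed is to materialize the repeat explicitly before the binary op, so both operands have the same shape and no implicit broadcasting is needed. This is only a sketch and has not been verified against the failing cases; it assumes ttnn::repeat accepts a per-dimension repeat shape and that the header path shown exists, both of which may differ between tt-metal versions.

#include <ttnn/core.hpp>
#include <ttnn/operations/creation.hpp>
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/operations/data_movement/repeat/repeat.hpp> // header path is an assumption
#include <ttnn/operations/data_movement/tilize_with_val_padding/tilize_with_val_padding.hpp>

#include <iostream>

int main()
{
    auto device = &ttnn::open_device(0);

    auto a = ttnn::ones(ttnn::SimpleShape({1, 1280, 1, 1}), tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR).to(device);
    // Repeat a to [1, 1280, 16, 16] while still in row-major layout, so the
    // binary op sees two identically shaped operands (the repeat-shape
    // argument and its type are assumptions, not verified API).
    a = ttnn::repeat(a, ttnn::SimpleShape({1, 1, 16, 16}));
    a = ttnn::tilize_with_zero_padding(a);

    auto b = ttnn::ones(ttnn::SimpleShape({1, 1280, 16, 16}), tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR).to(device);
    b = ttnn::tilize_with_zero_padding(b);

    auto res = ttnn::add(a, b).cpu();
    std::cerr << res.write_to_string() << std::endl;
}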

Expected behavior Addition should produce correct results for all permitted broadcasting shapes.




eyonland commented 1 hour ago

Hey @marty1885, thank you for the report. @yan-zaretskiy is actively working on fixing broadcasting, as the existing implementation has numerous issues.