tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Bug Report] Wrong results in binary operators when broadcasting from a tensor whose final two dimensions are 1 #15359

Open marty1885 opened 11 hours ago

marty1885 commented 11 hours ago

Describe the bug With the recent tilization support of FP32 enabled and BFP16 no longer needing an even number of elements in the last dimension. I am able to enable a lot more tests in my GGML backend. However, I find unitests reporting significant numeric errors for adding and multipling tensors of shape [1, 1280, 1, 1] and [1, 1280, 16, 16] (nr is scaling factor, GGML stores shapes in reverse order).

F32 in GGML is currently emulated with BFP16 due to unrelated reasons.
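For reference, this is how one of the failing cases in the test log below maps onto the TTNN shapes used in the reproducer. This is an illustration only; the variable names are hypothetical and not GGML or TTNN API. GGML lists dimensions innermost-first, so the TTNN shape is the reverse of ne, and nr gives the per-dimension repeat counts for the larger operand.

#include <array>
#include <cstdint>
#include <cstdio>

int main()
{
    std::array<uint32_t, 4> ne = {1, 1, 1280, 1}; // GGML shape (innermost dimension first)
    std::array<uint32_t, 4> nr = {16, 16, 1, 1};  // GGML per-dimension repeat factors

    std::array<uint32_t, 4> small_shape{}; // operand whose last two TTNN dims are 1
    std::array<uint32_t, 4> large_shape{}; // operand it is broadcast against
    for (int i = 0; i < 4; ++i) {
        small_shape[i] = ne[3 - i];             // -> {1, 1280, 1, 1}
        large_shape[i] = ne[3 - i] * nr[3 - i]; // -> {1, 1280, 16, 16}
    }

    std::printf("small: [%u, %u, %u, %u]\n", small_shape[0], small_shape[1], small_shape[2], small_shape[3]);
    std::printf("large: [%u, %u, %u, %u]\n", large_shape[0], large_shape[1], large_shape[2], large_shape[3]);
}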

...
  DIV(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): [ADD] NMSE = 0.945152196 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): [MUL] NMSE = 0.005841628 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  MUL(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): [ADD] NMSE = 0.959706242 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): [MUL] NMSE = 0.006204662 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): [ADD] NMSE = 0.975900516 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): [MUL] NMSE = 0.005488787 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): [ADD] NMSE = 0.946392264 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): [MUL] NMSE = 0.006580242 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): [ADD] NMSE = 0.999944993 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): [MUL] NMSE = 0.006642376 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): [ADD] NMSE = 0.886648226 > 0.000030000 FAIL
  MUL(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): [MUL] NMSE = 0.005970012 > 0.000030000 FAIL
  DIV(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): not supported [Metalium] 
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]): OK
  MUL(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]): OK
...

To Reproduce The following program is a minimal reproducible example in TTNN/C++ (it corresponds to one of the test cases in the GGML log above).

#include <cstddef>
#include <ttnn/core.hpp>
#include <ttnn/distributed/api.hpp>
#include <ttnn/operations/creation.hpp>
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/operations/data_movement/tilize_with_val_padding/tilize_with_val_padding.hpp>

#include <vector>
#include <iostream>

int main()
{
    auto device = &ttnn::open_device(0);

    // Broadcast source: all ones, shape [1, 1280, 1, 1] (last two dimensions are 1)
    auto a = ttnn::ones(ttnn::SimpleShape({1, 1280, 1, 1}), tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR).to(device);
    a = ttnn::tilize_with_zero_padding(a);

    // Broadcast target: all ones, shape [1, 1280, 16, 16]
    auto b = ttnn::ones(ttnn::SimpleShape({1, 1280, 16, 16}), tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR).to(device);
    b = ttnn::tilize_with_zero_padding(b);

    // Element-wise add that implicitly broadcasts a over the last two dimensions
    auto res = ttnn::add(a, b);
    res = res.cpu();

    std::cerr << res.write_to_string() << std::endl;
}

Adding two tensors of all 1s should produce a tensor of all 2s (plus 0s from the zero padding). Instead it produces a tensor of 2s and 1s.

                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 0
                  Metal | INFO     | MMIO Device 0 : Tunnel 0 : Device 4
ttnn.Tensor([[[[ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               [ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]],

              [[ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               [ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]],

              ...,

              [[ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               [ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]],

              [[ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               [ 2.00000,  2.00000,  ...,  2.00000,  2.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]]]], shape=Shape([1, 1280, 16[32], 16[32]]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers

Remarks:
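A possible workaround until the broadcast path is fixed is to materialize the repeat explicitly before the binary op, so both operands have the same shape and no implicit broadcasting is needed. This is only a sketch and has not been verified against the failing cases; it assumes ttnn::repeat accepts a per-dimension repeat shape and that the header path shown exists, both of which may differ between tt-metal versions.

#include <ttnn/core.hpp>
#include <ttnn/operations/creation.hpp>
#include <ttnn/operations/eltwise/binary/binary.hpp>
#include <ttnn/operations/data_movement/repeat/repeat.hpp> // header path is an assumption
#include <ttnn/operations/data_movement/tilize_with_val_padding/tilize_with_val_padding.hpp>

#include <iostream>

int main()
{
    auto device = &ttnn::open_device(0);

    auto a = ttnn::ones(ttnn::SimpleShape({1, 1280, 1, 1}), tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR).to(device);
    // Repeat a to [1, 1280, 16, 16] while still in row-major layout, so the
    // binary op sees two identically shaped operands (the repeat-shape
    // argument and its type are assumptions, not verified API).
    a = ttnn::repeat(a, ttnn::SimpleShape({1, 1, 16, 16}));
    a = ttnn::tilize_with_zero_padding(a);

    auto b = ttnn::ones(ttnn::SimpleShape({1, 1280, 16, 16}), tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR).to(device);
    b = ttnn::tilize_with_zero_padding(b);

    auto res = ttnn::add(a, b).cpu();
    std::cerr << res.write_to_string() << std::endl;
}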

Expected behavior Addition should produce correct results for all permitted broadcasting shapes.




eyonland commented 1 hour ago

Hey @marty1885, thank you for the report. @yan-zaretskiy is actively working on fixing broadcasting, as the existing implementation has numerous issues.