tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

[Bug Report] Garbage in tt::tt_metal::Tensor::unpad result #11082

Open marty1885 opened 1 month ago

marty1885 commented 1 month ago

Describe the bug

tt::tt_metal::Tensor::unpad should unpad the tensor on the CPU. However, I found that it produces partially garbage data as of commit 5aa33ef359453c04e519c9ffa29c4ac5815c4fc9. I am using this function as a fallback in my GGML backend to create views into tensors.

To Reproduce

Steps to reproduce the behavior:

  1. Compile the following code
  2. Run it
  3. Observe the output
#include <cstddef>
#include <ttnn/device.hpp>
#include <ttnn/operations/data_movement/tilize_with_val_padding/tilize_with_val_padding.hpp>

#include "common/bfloat16.hpp"
#include "tt_dnn/op_library/auto_format.hpp"
#include "ttnn/operations/data_movement/permute/permute.hpp"
#include <tt_metal/detail/persistent_kernel_cache.hpp>
#include "ttnn/tensor/tensor.hpp"
#include "ttnn/tensor/types.hpp"

#include <vector>
#include <iostream>

ttnn::device::Device* device = nullptr;

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s
        , tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    return ttnn::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()));
}

void dump_first_tile_of_tensor(tt::tt_metal::Tensor tensor)
{
    assert(tensor.dtype() == tt::tt_metal::DataType::BFLOAT16);
    auto t = tensor;
    if(t.storage_type() == tt::tt_metal::StorageType::DEVICE)
        t = t.cpu();
    if(t.layout() != tt::tt_metal::Layout::ROW_MAJOR)
        t = t.to(tt::tt_metal::Layout::ROW_MAJOR);

    // This is shorter to write. Don't care about performance
    t = t.to(AutoFormat::GetDefaultDevice());
    std::vector<bfloat16> buf(t.shape().with_tile_padding().volume());
    memcpy(buf.data(), t);

    for(int y = 0; y < 32; y++) {
        for(int x = 0; x < 32; x++) {
            if(y >= t.shape()[2] || x >= t.shape()[3])
                std::cout << "0 ";
            else
                std::cout << buf[y*32+x].to_float() << " ";
        }
        std::cout << "\n";
    }
    std::cout << "\n";
}

int main()
{
    device = &ttnn::device::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device);
    tt::tt_metal::detail::EnablePersistentKernelCache();

    auto a = make_random_tensor({1, 1, 10, 10});

    Shape start(std::vector<uint32_t>{0, 0, 0, 0});
    Shape end(std::vector<uint32_t>{0, 0, 5, 5});
    auto b = a.cpu().to(tt::tt_metal::Layout::ROW_MAJOR).unpad(start, end);

    std::cout << "A:\n";
    dump_first_tile_of_tensor(a);
    std::cout << "B:\n";
    dump_first_tile_of_tensor(b);

    device->close();
}

Output

                 Device | INFO     | Opening user mode device driver
2024-08-05 09:22:35.450 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | Running with 1 cqs 
                  Metal | INFO     | AI CLK for device 0 is:   1202 MHz
                  Metal | INFO     | Enabling program cache on device 0
                  Verif | INFO     | Created a random vector of size 100
A:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 -0.135742 -0.953125 -0.416016 0.0493164 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 -0.265625 -0.53125 -0.0874023 -0.816406 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0.18457 -0.0664062 -0.90625 0.71875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.214844 0.359375 -0.65625 -0.0986328 -0.867188 -0.972656 0.894531 0.882812 0.929688 0.125977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.613281 -0.228516 -0.390625 -0.964844 -0.800781 -0.535156 0.367188 -0.515625 -0.119629 0.365234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.753906 0.219727 -0.00964355 0.664062 -0.929688 -0.652344 0.816406 -0.217773 -0.482422 -0.632812 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.324219 0.507812 -0.375 -0.149414 0.0400391 -0.582031 0.0932617 0.134766 -0.628906 -0.933594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

B:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.597656 -0.234375 0.0284424 0.964844 7.43868e-39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
5.51013e-40 0 0 0 8.26519e-40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 8.26519e-40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-4.00741e-28 2.22514e+20 4.59177e-40 0 1.83634e-22 -6.77626e-21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers

Expected behavior

The result tensor should be correct and contain no garbage.


Additional context

I did another run without manually masking out the padded part of the tensor, thinking maybe the masking was hiding part of the bug. I got the following:

(python_env) marty@benderv2:~/Documents/ttnn-helloworld-cpp/build$ make && ./ttnn-hello 
[ 33%] Building CXX object CMakeFiles/ttnn-hello.dir/ttnn-hello.cpp.o
[ 66%] Linking CXX executable ttnn-hello
[100%] Built target ttnn-hello
                 Device | INFO     | Opening user mode device driver
2024-08-05 09:20:32.195 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | Running with 1 cqs 
                  Metal | INFO     | AI CLK for device 0 is:   1202 MHz
                  Metal | INFO     | Enabling program cache on device 0
                  Verif | INFO     | Created a random vector of size 100
A:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 -0.135742 -0.953125 -0.416016 0.0493164 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 -0.265625 -0.53125 -0.0874023 -0.816406 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0.18457 -0.0664062 -0.90625 0.71875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.214844 0.359375 -0.65625 -0.0986328 -0.867188 -0.972656 0.894531 0.882812 0.929688 0.125977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.613281 -0.228516 -0.390625 -0.964844 -0.800781 -0.535156 0.367188 -0.515625 -0.119629 0.365234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.753906 0.219727 -0.00964355 0.664062 -0.929688 -0.652344 0.816406 -0.217773 -0.482422 -0.632812 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.324219 0.507812 -0.375 -0.149414 0.0400391 -0.582031 0.0932617 0.134766 -0.628906 -0.933594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

B:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 -0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 -0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 0.570312 0.236328 
-0.597656 -0.234375 0.0284424 0.964844 7.43868e-39 0 0 0 9.18355e-41 0 4.59177e-40 0 1.52118e+32 -2.78307e+36 3.39474e+13 0 0 0 0 0 0 0 0 0 5.51013e-40 0 0 0 8.26519e-40 0 0 0 
5.51013e-40 0 0 0 8.26519e-40 0 0 0 0 0 0 0 7.43868e-39 0 0 0 9.18355e-41 0 4.59177e-40 0 2.94095e+32 -2.78307e+36 3.39474e+13 0 0 0 0 0 0 0 0 0 
0 0 0 0 8.26519e-40 0 0 0 0 0 0 0 8.26519e-40 0 0 0 0 0 0 0 7.43868e-39 0 0 0 -0.000182152 -1.2902e-17 3.32602e+13 0 6.07485e+25 1.06496e+06 1.19114e-21 -5.23986e+11 
0 0 0 0 0 0 0 0 -5.67907e+32 -2.88692e+36 3.39474e+13 0 -5.8819e+32 -2.88692e+36 3.39474e+13 0 -5.8819e+32 -2.88692e+36 3.39474e+13 0 9.18355e-41 0 0 0 0 0 0 0 2.36568e-37 0 0 0 
-8.9375 1.83747e+19 4.59177e-40 0 6.07485e+25 1.06496e+06 1.19114e-21 -5.23986e+11 1.18261e-24 -nan 5.91305e-25 -nan 5.6552e-16 1.5612e-39 3.5345e-17 1.18468e-38 2.20906e-18 1.33161e-38 1.38066e-19 2.9571e-38 -2.21181e-34 -1.625e-33 1.10591e-34 -1.625e-33 -6.83103e-34 2.1214e-38 -1.38066e-19 6.42848e-40 -2.21181e-34 6.52032e-39 -2.95868e+29 -2.96418e+38 
3.69566e-26 -nan 5.91305e-25 -nan 2.21181e-34 -1.60093e-33 1.10591e-34 -1.625e-33 1.72092e-35 7.34684e-39 -1.85584e+33 0 -5.52953e-35 -4.24663e+31 1.34665e-34 7.71875 -1.68331e-35 1.68059e-38 1.61628e+14 3.58158e-39 -1.68331e-35 1.9561e-38 -5.52953e-35 4.59177e-40 2.76476e-35 0 6.91191e-36 -1.60093e-33 1.44953e-29 -nan 2.43739e+11 2.31337e-35 
-1.19144e-10 5.51013e-40 -5.95719e-11 6.42848e-39 -2.9786e-11 1.2306e-38 -7.10152e-18 1.81834e-38 -1.45439e-14 2.46119e-38 -7.27196e-15 3.63669e-38 2.76476e-35 9.36722e-39 -5.52953e-35 4.92238e-38 -1.38066e-19 8.4856e-38 -2.20906e-18 7.89785e-38 -3.5345e-17 7.31011e-38 -5.6552e-16 9.45906e-39 -9.04832e-15 2.77343e-38 -1.44773e-13 2.47956e-38 -2.21181e-34 4.95912e-38 -2.95868e+29 -3.3023e+36 
1.52337e+10 3.37955e-37 9.49965e-25 3.48975e-39 -6.73323e-35 2.24997e-38 -9.08995e-16 5.51013e-40 2.69329e-34 2.24997e-38 -1.38066e-19 1.97446e-38 2.76476e-35 2.02038e-39 -2.95868e+29 -7.41045e+37 9.23914e-27 -nan 1.84783e-26 -nan 1.72798e-36 -1.58889e-33 -3.45595e-36 -1.58889e-33 2.94663e+35 1.35917e-38 3.63598e-15 3.67342e-40 1.72798e-36 6.24481e-39 -2.1214e-38 6.42848e-40 
-2.95868e+29 -2.11347e+38 5.07927e-15 -nan -3.5345e-17 2.53907e-35 2.75348e-34 -nan -2.21181e-34 1.14349e-19 -9.08995e-16 1.23978e-38 1.37674e-34 -nan 8.60462e-36 -nan 5.50696e-34 -nan 3.5345e-17 1720 -1.81799e-15 1.98365e-37 1.72092e-35 -nan 9.68794e-21 -nan 4.5113e-30 -nan 1.38066e-19 1832 -1.81799e-15 6.52032e-39 
1.66438e-10 0 7.75035e-20 0 6.91191e-36 532 2.31637e-12 116 -9.08995e-16 6.42848e-40 1.37674e-34 -nan 4.42362e-34 272 -1.10139e-33 -nan 9.04832e-15 107.5 -1.81799e-15 2.92404e-37 2.75348e-34 -nan -1.38238e-35 12608 1.81191e-30 -1.06829e-34 -3.5345e-17 115.5 2.75348e-34 -nan -3.63598e-15 3184 
-3.62383e-30 1.2857e-39 -4.54498e-16 1.23978e-38 -7.10152e-18 6.52032e-39 -3.55076e-18 6.42848e-40 7.24766e-30 2.10649e-33 2.75348e-34 -nan -1.44953e-29 -1.37753e-39 8.63988e-37 1.46937e-39 5.26034e-37 9.36722e-39 1.84164e+34 1.77186e-31 -1.81799e-15 1.83671e-38 4.74983e-25 2.47956e-38 1.61628e+14 2.47956e-38 -9.08995e-16 2.49793e-38 -2.37491e-25 2.46119e-38 -8.08141e+13 2.46119e-38 
-2.18593e-32 -1.52861e+38 2.98151e+33 1.08366e-38 1.48432e-26 3.58158e-39 8.08141e+13 4.95912e-39 4.20827e-36 1.451e-38 -5.68122e-17 3.67342e-40 -2.37491e-25 1.67141e-38 4.74983e-25 2.11222e-39 -2.84061e-17 4.50622e-19 -6.73323e-35 7.89785e-39 -1.34665e-34 2.18568e-38 -6.73323e-35 1.37753e-38 4.50113e+12 6.42848e-40 7.4216e-27 6.24481e-39 -2.90878e-14 1.98365e-37 2.40346e-20 -4.55261e+37 
-3.70619e-11 8.4856e-38 -1.38066e-19 1.23978e-38 -5.6552e-16 1.97446e-38 -9.04832e-15 1.38672e-38 -1.38066e-19 1.45174e-36 -3.5345e-17 1.81467e-37 1.81799e-15 4.59177e-40 1.10591e-34 2.11222e-39 1.38066e-19 2.10303e-38 1.81799e-15 7.34684e-40 1.10591e-34 2.11222e-39 1.38066e-19 2.13058e-38 -9.08995e-16 1.6856e-19 2.40346e-20 -1.42892e+38 -1.81799e-15 1.83671e-38 -5.52953e-35 1.12655e-19 
1.10591e-34 2.11222e-39 -5.6552e-16 2.13058e-38 -2.84061e-17 2.46119e-38 -1.81799e-15 1.2306e-38 1.34665e-34 1.23978e-38 -9.04832e-15 2.11222e-38 -9.08995e-16 6.42848e-39 2.93391e-24 1.97446e-38 -9.08995e-16 1.12655e-19 1.34665e-34 6.71875 -9.04832e-15 3.87941e-19 2.76476e-35 2.02038e-39 -2.95868e+29 -1.76679e+31 -2.15997e-37 1.83671e-40 2.21923e-19 1.77243e-38 2.84061e-17 1.18468e-38 
5.68122e-17 5.96931e-39 2.90878e-14 9.18355e-41 2.69996e-38 2.36936e-38 -9.45906e-39 0 2.53964e-15 0 2.75348e-34 -nan 4.36344e-34 0 0 0 4.43749e-37 0 9.45906e-39 1.79079e-38 -9.45906e-39 0 2.69996e-38 -4.28676e+37 8.60462e-36 -nan 2.75348e-34 -nan 3.70619e-11 1.5612e-39 2.31637e-12 1.18468e-38 
1.44773e-13 1.33161e-38 2.76476e-35 33.25 -2.21181e-34 33.75 1.09297e-32 2.25915e-38 -3.36662e-35 6.21875 1.38238e-35 0 6.91191e-36 33.25 -5.91735e+29 -4.80548e+25 2.75348e-34 -nan 9.0226e-30 -nan -6.91191e-36 33.75 7.24766e-30 35.75 8.82538e-30 5.46875 1.72092e-35 7.34684e-39 2.53964e-15 9.18355e-41 1.10591e-34 -4.27832e+31 
1.05925e+19 4.31627e-39 1.68331e-35 1.66222e-38 -1.68331e-35 2.10303e-38 1.44953e-29 -nan -2.21181e-34 33.75 2.76476e-35 0 -1.10591e-34 4.59177e-40 2.43739e+11 2.25695e-35 5.95719e-11 6.42848e-40 2.9786e-11 6.52032e-39 7.10152e-18 1.23978e-38 1.45439e-14 1.82753e-38 7.27196e-15 2.47956e-38 9.08995e-16 3.65505e-38 2.76476e-35 9.36722e-39 -1.38066e-19 7.89785e-38 
-2.20906e-18 7.31011e-38 -3.5345e-17 9.45906e-39 -5.6552e-16 2.77343e-38 -9.04832e-15 2.47956e-38 -1.44773e-13 1.97446e-38 1.10591e-34 4.95912e-38 -2.21181e-34 4.95912e-38 113.5 -4.08738e+37 4.08927e+18 8.44887e-38 9.49965e-25 3.48975e-39 3.36662e-35 6.46875 1.18746e-25 3.48975e-39 -1.68331e-35 2.24997e-38 8.41654e-36 2.24997e-38 2.19487e-38 3 4.30231e-36 -nan 
2.15115e-36 -nan -3.45595e-36 33 1.72798e-36 33 2.94663e+35 1.35917e-38 3.63598e-15 3.67342e-40 1.72798e-36 6.24481e-39 -2.1214e-38 6.42848e-40 1.37213e+26 -9.83629e+37 5.07927e-15 -nan -3.5345e-17 2.53907e-35 2.75348e-34 -nan 1.37674e-34 -nan -3.63598e-15 2.47956e-38 1.10591e-34 3.65505e-38 1.13624e-16 2.28699e-19 6.88369e-35 -nan 
-2.66837e-36 6.42848e-40 -5.52953e-35 3.04681e-11 2.75348e-34 1.83671e-40 -1.38238e-35 3.58158e-39 1.47826e-25 0 -1.38066e-19 2.67425e-37 1.38238e-35 -5.51013e-40 -2.20906e-18 2.67425e-37 -9.04832e-15 1.0323e-31 -1.38066e-19 1.08244e-25 -1.38066e-19 6.76527e-27 -9.04832e-15 4.43369e-22 -2.21181e-34 1.03766e+13 -2.31637e-12 3.79023e-31 -3.5345e-17 3.97434e-25 -3.5345e-17 2.48397e-26 
-2.31637e-12 1.62789e-21 1.72092e-35 3.67342e-40 2.75348e-34 -nan -1.38066e-19 2.81956e-31 5.50696e-34 1.74487e-39 -2.20906e-18 4.16001e-31 5.50696e-34 3.12241e-39 -3.5345e-17 4.16001e-31 5.50696e-34 4.49994e-39 -5.6552e-16 4.16001e-31 -9.04832e-15 2.81956e-31 5.50696e-34 1.01019e-39 -1.44773e-13 4.16001e-31 5.50696e-34 1.65304e-39 -2.31637e-12 4.16001e-31 5.50696e-34 2.29589e-39 
-3.70619e-11 4.16001e-31 -1.38066e-19 4.5113e-30 1.72092e-35 6.42848e-40 -2.20906e-18 4.5113e-30 1.72092e-35 9.18355e-40 -3.5345e-17 4.5113e-30 1.72092e-35 1.19386e-39 -5.6552e-16 4.5113e-30 2.76476e-35 2.46119e-38 -1.38066e-19 7.44793e-35 -2.20906e-18 7.44793e-35 -3.5345e-17 7.44793e-35 -5.6552e-16 7.44793e-35 3.44185e-35 9.18355e-41 2.76476e-35 -nan -9.04832e-15 4.65496e-36 
-1.38066e-19 1.57516e-36 -2.20906e-18 1.57516e-36 -3.5345e-17 1.57516e-36 -5.6552e-16 1.57516e-36 -1.38066e-19 3.92615e-36 -2.20906e-18 1.58692e-36 -3.5345e-17 1.58692e-36 -5.6552e-16 1.58692e-36 2.20906e-18 9.18355e-41 -3.63598e-15 4.67847e-36 1.72092e-35 -nan 2.20906e-18 2.21324e-38 -4.54498e-16 6.42848e-39 -3.63598e-15 1.2306e-38 -1.38238e-35 -nan 9.07638e+32 1.67141e-38 
1.72092e-35 -nan -7.2611e+33 1.68059e-38 -2.20906e-18 1.81834e-38 -3.5345e-17 2.25915e-38 -1.59694e+29 1.66153e+35 3.44185e-35 -nan 3.63598e-15 1.16374e-36 -5.96303e+33 2.43561e-29 -3.49749e-31 -4.48614e+37 -2.27249e-16 5.51013e-40 2.75348e-34 6.16298e-33 5.50696e-34 6.77626e-21 6.91191e-36 1.46937e-39 -91648 2.24997e-38 -7.2611e+33 2.47956e-38 -91648 2.24997e-38 
-1.38066e-19 2.25915e-38 3.70619e-11 2.43915e-37 2.46119e-38 -176128 7.49399e+19 -181248 3.5345e-17 9.18355e-41 9.08995e-16 1.18468e-38 2.75348e-34 -nan 3.5345e-17 1.96528e-38 9.08995e-16 1.2306e-38 1.72092e-35 9.40395e-38 3.5345e-17 1.91936e-38 -1.38066e-19 1.53365e-38 5.52953e-35 2.35099e-38 -2.20906e-18 1.97446e-38 -3.5345e-17 1.68059e-38 -5.6552e-16 1.68059e-38 
2.95652e-25 8 -9.04832e-15 1.68059e-38 -5.52953e-35 -6.63536e+29 -1.44773e-13 1.97446e-38 -2.95652e-25 8 -2.31637e-12 1.68059e-38 -1.38238e-35 -1.85691e+29 -3.70619e-11 1.68059e-38 -5.52953e-35 -1.65884e+29 -1.38066e-19 3.15914e-37 5.6552e-16 9.18355e-41 3.63598e-15 1.77243e-38 1.72092e-35 -nan 2.20906e-18 2.25915e-38 3.63598e-15 6.42848e-39 -1.38066e-19 4.59177e-40 
5.6552e-16 2.21324e-38 8.30193e-38 -3.88195e-34 6.88369e-35 2048 2.75348e-34 -nan -5.52953e-35 1.2306e-38 -1.38066e-19 1.97446e-38 2.95652e-25 2048 -5.52953e-35 1.52447e-38 -1.38066e-19 1.97446e-38 1.26982e-15 2048 -5.52953e-35 1.81834e-38 -1.38066e-19 1.97446e-38 5.45382e-06 2048 -5.52953e-35 2.11222e-38 -1.38066e-19 1.97446e-38 6.88369e-35 2080 
-5.52953e-35 2.46119e-38 -1.38066e-19 1.97446e-38 2.95652e-25 2080 -5.52953e-35 3.04894e-38 -1.38066e-19 1.97446e-38 1.26982e-15 2080 -5.52953e-35 3.63669e-38 -1.38066e-19 1.97446e-38 5.45382e-06 2080 -5.52953e-35 4.22443e-38 -1.38066e-19 1.97446e-38 -9.08995e-16 4.59177e-40 -2.75348e-34 -4.94765e-10 -2.21181e-34 1.9163e-32 -1.83369e-25 5.51013e-40 -2.75348e-34 -4.80213e-10 
-2.21181e-34 1.9163e-32 6.88369e-35 -nan -1.38066e-19 2.25915e-38 4.1326e-39 8192 5.60197e-39 1.37439e+11 9.00225e+12 1.13164e-18 3.6009e+13 3.37119e-19 -6.88369e-35 -nan -3.63598e-15 1.81834e-38 2.69329e-34 2.25915e-38 -9.49965e-25 2.47956e-38 -3.23256e+14 2.47956e-38 -1.36621e-33 -1.10991e+38 3.48975e-39 -1.66729e-24 9.08995e-16 1.34848e-18 3.63598e-15 6.74238e-19 
3.44185e-35 1.82536e+10 -2.69329e-34 1.97446e-38 -2.21181e-34 -nan 6.88369e-35 -nan -1.38066e-19 1.81834e-38 -9.49965e-25 2.47956e-38 3.44185e-35 2048 -3.23256e+14 1.23978e-38 2.76476e-35 4.92238e-38 -3.36662e-35 1.82753e-38 -1.38066e-19 1.81834e-38 6.34909e-16 2048 2.76476e-35 6.09788e-38 -3.36662e-35 1.82753e-38 -1.38066e-19 1.81834e-38 11712 2048 
2.76476e-35 7.27337e-38 -3.36662e-35 1.82753e-38 -1.38066e-19 1.81834e-38 2.16048e+23 2048 2.76476e-35 8.44887e-38 -2.69329e-34 1.82753e-38 -1.38066e-19 2.25915e-38 8.99988e-39 -1.54074e-33 1.96528e-38 -2.75 3.63669e-38 -2.8125 4.77545e-38 -1.54074e-33 4.22443e-38 -2.875 7.86112e-38 -2.9375 0 2.40741e-35 8.30193e-38 -7.70372e-34 -3.63598e-15 4.59177e-40 

                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers
tarafdarTT commented 1 month ago

Hey @marty1885 ,

this is a little strange.

A couple of notes: we have recently uplifted unpad, and it's now ttnn::slice(<input tensor>, start, end). If the tensor is a host tensor, it will do the same host-side unpad as we had previously; if it's a device tensor, it will do a device slice/unpad.

I have a python example here and verified that it works: https://github.com/tenstorrent/tt-metal/blob/eb1d9a9f2f1e10d811a27719486c8a30d5f792d4/tests/ttnn/unit_tests/operations/test_host_slice.py

This shows the underlying unpad/slice API works on host. However, if you're still having problems with ttnn::slice on the latest codebase, then it's a C++ API issue and I can look at that. Do you mind giving ttnn::slice a shot first and letting me know? Then I can proceed to debug this with you :)
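
For reference, a minimal sketch of the call on your side (illustrative only; I'm assuming the same start/end coordinates that unpad takes, and input here is a stand-in for your tensor):

// Illustrative sketch, assuming the same start/end convention as unpad.
Shape start(std::vector<uint32_t>{0, 0, 0, 0});
Shape end(std::vector<uint32_t>{0, 0, 5, 5});
auto sliced = ttnn::slice(input, start, end);  // host tensor -> host unpad path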

marty1885 commented 1 month ago

Hi @tarafdarTT ,

I am aware of the API. Unfortunately, I cannot always use ttnn::slice. GGML's tests often ask for non-tile-aligned views into a tensor, mostly because test tensors are small, but slice requires both coordinates to be tile-aligned. This is my current code.

I suspect there's more to this bug. A lot of OPs are failing with non-tile-aligned tensors on my side, but I can't be sure, as unpad is used a lot in testing. So far MatMul, hardswish, and transpose look likely to be malfunctioning too.

if(dst_size[0] % tt::constants::TILE_WIDTH == 0 && dst_size[1] % tt::constants::TILE_HEIGHT == 0 &&
    start[2] % tt::constants::TILE_WIDTH == 0 && start[3] % tt::constants::TILE_HEIGHT == 0) {
    res = ttnn::slice(*parent, start, end);
}
else {
    // THIS is EXTREMELY SLOW. But it works
    tt::tt_metal::Tensor tmp = parent->cpu().to(tt::tt_metal::Layout::ROW_MAJOR).unpad(start, end);
    res = ttnn::tilize_with_zero_padding(tmp.to(bufctx->device));
}

If possible, I would strongly prefer getting unpad working again. Otherwise I lose a lot of tests.

=====

Edit: Sorry, I misread your comment. ttnn::slice is working normally for me. Only unpad is broken.

marty1885 commented 1 month ago

Hi, I was messing around and noticed that tensor.volume() no longer returns the padded volume; instead it returns the non-padded one. Could this be related to this issue, with non-tile-aligned strides and sizes now being off?
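
For illustration, what I mean (a sketch based on my repro above, where a {1, 1, 10, 10} tensor gets tilized to {1, 1, 32, 32}):

// Sketch based on the repro above, if I understand the change correctly:
auto t = make_random_tensor({1, 1, 10, 10});
std::cout << t.volume() << "\n";                              // now 10 * 10 = 100 (non-padded)
std::cout << t.shape().with_tile_padding().volume() << "\n";  // still 32 * 32 = 1024 (padded)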

tarafdarTT commented 1 month ago

ahh thanks @marty1885, I'll have a look at it today and let you know!

tarafdarTT commented 1 month ago

hey @marty1885 I have a fix. You're correct that the tilization was fishy: when doing unpad (slice) on host, it used the non-tilized shape to allocate the buffer. This commit is a fix: https://github.com/tenstorrent/tt-metal/commit/381ee8c7a6568b5308d8deb1bd6c0037c2f458f8

I'm in the process of adding your test as a unit test to avoid a regression on this, and once I have that I can merge the above commit to main.
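
In other words (an illustrative sketch of the allocation change, not the literal patch):

// Sketch only: the tilized data occupies whole 32x32 tiles, so the host-side
// buffer must be sized with the tile-padded volume, not the logical volume.
auto padded_volume = tensor.shape().with_tile_padding().volume();
std::vector<bfloat16> buffer(padded_volume);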

marty1885 commented 1 month ago

@tarafdarTT Thanks! The commit removes a few errors for me (but now some of the view tests fail with an incorrect shape). Do you know why the change to reporting tilized sizes was made? I need some time to debug and isolate the remaining operator failures. They seem to more or less relate to shapes and strides.

Edit: Now I'm very confused. Converting to row major used to always leave the last 2 dimensions padded to 32x32, so I'm expecting 0s at the end of each row. But now the padding is not present anymore. Is this expected?

                 Device | INFO     | Opening user mode device driver
2024-08-07 01:12:54.546 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | Running with 1 cqs 
                  Metal | INFO     | AI CLK for device 0 is:   1202 MHz
                  Metal | INFO     | Enabling program cache on device 0
                  Verif | INFO     | Created a random vector of size 100
A:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 -0.135742 -0.953125 -0.416016 0.0493164 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 -0.265625 -0.53125 -0.0874023 -0.816406 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0.18457 -0.0664062 -0.90625 0.71875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.214844 0.359375 -0.65625 -0.0986328 -0.867188 -0.972656 0.894531 0.882812 0.929688 0.125977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.613281 -0.228516 -0.390625 -0.964844 -0.800781 -0.535156 0.367188 -0.515625 -0.119629 0.365234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.753906 0.219727 -0.00964355 0.664062 -0.929688 -0.652344 0.816406 -0.217773 -0.482422 -0.632812 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.324219 0.507812 -0.375 -0.149414 0.0400391 -0.582031 0.0932617 0.134766 -0.628906 -0.933594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

B:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 -0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 -0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 0.570312 0.236328 
-0.597656 -0.234375 0.0284424 0.964844 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers
marty1885 commented 1 month ago

Quick update. I found that the updated unpad is producing the wrong shape. I've updated my test code and got the following output: unpad should produce a tensor of shape [10, 10, 10, 128], but I got [10, 10, 32, 128]. It also randomly hangs my e150, so likely something lower-level is wrong as well.

int main()
{
    device = &ttnn::device::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device);
    tt::tt_metal::detail::EnablePersistentKernelCache();

    auto a = make_random_tensor({10, 10, 10, 384});
    std::cout << "A shape: " << a.shape() << std::endl;

    Shape start(std::vector<uint32_t>{0, 0, 0, 0});
    Shape end(std::vector<uint32_t>{9, 9, 9, 127});
    auto b = a.cpu().to(tt::tt_metal::Layout::ROW_MAJOR).unpad(start, end);
    std::cout << "Expecting B to be: [10, 10, 10, 128]\n";
    std::cout << "B shape: " << b.shape() << std::endl;
    auto c = ttnn::tilize_with_zero_padding(b.to(device));

    std::cout << "Expecting C to be [10, 10, 10[32], 128]\n"
        << "Got: " << c.shape() << "\n";

    device->close();
}
                 Device | INFO     | Opening user mode device driver
2024-08-07 01:44:04.494 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | Running with 1 cqs 
                  Metal | INFO     | AI CLK for device 0 is:   1202 MHz
                  Metal | INFO     | Enabling program cache on device 0
                  Verif | INFO     | Created a random vector of size 384000
A shape: ttnn.Shape([10, 10, 10[32], 384])
Expecting B to be: [10, 10, 10, 128]
B shape: ttnn.Shape([10, 10, 32, 128])
Expecting C to be [10, 10, 10[32], 128]
Got: ttnn.Shape([10, 10, 32, 128])
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers
tarafdarTT commented 1 month ago

hmmm this is strange! I will have a look at this further today

tarafdarTT commented 1 month ago

@marty1885 I solved it. My commit was incorrect and we don't need it. The code is actually working; the only funky thing is your print function dump_first_tile_of_tensor. The volume being allocated is not that of a full tile: it is using the untilized volume of the unpadded tensor. The size of a tile is the same no matter the shape of the tensor (32x32 = 1024).

What we want is

    uint32_t volume = 1024;  //for volume of a single TILE 
    std::vector<bfloat16> buf(volume);
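
Equivalently (just as an illustration), the same size can be derived from the tile constants instead of hard-coding 1024:

    uint32_t volume = tt::constants::TILE_HEIGHT * tt::constants::TILE_WIDTH;  // 32 * 32 = 1024
    std::vector<bfloat16> buf(volume);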

This then dumps out with your function:

A:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 -0.135742 -0.953125 -0.416016 0.0493164 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 -0.265625 -0.53125 -0.0874023 -0.816406 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0.18457 -0.0664062 -0.90625 0.71875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.214844 0.359375 -0.65625 -0.0986328 -0.867188 -0.972656 0.894531 0.882812 0.929688 0.125977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.613281 -0.228516 -0.390625 -0.964844 -0.800781 -0.535156 0.367188 -0.515625 -0.119629 0.365234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.753906 0.219727 -0.00964355 0.664062 -0.929688 -0.652344 0.816406 -0.217773 -0.482422 -0.632812 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.324219 0.507812 -0.375 -0.149414 0.0400391 -0.582031 0.0932617 0.134766 -0.628906 -0.933594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

B:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.597656 -0.234375 0.0284424 0.964844 0.730469 -0.332031 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Previously the buffer was too small: it didn't allocate enough space for the tensor including the padding.

Here is another example of the same thing that I did in Python:

@pytest.mark.parametrize("n", [1])
@pytest.mark.parametrize("c", [1])
@pytest.mark.parametrize("h", [10])
@pytest.mark.parametrize("w", [10])
def test_tensor_unpad_tiled_input(device, n, c, h, w):
    torch_input_tensor = torch.rand((n, c, h, w), dtype=torch.bfloat16)
    torch_output_tensor = torch_input_tensor[:, :, :6, :6]
    activation_pyt_padded_device = ttnn.from_torch(
        torch_input_tensor,
        dtype=ttnn.DataType.BFLOAT16,
        layout=ttnn.ROW_MAJOR_LAYOUT,
        device=device
    )
    activation_pyt_padded_device_tiled = ttnn.tilize_with_zero_padding(activation_pyt_padded_device)
    activation_pyt_padded_host_tiled = activation_pyt_padded_device_tiled.cpu()
    activation_pyt_padded_host_row_major = activation_pyt_padded_host_tiled.to(ttnn.ROW_MAJOR_LAYOUT) 

    activation_pyt_out_unpadded = activation_pyt_padded_host_row_major.unpad((0, 0, 0, 0), (n - 1, c - 1, 5, 5))

    activation_pyt_padded_out = ttnn.to_torch(activation_pyt_out_unpadded)
    assert_with_pcc(torch_output_tensor, activation_pyt_padded_out, 0.9999)

Under the hood, our to_torch function takes padding and related details into consideration. You can have a look at that function if you need some of the intricacies around padding.
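
For example (an illustrative sketch of that behavior, not taken from the test suite):

# Sketch: from_torch with TILE_LAYOUT pads the last two dims up to 32 on host,
# while to_torch strips that padding and returns the logical shape.
t = ttnn.from_torch(torch.rand((1, 1, 10, 10), dtype=torch.bfloat16), layout=ttnn.TILE_LAYOUT)
assert ttnn.to_torch(t).shape == (1, 1, 10, 10)  # not (1, 1, 32, 32)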

marty1885 commented 1 month ago

@tarafdarTT Huh... something feels wrong. I'll get back to you soon. I think I messed up my environment and executing anything on my e150 hangs. I need to fix that first.

marty1885 commented 3 weeks ago

@tarafdarTT Sorry for the delay. The issue arose when I upgraded to a newer version of TTNN. I can finally get back to this.

Unpad is not acting correctly even in your example. We can see that the 2nd row does not match up properly.

A: 
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

B:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.597656 -0.234375 0.0284424 0.964844 0.730469 -0.332031 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

It is clearer if we look at column 0.

A:
-0.25
-0.6875
-0.957031

B:
-0.25
-0.597656 <- Differs from tensor A
-0.957031

I checked Python and it works correctly:

import ttnn
import torch
device = ttnn.open_device(0)

shape = (1, 1, 10, 10)
tensor = torch.rand(*shape, dtype=torch.float32)
a = ttnn.from_torch(tensor, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
b = a.cpu().to(ttnn.ROW_MAJOR_LAYOUT).unpad((0, 0, 0, 0), (0, 0, 5, 5))

print(a.cpu().to(ttnn.ROW_MAJOR_LAYOUT).to_torch())
print(b.to_torch())

Which outputs

                 Device | INFO     | Opening user mode device driver
2024-08-20 03:58:13.478 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1202 MHz
tensor([[[[0.3184, 0.4648, 0.9023,  ..., 0.0000, 0.0000, 0.0000],
          [0.2559, 0.8008, 0.3164,  ..., 0.0000, 0.0000, 0.0000],
          [0.8867, 0.6758, 0.3867,  ..., 0.0000, 0.0000, 0.0000],
          ...,
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]]],
       dtype=torch.bfloat16)
tensor([[[[3.1836e-01, 4.6484e-01, 9.0234e-01, 5.5469e-01, 5.0391e-01,
           2.7344e-01],
          [2.5586e-01, 8.0078e-01, 3.1641e-01, 6.5234e-01, 2.9102e-01,
           6.0938e-01],
          [8.8672e-01, 6.7578e-01, 3.8672e-01, 1.0107e-01, 1.3550e-02,
           7.6953e-01],
          [6.6797e-01, 3.7598e-02, 1.8457e-01, 3.2031e-01, 8.2422e-01,
           2.1172e-04],
          [2.1387e-01, 1.8457e-01, 4.2969e-02, 8.0469e-01, 3.0078e-01,
           2.8125e-01],
          [5.3125e-01, 6.2109e-01, 5.4297e-01, 1.9629e-01, 4.0039e-01,
           8.5547e-01]]]], dtype=torch.bfloat16)
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers

With some more digging, I found that it might be related to the to(device) + memcpy combo, or to one of the conversion ops. It works correctly if I use the storage buffer directly.

#include <cstddef>
#include <ttnn/operations/eltwise/unary/unary.hpp>
#include <ttnn/operations/eltwise/ternary/where.hpp>
#include <ttnn/device.hpp>
#include <ttnn/operations/data_movement/tilize_with_val_padding/tilize_with_val_padding.hpp>

#include "common/bfloat16.hpp"
#include "tt_dnn/op_library/auto_format.hpp"
#include "ttnn/operations/eltwise/unary/unary_composite.hpp"
#include "ttnn/tensor/tensor.hpp"
#include <tt_metal/detail/persistent_kernel_cache.hpp>
#include "ttnn/tensor/tensor.hpp"
#include "ttnn/tensor/types.hpp"

#include <vector>
#include <iostream>

ttnn::device::Device* device = nullptr;

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s
        , tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    return ttnn::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()));
}

void dump_first_tile_of_tensor(tt::tt_metal::Tensor tensor)
{
    std::cout << "dump_first_tile_of_tensor" << std::endl;
    assert(tensor.dtype() == tt::tt_metal::DataType::BFLOAT16);
    auto t = tensor;
    if(t.storage_type() == tt::tt_metal::StorageType::DEVICE) {
        std::cout << "To CPU " << std::endl;
        t = t.cpu();
    }
    if(t.layout() != tt::tt_metal::Layout::ROW_MAJOR) {
        std::cout << "To ROW" << std::endl;
        t = t.to(tt::tt_metal::Layout::ROW_MAJOR);
    }

    // This fails. Having issues on the 2nd row
    // std::cout << "Copy to device" << std::endl;
    // t = t.to(AutoFormat::GetDefaultDevice());
    // std::vector<bfloat16> buf(1024);
    // memcpy(buf.data(), t);
    // for(int y = 0; y < 32; y++) {
    //     for(int x = 0; x < 32; x++) {
    //         std::cout << buf[y*32+x].to_float() << " ";
    //     }
    //     std::cout << "\n";
    // }
    // std::cout << "\n";

    // This works, however
    auto storage = std::get<tt::tt_metal::OwnedStorage>(t.storage());
    auto buf = std::get<tt::tt_metal::owned_buffer::Buffer<bfloat16>>(storage.get_buffer());
    auto ps = t.shape().with_tile_padding();

    for(int y = 0; y < ps[2]; y++) {
        for(int x = 0; x < ps[3]; x++) {
            std::cout << buf[y * ps[3]+x].to_float() << " ";
        }
        std::cout << "\n";
    }
    std::cout << "\n";
}

int main()
{
    device = &ttnn::device::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device);
    tt::tt_metal::detail::EnablePersistentKernelCache();

    auto a = make_random_tensor({1, 1, 10, 10});

    Shape start(std::vector<uint32_t>{0, 0, 0, 0});
    Shape end(std::vector<uint32_t>{0, 0, 5, 5});
    auto b = a.cpu().to(tt::tt_metal::Layout::ROW_MAJOR).unpad(start, end);

    std::cout << "A:\n";
    dump_first_tile_of_tensor(a);
    std::cout << "B:\n";
    dump_first_tile_of_tensor(b);

    device->close();
}

With output:

                 Device | INFO     | Opening user mode device driver
2024-08-20 04:19:19.493 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1202 MHz
                  Metal | INFO     | Enabling program cache on device 0
A:
dump_first_tile_of_tensor
To CPU 
To ROW
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 -0.135742 -0.953125 -0.416016 0.0493164 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 -0.265625 -0.53125 -0.0874023 -0.816406 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0.18457 -0.0664062 -0.90625 0.71875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.214844 0.359375 -0.65625 -0.0986328 -0.867188 -0.972656 0.894531 0.882812 0.929688 0.125977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.613281 -0.228516 -0.390625 -0.964844 -0.800781 -0.535156 0.367188 -0.515625 -0.119629 0.365234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-0.753906 0.219727 -0.00964355 0.664062 -0.929688 -0.652344 0.816406 -0.217773 -0.482422 -0.632812 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0.324219 0.507812 -0.375 -0.149414 0.0400391 -0.582031 0.0932617 0.134766 -0.628906 -0.933594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

B:
dump_first_tile_of_tensor
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 

                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers

I'm on commit 046237fd9c24f51fefd05f66c270c78e606eae85

> What we want is
>
>     uint32_t volume = 1024;  //for volume of a single TILE
>     std::vector<bfloat16> buf(volume);

I see, thanks! I assumed the padded volume was always a multiple of 1024, since that's the tile size. Thanks for pointing it out.

eyonland commented 6 days ago

@marty1885, I do not believe that we have a bug here. The memcpy seems to work as expected. Take a look specifically at the unit test I added on the ttnn-11082-add-test branch, called test_unpad.cpp.
After the memcpy, notice the use of device_width so that when the tensor is in row-major layout we are not incorrectly using a row stride of 32 but rather 6.

    const auto shape = t.get_shape();                    // logical (unpadded) shape
    const auto dim = shape.rank();
    const auto width = shape[-1];                        // logical width
    const auto height = shape[-2];                       // logical height
    const auto device_width = t.get_legacy_shape()[-1];  // row stride as laid out in the buffer (6 here, not 32)
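
The logical region is then read with device_width as the row stride (an illustrative reconstruction of the idea, not the literal test code):

    // Sketch: walk the logical height x width region, striding each row by
    // the buffer's actual width so a 6-wide row-major tensor is not read
    // with a stride of 32.
    for (uint32_t y = 0; y < height; y++) {
        for (uint32_t x = 0; x < width; x++) {
            std::cout << buf[y * device_width + x].to_float() << " ";
        }
        std::cout << "\n";
    }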