tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
303 stars 26 forks source link

Crash in ~ProgramSrcToDstAddrMap in side branch pkeller/watcher8 #2835

Closed pgkeller closed 9 months ago

pgkeller commented 9 months ago

Pull branch pkeller/watcher8 Last commit causes multiple crashes running: python tests/scripts/run_tt_eager.py In particular, try test (among others): build/test/tt_eager/ops/test_tilize_zero_padding_channels_last

Some tests print "double free" others "corrupted size vs prev size" which indicates writing out of bounds of malloced memory

The last commit checks for redundant open/close of the device and introduces a static vector, presumably that is changing alloc order. Or the code itself is wrong, though I don't think so.

DrJessop commented 9 months ago

I cannot reproduce the issue from the above instructions on my cloud BM (172.27.28.139). Please update the issue with your machine, and provide me with a user so that I can try reproducing.

pgkeller commented 9 months ago

Other tests that fail: test_bmm_op test_tilize_zero_padding

My machine is e13cs03

pgkeller commented 9 months ago

Rebased the branch to pick up latest umd and the 0 size noc transaction fix. Confirmed bmm_op and test_tilize_zero_padding_channels_list fail, but didn't repo test_tilize_zero_padding on the baremetal machine and that just bmm_op fails on an IRD machine, fwiw

DrJessop commented 9 months ago

TO add to this, I can reproduce the double-free error with slow dispatch too

TT_METAL_SLOW_DISPATCH_MODE=1 build/test/tt_eager/ops/test_bmm_op 

will fail with

double free or corruption (!prev)
Aborted (core dumped)
DrJessop commented 9 months ago

Findings, by just including

Tensor temp = tt::tt_metal::add(a, a);

in the test_bmm_op test, we do not hit the double free error. There seems to be something different about add vs bmm/matmul, however I confirmed they all run with the first variant of run_with_autoformat.

I will re-assign to @tt-aho since he is more familiar with the op code, and he can re-assign again if he believes he is not the right fit.

@tt-aho Just to be clear, in Paul's branch, run the following

TT_METAL_SLOW_DISPATCH_MODE=1 build/test/tt_eager/ops/test_bmm_op 

and you will see

double free or corruption (!prev)
Aborted (core dumped)

In the test file (tt-metal/tests/tt_eager/ops/test_bmm_op.cpp), apply the following patch.

commit 186c6440d1077059b65c0167a0d801a652fd08f4
Author: DrJessop <andrewgrebenisan@gmail.com>
Date:   Fri Sep 22 23:41:25 2023 +0000

    #0: Temp

diff --git a/tests/tt_eager/ops/test_bmm_op.cpp b/tests/tt_eager/ops/test_bmm_op.cpp
index 4cc6c90ba..5b9095341 100644
--- a/tests/tt_eager/ops/test_bmm_op.cpp
+++ b/tests/tt_eager/ops/test_bmm_op.cpp
@@ -7,6 +7,8 @@
 #include "tt_dnn/op_library/bmm/bmm_op.hpp"
 #include "common/constants.hpp"
 #include "tt_numpy/functions.hpp"
+#include "tt_dnn/op_library/eltwise_binary/eltwise_binary_op.hpp"
+

 #include <algorithm>
 #include <functional>
@@ -24,26 +26,12 @@ int main(int argc, char **argv) {
     bool pass = true;

     try {
-        ////////////////////////////////////////////////////////////////////////////
-        //                      Initial Runtime Args Parse
-        ////////////////////////////////////////////////////////////////////////////
-        std::vector<std::string> input_args(argv, argv + argc);
-        string arch_name = "";
-        try {
-            std::tie(arch_name, input_args) =
-                test_args::get_command_option_and_remaining_args(input_args, "--arch", "grayskull");
-        } catch (const std::exception& e) {
-            log_fatal(tt::LogTest, "Command line arguments found exception", e.what());
-        }
-        const tt::ARCH arch = tt::get_arch_from_string(arch_name);
         ////////////////////////////////////////////////////////////////////////////
         //                      Device Setup
         ////////////////////////////////////////////////////////////////////////////
         int device_id = 0;
         tt_metal::Device *device = tt_metal::CreateDevice(device_id);

-
-
         ////////////////////////////////////////////////////////////////////////////
         //                      Application Setup
         ////////////////////////////////////////////////////////////////////////////
@@ -61,6 +49,7 @@ int main(int argc, char **argv) {
         Tensor b = tt::numpy::zeros(shapeb, DataType::BFLOAT16).to(Layout::TILE).to(device);
         Tensor b1 = tt::numpy::zeros(shapeb1, DataType::BFLOAT16).to(Layout::TILE).to(device);

+        Tensor temp = tt::tt_metal::add(a, a);
         Tensor mm = bmm(a, b).cpu();
         Tensor mm1 = matmul(a, b1).cpu();

and recompile and rerun, and the error should go away.

pgkeller commented 9 months ago

We don't know 100% for sure that this is an op bug, could be corruption somewhere else that the changes above are perturbing, but an op issue is the leading candidate. This is important as it is gating watcher deployment to CI. fyi: @jvasilje @davorchap

tt-aho commented 9 months ago

Does this only happen with c++ op tests? Or the pytests too?

tt-aho commented 9 months ago

Summary of my testing: Does not seem to be an op bug.

Built with CONFIG=asan for address sanitizer. I ran an eager test that does not use any ops, build/test/tt_eager/tensors/test_host_device_loopback, and I see double free error (I only saw this error either when building with sanitizer, or setting MALLOCCHECK to 3 for example, otherwise regular build did not error out), so it does not seem like an op/op infra issue. I further modified the test with the diff below,

diff --git a/tests/tt_eager/tensors/test_host_device_loopback.cpp b/tests/tt_eager/tensors/test_host_device_loopback.cpp
index 5fbadfce1..f1b063ff7 100644
--- a/tests/tt_eager/tensors/test_host_device_loopback.cpp
+++ b/tests/tt_eager/tensors/test_host_device_loopback.cpp
@@ -4,9 +4,6 @@

 #include "tt_metal/host_api.hpp"
 #include "tensor/tensor.hpp"
-#include "tensor/owned_buffer.hpp"
-#include "tensor/owned_buffer_functions.hpp"
-#include "tt_dnn/op_library/eltwise_binary/eltwise_binary_op.hpp"
 #include "common/constants.hpp"
 #include "tt_numpy/functions.hpp"

@@ -24,11 +21,6 @@ bool test_single_tile_single_dram_bank_loopback(Device *device) {
     Shape single_tile_shape = {1, 1, TILE_HEIGHT, TILE_WIDTH};

     Tensor host_a = tt::numpy::random::random(single_tile_shape).to(Layout::TILE);
-    Tensor device_a = host_a.to(device);
-    Tensor loopbacked_a = device_a.cpu();
-    auto host_a_data = owned_buffer::get_as<bfloat16>(host_a);
-    auto loopbacked_a_data = owned_buffer::get_as<bfloat16>(loopbacked_a);
-    pass &= host_a_data == loopbacked_a_data;

     return pass;
 }
@@ -37,12 +29,6 @@ bool test_multi_tile_multi_dram_bank_loopback(Device *device) {
     bool pass = true;
     Shape multi_tile_shape = {1, 1, 4*TILE_HEIGHT, 3*TILE_WIDTH};

-    Tensor host_a = tt::numpy::random::random(multi_tile_shape).to(Layout::TILE);
-    Tensor device_a = host_a.to(device);
-    Tensor loopbacked_a = device_a.cpu();
-    auto host_a_data = owned_buffer::get_as<bfloat16>(host_a);
-    auto loopbacked_a_data = owned_buffer::get_as<bfloat16>(loopbacked_a);
-    pass &= host_a_data == loopbacked_a_data;
     return pass;
 }

@@ -53,27 +39,13 @@ int main(int argc, char **argv) {
         ////////////////////////////////////////////////////////////////////////////
         //                      Initial Runtime Args Parse
         ////////////////////////////////////////////////////////////////////////////
-        std::vector<std::string> input_args(argv, argv + argc);
-        string arch_name = "";
-        try {
-            std::tie(arch_name, input_args) =
-                test_args::get_command_option_and_remaining_args(input_args, "--arch", "grayskull");
-        } catch (const std::exception& e) {
-            log_fatal(tt::LogTest, "Command line arguments found exception", e.what());
-        }
-        const tt::ARCH arch = tt::get_arch_from_string(arch_name);
+
         ////////////////////////////////////////////////////////////////////////////
         //                      Device Setup
         ////////////////////////////////////////////////////////////////////////////
         int device_id = 0;
         tt_metal::Device *device = tt_metal::CreateDevice(device_id);

-
-
-        pass &= test_single_tile_single_dram_bank_loopback(device);
-
-        pass &= test_multi_tile_multi_dram_bank_loopback(device);
-
         pass &= tt_metal::CloseDevice(device);

     } catch (const std::exception &e) {

This modifies the test to not do anything except open/close device, and I still see double free issue. The output of address sanitizer is below. It is a double free on the static vector Paul added, and seems to be coming from the metal library libtt_metal.so.

==258988==ERROR: AddressSanitizer: attempting double-free on 0x6020000071d0 in thread T0:
    #0 0x7facaa83e51f in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cc:165
    #1 0x7faca9783fdd in __cxa_finalize /build/glibc-SzIz7B/glibc-2.31/stdlib/cxa_finalize.c:83
    #2 0x7faca9eeb7a6  (/home/aho/gp.ai3/build/lib/libtt_metal.so+0xd87a6)

0x6020000071d0 is located 0 bytes inside of 4-byte region [0x6020000071d0,0x6020000071d4)
freed by thread T0 here:
    #0 0x7facaa83e51f in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cc:165
    #1 0x7faca97838a6 in __run_exit_handlers /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:108

previously allocated by thread T0 here:
    #0 0x7facaa83d587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x55848e70fa9d in __gnu_cxx::new_allocator<tt::tt_metal::Device::ActiveState>::allocate(unsigned long, void const*) /usr/include/c++/9/ext/new_allocator.h:114
    #2 0x55848e70fa9d in std::allocator_traits<std::allocator<tt::tt_metal::Device::ActiveState> >::allocate(std::allocator<tt::tt_metal::Device::ActiveState>&, unsigned long) /usr/include/c++/9/bits/alloc_traits.h:443
    #3 0x55848e70fa9d in std::_Vector_base<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::_M_allocate(unsigned long) /usr/include/c++/9/bits/stl_vector.h:343
    #4 0x55848e70fa9d in std::vector<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::_M_default_append(unsigned long) /usr/include/c++/9/bits/vector.tcc:635
    #5 0x55848e706bca in std::vector<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::resize(unsigned long) /usr/include/c++/9/bits/stl_vector.h:937
    #6 0x55848e706bca in tt::tt_metal::Device::activate_device_in_list() tt_metal/impl/device/device.cpp:209
    #7 0x55848e709023 in tt::tt_metal::Device::initialize(std::vector<unsigned int, std::allocator<unsigned int> > const&) tt_metal/impl/device/device.cpp:224
    #8 0x55848e709b14 in tt::tt_metal::Device::Device(int, std::vector<unsigned int, std::allocator<unsigned int> > const&) tt_metal/impl/device/device.cpp:24
    #9 0x7faca9eec552 in tt::tt_metal::CreateDevice(int, std::vector<unsigned int, std::allocator<unsigned int> > const&) tt_metal/tt_metal.cpp:171
    #10 0x55848e594a9a in main tests/tt_eager/tensors/test_host_device_loopback.cpp:47

SUMMARY: AddressSanitizer: double-free ../../../../src/libsanitizer/asan/asan_new_delete.cc:165 in operator delete(void*)
==258988==ABORTING

What is interesting is if I additionally remove the line Tensor host_a = tt::numpy::random::random(single_tile_shape).to(Layout::TILE); from test_single_tile_single_dram_bank_loopback then the test no longer errors out at all.

tt-aho commented 9 months ago

Not sure who should actually look into this so assigning back to @DrJessop for now.

tt-aho commented 9 months ago

I got a better trace after building with 0 optimizations. It looks like both the main exit handler and the ttmetal shared library cleanup routine are both trying to free the global vector. I also tested build/test/tt_metal/test_bmm and do not see double free issue, but if I copy this test and add it to tt_eager tests, then I see this double free issue as well, so this seems like it might be a build/linker issue with how we build tt_eager tests?

==370509==ERROR: AddressSanitizer: attempting double-free on 0x6020000071d0 in thread T0:
    #0 0x7f3125dd051f in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cc:165
    #1 0x557383d363e9 in __gnu_cxx::new_allocator<tt::tt_metal::Device::ActiveState>::deallocate(tt::tt_metal::Device::ActiveState*, unsigned long) /usr/include/c++/9/ext/new_allocator.h:128
    #2 0x557383d2e917 in std::allocator_traits<std::allocator<tt::tt_metal::Device::ActiveState> >::deallocate(std::allocator<tt::tt_metal::Device::ActiveState>&, tt::tt_metal::Device::ActiveState*, unsigned long) /usr/include/c++/9/bits/alloc_traits.h:469
    #3 0x557383d28091 in std::_Vector_base<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::_M_deallocate(tt::tt_metal::Device::ActiveState*, unsigned long) /usr/include/c++/9/bits/stl_vector.h:351
    #4 0x557383d21a69 in std::_Vector_base<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::~_Vector_base() /usr/include/c++/9/bits/stl_vector.h:332
    #5 0x557383d3e053 in std::vector<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::~vector() /usr/include/c++/9/bits/stl_vector.h:680
    #6 0x7f31248b5fdd in __cxa_finalize /build/glibc-SzIz7B/glibc-2.31/stdlib/cxa_finalize.c:83
    #7 0x7f31252fbef6  (/home/aho/gp.ai3/build/lib/libtt_metal.so+0x3b6ef6)

0x6020000071d0 is located 0 bytes inside of 4-byte region [0x6020000071d0,0x6020000071d4)
freed by thread T0 here:
    #0 0x7f3125dd051f in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cc:165
    #1 0x557383d363e9 in __gnu_cxx::new_allocator<tt::tt_metal::Device::ActiveState>::deallocate(tt::tt_metal::Device::ActiveState*, unsigned long) /usr/include/c++/9/ext/new_allocator.h:128
    #2 0x557383d2e917 in std::allocator_traits<std::allocator<tt::tt_metal::Device::ActiveState> >::deallocate(std::allocator<tt::tt_metal::Device::ActiveState>&, tt::tt_metal::Device::ActiveState*, unsigned long) /usr/include/c++/9/bits/alloc_traits.h:469
    #3 0x557383d28091 in std::_Vector_base<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::_M_deallocate(tt::tt_metal::Device::ActiveState*, unsigned long) /usr/include/c++/9/bits/stl_vector.h:351
    #4 0x557383d21a69 in std::_Vector_base<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::~_Vector_base() /usr/include/c++/9/bits/stl_vector.h:332
    #5 0x557383d3e053 in std::vector<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::~vector() /usr/include/c++/9/bits/stl_vector.h:680
    #6 0x7f31248b58a6 in __run_exit_handlers /build/glibc-SzIz7B/glibc-2.31/stdlib/exit.c:108

previously allocated by thread T0 here:
    #0 0x7f3125dcf587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x557383d3b068 in __gnu_cxx::new_allocator<tt::tt_metal::Device::ActiveState>::allocate(unsigned long, void const*) /usr/include/c++/9/ext/new_allocator.h:114
    #2 0x557383d37d5c in std::allocator_traits<std::allocator<tt::tt_metal::Device::ActiveState> >::allocate(std::allocator<tt::tt_metal::Device::ActiveState>&, unsigned long) /usr/include/c++/9/bits/alloc_traits.h:443
    #3 0x557383d31af5 in std::_Vector_base<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::_M_allocate(unsigned long) /usr/include/c++/9/bits/stl_vector.h:343
    #4 0x557383d2a5c4 in std::vector<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::_M_default_append(unsigned long) /usr/include/c++/9/bits/vector.tcc:635
    #5 0x557383d22a1c in std::vector<tt::tt_metal::Device::ActiveState, std::allocator<tt::tt_metal::Device::ActiveState> >::resize(unsigned long) /usr/include/c++/9/bits/stl_vector.h:937
    #6 0x557383d0e1c7 in tt::tt_metal::Device::activate_device_in_list() tt_metal/impl/device/device.cpp:209
    #7 0x557383d0e67b in tt::tt_metal::Device::initialize(std::vector<unsigned int, std::allocator<unsigned int> > const&) tt_metal/impl/device/device.cpp:224
    #8 0x557383d0a983 in tt::tt_metal::Device::Device(int, std::vector<unsigned int, std::allocator<unsigned int> > const&) tt_metal/impl/device/device.cpp:24
    #9 0x7f31252ff90e in tt::tt_metal::CreateDevice(int, std::vector<unsigned int, std::allocator<unsigned int> > const&) tt_metal/tt_metal.cpp:171
    #10 0x557383ae1c62 in main tests/tt_eager/tensors/test_host_device_loopback.cpp:45
    #11 0x7f3124893082 in __libc_start_main ../csu/libc-start.c:308

SUMMARY: AddressSanitizer: double-free ../../../../src/libsanitizer/asan/asan_new_delete.cc:165 in operator delete(void*)
==370509==ABORTING
pgkeller commented 9 months ago

Hmm, interesting. Thanks for the info. I'm guessing this is a build issue w/ how we link in our .so(s).

pgkeller commented 9 months ago

The test links against libtt_metal.so and libtt_metal_impl.a each of which has a copy of device.o, so this is a build/link issue

pgkeller commented 9 months ago

Fixed by not linking against tt_metal_impl twice