tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
300 stars 23 forks source link

argmax hangs with device_mesh #8932

Open yieldthought opened 1 month ago

yieldthought commented 1 month ago

This code snippet hangs on a t3000:

    tt_lib.device.SetDefaultDevice(t3k_device_mesh.get_device(0))
    tt_out_B11B = ttnn.experimental.tensor.argmax(tt_out_11BH, dim=-1)
    tt_out_1B = ttnn.reshape(tt_out_B11B[:1, :, :, :], ttnn.Shape([1, batch_size]))  # [1, 32] Bfloat16

To reproduce:

  1. check out c9c15d228993e4c6c262d117962471e9d61e3f97
  2. pytest models/demos/t3000/mixtral8x7b/tests/test_mixtral_argmax.py

Expected: Test passes, or at least finishes and fails because I messed up something in the still-to-be-executed comparison code.

Observed: Test hangs. Watcher does not detect any hang. Stack shows main thread is waiting for tensor attributes to be populated.

yieldthought commented 1 month ago

@arakhmati @cfjchu Not sure if this is related to t3k / device mesh or not, we noticed it whilst trying to get embeddings and argmax running on-device in the mixtral codebase. It seems weird that argmax doesn't support device_mesh in any case, I figured combining these might be in some way unsupported leading to this hang?

arakhmati commented 1 month ago

@yieldthought please don't ever use tt_lib.device.SetDefaultDevice. Especially, because it's going to have undefined behaviour with multi-device.

Adding @tt-asaigal to help us triage this

yieldthought commented 1 month ago

Yeah makes sense - we need tensor.argmax to support multi-device is the real ask here I guess

cfjchu commented 1 month ago

I'll take a look at this tomorrow!

tt-asaigal commented 4 weeks ago

I've added support to make the mixtral test pass here: https://github.com/tenstorrent/tt-metal/commit/b66f8374d81da8ae8c9e1120e64332df4ea1e461.

The issue here is that we don't support autoformat (i.e. running device ops with inputs/outputs on host) in the multi-device case. The fix was to explicitly perform data-movement, rather than relying on autoformat.

We're also missing sweep tests for this op. We should really add those. I'll make a separate PR with some tests for the multi-device case.

yieldthought commented 4 weeks ago

Wait, are you saying that running tensor.argmax on a tensor that exists on device to get an output on device performs synchronization with the host?

yieldthought commented 4 weeks ago

The argmax test itself passes, but when I try to use argmax in our actual demo code with the same lines I hit this ASSERT:

Always | FATAL | TODO: add support for multi-paged buffer with page size > 64KB

The stack trace shows this is as a result of the new .to call inside argmax:

#6  0x00007fff87a84993 in tt::assert::tt_throw<char [63]> (file=0x7fff86f825f0 "../tt_metal/impl/dispatch/command_queue.cpp", line=1534, assert_type="TT_FATAL", condition_str=0x7fff86f821ac "buffer.num_pages() == 1") at ../tt_metal/common/assert.hpp:156
#7  0x00007fff86ed81bd in tt::tt_metal::HWCommandQueue::enqueue_write_buffer (this=0x6c9adc0, buffer=..., src=0x7ffc3c7c95f0, blocking=false) at ../tt_metal/impl/dispatch/command_queue.cpp:1534
#8  0x00007fff86ed731b in tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::<lambda(auto:48&&)>::operator()<std::shared_ptr<const tt::tt_metal::Buffer>&>(std::shared_ptr<tt::tt_metal::Buffer const> &) const (this=0x6c9adc0, b=
    std::shared_ptr<const class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...}) at ../tt_metal/impl/dispatch/command_queue.cpp:1408
#9  0x00007fff86ee308d in std::__invoke_impl<void, tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)> [with auto:47 = void const*&]::<lambda(auto:48&&)>, std::shared_ptr<const tt::tt_metal::Buffer>&>(std::__invoke_other, tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::<lambda(auto:48&&)> &&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#10 0x00007fff86ee200b in std::__invoke<tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)> [with auto:47 = void const*&]::<lambda(auto:48&&)>, std::shared_ptr<const tt::tt_metal::Buffer>&>(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::<lambda(auto:48&&)> &&) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#11 0x00007fff86ed7366 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)> [with auto:47 = void const*&]::<lambda(auto:48&&)>&&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >&)>, std::tuple<std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >&>, std::integer_sequence<long unsigned int, 1> >::__visit_invoke_impl(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::<lambda(auto:48&&)> &&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer const> > &) (__visitor=..., __vars#0=
    std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer const>> [index 1] containing std::shared_ptr<const class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...}) at /usr/include/c++/9/variant:981
#12 0x00007fff86ed73a5 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)> [with auto:47 = void const*&]::<lambda(auto:48&&)>&&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >&)>, std::tuple<std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >&>, std::integer_sequence<long unsigned int, 1> >::__do_visit_invoke(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::<lambda(auto:48&&)> &&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer const> > &) (__visitor=...,
    __vars#0=std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer const>> [index 1] containing std::shared_ptr<const class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...}) at /usr/include/c++/9/variant:989
#13 0x00007fff86ed73e8 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)> [with auto:47 = void const*&]::<lambda(auto:48&&)>&&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >&)>, std::tuple<std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >&>, std::integer_sequence<long unsigned int, 1> >::__visit_invoke(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::<lambda(auto:48&&)> &&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer const> > &) (__visitor=...,
    __vars#0=std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer const>> [index 1] containing std::shared_ptr<const class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...}) at /usr/include/c++/9/variant:1005
#14 0x00007fff86ed7487 in std::__do_visit<>(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::<lambda(auto:48&&)> &&) (__visitor=...) at /usr/include/c++/9/variant:1652
#15 0x00007fff86ed74e4 in std::visit<tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)> [with auto:47 = void const*&]::<lambda(auto:48&&)>, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >&>(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::<lambda(auto:48&&)> &&) (__visitor=...) at /usr/include/c++/9/variant:1663
#16 0x00007fff86ed754d in tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)>::operator()<void const*&>(const void *&) const (__closure=0x7ffd40ff72d0, data=@0x7ffd40ff7330: 0x7ffc3c7c95f0) at ../tt_metal/impl/dispatch/command_queue.cpp:1403
#17 0x00007fff86ee30cc in std::__invoke_impl<void, tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)>, void const*&>(std::__invoke_other, tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)> &&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#18 0x00007fff86ee2059 in std::__invoke<tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)>, void const*&>(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)> &&) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#19 0x00007fff86ed75a3 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)>&&, std::variant<const std::shared_ptr<std::vector<short unsigned int, std::allocator<short unsigned int> > >, const std::shared_ptr<std::vector<int, std::allocator<int> > >, const std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const std::shared_ptr<std::vector<float, std::allocator<float> > >, const std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, void const*>&)>, std::tuple<std::variant<const std::shared_ptr<std::vector<short unsigned int, std::allocator<short unsigned int> > >, const std::shared_ptr<std::vector<int, std::allocator<int> > >, const std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const std::shared_ptr<std::vector<float, std::allocator<float> > >, const std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, void const*>&>, std::integer_sequence<long unsigned int, 5> >::__visit_invoke_impl(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)> &&, std::variant<std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> > > const, std::shared_ptr<std::vector<int, std::allocator<int> > > const, std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > > const, std::shared_ptr<std::vector<float, std::allocator<float> > > const, std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > > const, void const*> &) (__visitor=...,
    __vars#0=std::variant<const class std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> > >, const class std::shared_ptr<std::vector<int, std::allocator<int> > >, const class std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const class std::shared_ptr<std::vector<float, std::allocator<float> > >, const class std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, const void *> [index 5] = {...}) at /usr/include/c++/9/variant:981
#20 0x00007fff86ed75e2 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)>&&, std::variant<const std::shared_ptr<std::vector<short unsigned int, std::allocator<short unsigned int> > >, const std::shared_ptr<std::vector<int, std::allocator<int> > >, const std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const std::shared_ptr<std::vector<float, std::allocator<float> > >, const std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, void const*>&)>, std::tuple<std::variant<const std::shared_ptr<std::vector<short unsigned int, std::allocator<short unsigned int> > >, const std::shared_ptr<std::vector<int, std::allocator<int> > >, const std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const std::shared_ptr<std::vector<float, std::allocator<float> > >, const std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, void const*>&>, std::integer_sequence<long unsigned int, 5> >::__do_visit_invoke(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)> &&, std::variant<std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> --Type <RET> for more, q to quit, c to continue without paging--
> > const, std::shared_ptr<std::vector<int, std::allocator<int> > > const, std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > > const, std::shared_ptr<std::vector<float, std::allocator<float> > > const, std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > > const, void const*> &) (__visitor=...,
    __vars#0=std::variant<const class std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> > >, const class std::shared_ptr<std::vector<int, std::allocator<int> > >, const class std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const class std::shared_ptr<std::vector<float, std::allocator<float> > >, const class std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, const void *> [index 5] = {...}) at /usr/include/c++/9/variant:989
#21 0x00007fff86ed7625 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)>&&, std::variant<const std::shared_ptr<std::vector<short unsigned int, std::allocator<short unsigned int> > >, const std::shared_ptr<std::vector<int, std::allocator<int> > >, const std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const std::shared_ptr<std::vector<float, std::allocator<float> > >, const std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, void const*>&)>, std::tuple<std::variant<const std::shared_ptr<std::vector<short unsigned int, std::allocator<short unsigned int> > >, const std::shared_ptr<std::vector<int, std::allocator<int> > >, const std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const std::shared_ptr<std::vector<float, std::allocator<float> > >, const std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, void const*>&>, std::integer_sequence<long unsigned int, 5> >::__visit_invoke(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)> &&, std::variant<std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> > > const, std::shared_ptr<std::vector<int, std::allocator<int> > > const, std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > > const, std::shared_ptr<std::vector<float, std::allocator<float> > > const, std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > > const, void const*> &) (__visitor=...,
    __vars#0=std::variant<const class std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> > >, const class std::shared_ptr<std::vector<int, std::allocator<int> > >, const class std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const class std::shared_ptr<std::vector<float, std::allocator<float> > >, const class std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, const void *> [index 5] = {...}) at /usr/include/c++/9/variant:1005
#22 0x00007fff86ed76c5 in std::__do_visit<>(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)> &&) (__visitor=...) at /usr/include/c++/9/variant:1652
#23 0x00007fff86ed7722 in std::visit<tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<const tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:47&&)>, std::variant<const std::shared_ptr<std::vector<short unsigned int, std::allocator<short unsigned int> > >, const std::shared_ptr<std::vector<int, std::allocator<int> > >, const std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const std::shared_ptr<std::vector<float, std::allocator<float> > >, const std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, void const*>&>(tt::tt_metal::HWCommandQueue::<lambda(auto:47&&)> &&) (__visitor=...) at /usr/include/c++/9/variant:1663
#24 0x00007fff86ed7781 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer (this=0x6c9adc0,
    buffer=std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer const>> [index 1] containing std::shared_ptr<const class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...},
    src=std::variant<const class std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> > >, const class std::shared_ptr<std::vector<int, std::allocator<int> > >, const class std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const class std::shared_ptr<std::vector<float, std::allocator<float> > >, const class std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, const void *> [index 5] = {...}, blocking=false) at ../tt_metal/impl/dispatch/command_queue.cpp:1400
#25 0x00007fff86edcc44 in tt::tt_metal::<lambda(auto:55&&)>::operator()<std::shared_ptr<tt::tt_metal::Buffer>&>(std::shared_ptr<tt::tt_metal::Buffer> &) const (__closure=0x7ffd40ff74f0, b=std::shared_ptr<class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...})
    at ../tt_metal/impl/dispatch/command_queue.cpp:2242
#26 0x00007fff86ee343e in std::__invoke_impl<void, tt::tt_metal::EnqueueWriteBufferImpl(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:55&&)>, std::shared_ptr<tt::tt_metal::Buffer>&>(std::__invoke_other, tt::tt_metal::<lambda(auto:55&&)> &&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#27 0x00007fff86ee2698 in std::__invoke<tt::tt_metal::EnqueueWriteBufferImpl(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:55&&)>, std::shared_ptr<tt::tt_metal::Buffer>&>(tt::tt_metal::<lambda(auto:55&&)> &&) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#28 0x00007fff86edcce5 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::EnqueueWriteBufferImpl(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:55&&)>&&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >&)>, std::tuple<std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >&>, std::integer_sequence<long unsigned int, 1> >::__visit_invoke_impl(tt::tt_metal::<lambda(auto:55&&)> &&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> > &) (__visitor=...,
    __vars#0=std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer>> [index 1] containing std::shared_ptr<class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...}) at /usr/include/c++/9/variant:981
#29 0x00007fff86edcd24 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::EnqueueWriteBufferImpl(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:55&&)>&&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >&)>, std::tuple<std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >&>, std::integer_sequence<long unsigned int, 1> >::__do_visit_invoke(tt::tt_metal::<lambda(auto:55&&)> &&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> > &) (__visitor=...,
    __vars#0=std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer>> [index 1] containing std::shared_ptr<class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...}) at /usr/include/c++/9/variant:989
#30 0x00007fff86edcd67 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(tt::tt_metal::EnqueueWriteBufferImpl(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:55&&)>&&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >&)>, std::tuple<std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >&>, std::integer_sequence<long unsigned int, 1> >::__visit_invoke(tt::tt_metal::<lambda(auto:55&&)> &&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> > &) (__visitor=...,
    __vars#0=std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer>> [index 1] containing std::shared_ptr<class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...}) at /usr/include/c++/9/variant:1005
#31 0x00007fff86edce07 in std::__do_visit<>(tt::tt_metal::<lambda(auto:55&&)> &&) (__visitor=...) at /usr/include/c++/9/variant:1652
#32 0x00007fff86edce64 in std::visit<tt::tt_metal::EnqueueWriteBufferImpl(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >, tt::tt_metal::HostDataType, bool)::<lambda(auto:55&&)>, std::variant<std::reference_wrapper<tt::tt_metal::Buffer>, std::shared_ptr<tt::tt_metal::Buffer> >&>(tt::tt_metal::<lambda(auto:55&&)> &&) (__visitor=...) at /usr/include/c++/9/variant:1663
#33 0x00007fff86edced2 in tt::tt_metal::EnqueueWriteBufferImpl (cq=...,
    buffer=std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer>> [index 1] containing std::shared_ptr<class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...},
    src=std::variant<const class std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> > >, const class std::shared_ptr<std::vector<int, std::allocator<int> > >, const class std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const class std::shared_ptr<std::vector<float, std::allocator<float> > >, const class std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, const void *> [index 5] = {...}, blocking=false) at ../tt_metal/impl/dispatch/command_queue.cpp:2237
#34 0x00007fff86ee000e in tt::tt_metal::CommandQueue::run_command_impl (this=0x6ccaf00, command=...) at ../tt_metal/impl/dispatch/command_queue.cpp:2553
#35 0x00007fff86edf93c in tt::tt_metal::CommandQueue::run_command (this=0x6ccaf00, command=...) at ../tt_metal/impl/dispatch/command_queue.cpp:2534
#36 0x00007fff86edc967 in tt::tt_metal::EnqueueWriteBuffer (cq=...,
    buffer=std::variant<class std::reference_wrapper<tt::tt_metal::Buffer>, class std::shared_ptr<tt::tt_metal::Buffer>> [index 1] containing std::shared_ptr<class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...},
    src=std::variant<const class std::shared_ptr<std::vector<unsigned short, std::allocator<unsigned short> > >, const class std::shared_ptr<std::vector<int, std::allocator<int> > >, const class std::shared_ptr<std::vector<unsigned int, std::allocator<unsigned int> > >, const--Type <RET> for more, q to quit, c to continue without paging--c
 class std::shared_ptr<std::vector<float, std::allocator<float> > >, const class std::shared_ptr<std::vector<bfloat16, std::allocator<bfloat16> > >, const void *> [index 5] = {...}, blocking=false) at ../tt_metal/impl/dispatch/command_queue.cpp:2228
#37 0x00007fff88299e59 in tt::tt_metal::tensor_impl::write_data_to_device_buffer<bfloat16, tt::tt_metal::borrowed_buffer::Buffer> (cq=..., host_buffer=..., device_buffer=std::shared_ptr<class tt::tt_metal::Buffer> (use count 6, weak count 0) = {...}) at ../tt_eager/tensor/tensor_impl.hpp:274
#38 0x00007fff8827f2f3 in tt::tt_metal::tensor_impl::initialize_data_on_device<bfloat16, tt::tt_metal::borrowed_buffer::Buffer> (data_to_write=..., device=0x6c54370, shape=..., data_type=tt::tt_metal::DataType::BFLOAT16, layout=tt::tt_metal::Layout::ROW_MAJOR, memory_config=..., shard_spec=std::optional<struct tt::tt_metal::ShardSpecBuffer> [no contained value], queue=std::optional<class std::reference_wrapper<tt::tt_metal::CommandQueue>> [no contained value]) at ../tt_eager/tensor/tensor_impl.hpp:306
#39 0x00007fff8826dc37 in tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}::operator()<tt::tt_metal::OwnedStorage const&>(tt::tt_metal::OwnedStorage const&) const (this=0x7ffd40ff7d20, storage=...) at ../tt_eager/tensor/tensor_impl.hpp:344
#40 0x00007fff88256c38 in std::__invoke_impl<std::shared_ptr<tt::tt_metal::Buffer>, tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}, tt::tt_metal::OwnedStorage const&>(std::__invoke_other, tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}&&, tt::tt_metal::OwnedStorage const&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#41 0x00007fff88249540 in std::__invoke<tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}, tt::tt_metal::OwnedStorage const&>(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}&&, (std::__invoke_result&&)...) (__fn=...) at /usr/include/c++/9/bits/invoke.h:96
#42 0x00007fff8823d51f in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<std::shared_ptr<tt::tt_metal::Buffer> (*)(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}&&, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&)>, std::tuple<std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>, std::integer_sequence<unsigned long, 0ul> >::__visit_invoke_impl(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) (__visitor=..., __vars#0=std::variant<struct tt::tt_metal::OwnedStorage, struct tt::tt_metal::DeviceStorage, struct tt::tt_metal::BorrowedStorage, struct tt::tt_metal::MultiDeviceHostStorage, struct tt::tt_metal::MultiDeviceStorage> [index 0] = {...}) at /usr/include/c++/9/variant:983
#43 0x00007fff8823d596 in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<std::shared_ptr<tt::tt_metal::Buffer> (*)(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}&&, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&)>, std::tuple<std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>, std::integer_sequence<unsigned long, 0ul> >::__do_visit_invoke(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) (__visitor=..., __vars#0=std::variant<struct tt::tt_metal::OwnedStorage, struct tt::tt_metal::DeviceStorage, struct tt::tt_metal::BorrowedStorage, struct tt::tt_metal::MultiDeviceHostStorage, struct tt::tt_metal::MultiDeviceStorage> [index 0] = {...}) at /usr/include/c++/9/variant:990
#44 0x00007fff8823d60d in std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<std::shared_ptr<tt::tt_metal::Buffer> (*)(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}&&, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&)>, std::tuple<std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>, std::integer_sequence<unsigned long, 0ul> >::__visit_invoke(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) (__visitor=..., __vars#0=std::variant<struct tt::tt_metal::OwnedStorage, struct tt::tt_metal::DeviceStorage, struct tt::tt_metal::BorrowedStorage, struct tt::tt_metal::MultiDeviceHostStorage, struct tt::tt_metal::MultiDeviceStorage> [index 0] = {...}) at /usr/include/c++/9/variant:1006
#45 0x00007fff8823dd34 in std::__do_visit<false, true, tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}&&, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) (__visitor=...) at /usr/include/c++/9/variant:1653
#46 0x00007fff8823ddc9 in std::visit<tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&>(tt::tt_metal::tensor_impl::to_device_buffer<bfloat16>(std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::optional<tt::tt_metal::ShardSpecBuffer> const&, std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >)::{lambda(auto:1&&)#1}&&, std::variant<tt::tt_metal::OwnedStorage, tt::tt_metal::DeviceStorage, tt::tt_metal::BorrowedStorage, tt::tt_metal::MultiDeviceHostStorage, tt::tt_metal::MultiDeviceStorage> const&) (__visitor=...) at /usr/include/c++/9/variant:1664
#47 0x00007fff8823ded9 in tt::tt_metal::tensor_impl::to_device_buffer<bfloat16> (storage=std::variant<struct tt::tt_metal::OwnedStorage, struct tt::tt_metal::DeviceStorage, struct tt::tt_metal::BorrowedStorage, struct tt::tt_metal::MultiDeviceHostStorage, struct tt::tt_metal::MultiDeviceStorage> [index 0] = {...}, device=0x6c54370, shape=..., data_type=tt::tt_metal::DataType::BFLOAT16, layout=tt::tt_metal::Layout::ROW_MAJOR, memory_config=..., shard_spec=std::optional<struct tt::tt_metal::ShardSpecBuffer> [no contained value], queue=std::optional<class std::reference_wrapper<tt::tt_metal::CommandQueue>> [no contained value]) at ../tt_eager/tensor/tensor_impl.hpp:355
#48 0x00007fff8821e566 in tt::tt_metal::tensor_impl::to_device<bfloat16> (tensor=..., target_device=0x6c54370, memory_config=..., queue=std::optional<class std::reference_wrapper<tt::tt_metal::CommandQueue>> [no contained value]) at ../tt_eager/tensor/tensor_impl.hpp:471
#49 0x00007fff88212f8a in _ZZN2tt8tt_metal11tensor_impl17to_device_wrapperIJRNS0_6TensorERPNS0_6DeviceERKNS0_12MemoryConfigERKSt9nullopt_tEEEDaDpOT_ENKUlDpOT0_E_clI8bfloat16JS4_S7_SA_SD_EEEDaSJ_ (this=0x7ffd40ff823f) at ../tt_eager/tensor/tensor_impl_wrapper.hpp:42
#50 0x00007fff8821349f in _ZN2tt8tt_metal11tensor_impl8dispatchIZNS1_17to_device_wrapperIJRNS0_6TensorERPNS0_6DeviceERKNS0_12MemoryConfigERKSt9nullopt_tEEEDaDpOT_EUlDpOT0_E_JS5_S8_SB_SE_EEEDaNS0_8DataTypeEOT_SK_ (dtype=tt::tt_metal::DataType::BFLOAT16, func=...) at ../tt_eager/tensor/tensor_impl_wrapper.hpp:15
#51 0x00007fff88213721 in tt::tt_metal::tensor_impl::to_device_wrapper<tt::tt_metal::Tensor&, tt::tt_metal::Device*&, tt::tt_metal::MemoryConfig const&, std::nullopt_t const&> () at ../tt_eager/tensor/tensor_impl_wrapper.hpp:42
#52 0x00007fff881f4bd4 in tt::tt_metal::Tensor::<lambda()>::operator()(void) (__closure=0x7ffc3c5eb0e0) at ../tt_eager/tensor/tensor.cpp:416
#53 0x00007fff882029fe in std::_Function_handler<void(), tt::tt_metal::Tensor::to(tt::tt_metal::Device*, const tt::tt_metal::MemoryConfig&) const::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#54 0x00007fff88faef8c in std::function<void ()>::operator()() const (this=0x7ffd40ff84b0) at /usr/include/c++/9/bits/std_function.h:688
#55 0x00007fff86deac72 in tt::WorkExecutor::push_work(std::function<void ()> const&, bool) (this=0x6c54600, work_executor=..., blocking=false) at ../tt_metal/impl/dispatch/work_executor.hpp:139
#56 0x00007fff86de819d in tt::tt_metal::Device::push_work(std::function<void ()>&&, bool) (this=0x6c54370, work=..., blocking=false) at ../tt_metal/impl/device/device.cpp:1679
#57 0x00007fff881f4e70 in tt::tt_metal::Tensor::to (this=0x7ffd40ff8800, target_device=0x6c54370, mem_config=...) at ../tt_eager/tensor/tensor.cpp:405
#58 0x00007fff87fbe8db in tt::tt_metal::<lambda(const std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> >&, const std::vector<std::optional<const tt::tt_metal::Tensor>, std::allocator<std::optional<const tt::tt_metal::Tensor> > >&, const std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > >&)>::operator()(const std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > &, const std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > &, const std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > > &) (__closure=0xa816260, input_tensors=std::vector of length 1, capacity 1 = {...}, optional_input_tensors=std::vector of length 0, capacity 0, optional_output_tensors=std::vector of length 0, capacity 0) at ../tt_eager/tt_dnn/op_library/composite/composite_ops.cpp:1613
#59 0x00007fff87fc773f in std::_Function_handler<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> >(const std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> >&, const std::vector<std::optional<const tt::tt_metal::Tensor>, std::allocator<std::optional<const tt::tt_metal::Tensor> > >&, const std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > >&), tt::tt_metal::_argmax(const tt::tt_metal::Tensor&, int64_t, bool, const tt::tt_metal::MemoryConfig&)::<lambda(const std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> >&, const std::vector<std::optional<const tt::tt_metal::Tensor>, std::allocator<std::optional<const tt::tt_metal::Tensor> > >&, const std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > >&)> >::_M_invoke(const std::_Any_data &, const std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > &, const std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > &, const std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > > &) (__functor=..., __args#0=std::vector of length 1, capacity 1 = {...}, __args#1=std::vector of length 0, capacity 0, __args#2=std::vector of length 0, capacity 0) at /usr/include/c++/9/bits/std_function.h:286
#60 0x00007fff8808cba5 in std::function<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > (std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > const&, std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > const&, std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > > const&)>::operator()(std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > const&, std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > const&, std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > > const&) const (this=0xa8ae968, __args#0=std::vector of length 1, capacity 1 = {...}, __args#1=std::vector of length 0, capacity 0, __args#2=std::vector of length 0, capacity 0) at /usr/include/c++/9/bits/std_function.h:688
#61 0x00007fff88079e9c in tt::tt_metal::operation::<lambda(tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device *) (__closure=0xa8ae960, target_device=0x6c54370) at ../tt_eager/tt_dnn/op_library/run_operation.cpp:792
#62 0x00007fff880832fe in std::_Function_handler<void(tt::tt_metal::Device*), tt::tt_metal::operation::launch_op(std::function<std::vector<tt::tt_metal::Tensor>(const std::vector<tt::tt_metal::Tensor>&, const std::vector<std::optional<const tt::tt_metal::Tensor> >&, const std::vector<std::optional<tt::tt_metal::Tensor> >&)>&&, tt::tt_metal::operation::Tensors, tt::tt_metal::operation::Tensors&, tt::tt_metal::operation::OptionalConstTensors, tt::tt_metal::operation::OptionalTensors, bool)::<lambda(tt::tt_metal::Device*)> >::_M_invoke(const std::_Any_data &, tt::tt_metal::Device *&&) (__functor=..., __args#0=@0x7ffd40ff8d30: 0x6c54370) at /usr/include/c++/9/bits/std_function.h:300
#63 0x00007fff8808d6db in std::function<void (tt::tt_metal::Device*)>::operator()(tt::tt_metal::Device*) const (this=0xa528ee0, __args#0=0x6c54370) at /usr/include/c++/9/bits/std_function.h:688
#64 0x00007fff8807a84e in tt::tt_metal::operation::<lambda()>::operator()(void) (__closure=0xa1cd470) at ../tt_eager/tt_dnn/op_library/run_operation.cpp:833
#65 0x00007fff8808345e in std::_Function_handler<void(), tt::tt_metal::operation::launch_op(std::function<std::vector<tt::tt_metal::Tensor>(const std::vector<tt::tt_metal::Tensor>&, const std::vector<std::optional<const tt::tt_metal::Tensor> >&, const std::vector<std::optional<tt::tt_metal::Tensor> >&)>&&, tt::tt_metal::operation::Tensors, tt::tt_metal::operation::Tensors&, tt::tt_metal::operation::OptionalConstTensors, tt::tt_metal::operation::OptionalTensors, bool)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#66 0x00007fff88faef8c in std::function<void ()>::operator()() const (this=0x96359a0) at /usr/include/c++/9/bits/std_function.h:688
#67 0x00007fff86deab62 in tt::WorkExecutor::run_worker (this=0x6c54600) at ../tt_metal/impl/dispatch/work_executor.hpp:129
#68 0x00007fff86e355ec in std::__invoke_impl<void, void (tt::WorkExecutor::*)(), tt::WorkExecutor*> (__f=@0x6bdf0a0: (void (tt::WorkExecutor::*)(class tt::WorkExecutor * const)) 0x7fff86deaab8 <tt::WorkExecutor::run_worker()>, __t=@0x6bdf098: 0x6c54600) at /usr/include/c++/9/bits/invoke.h:73
#69 0x00007fff86e3551a in std::__invoke<void (tt::WorkExecutor::*)(), tt::WorkExecutor*> (__fn=@0x6bdf0a0: (void (tt::WorkExecutor::*)(class tt::WorkExecutor * const)) 0x7fff86deaab8 <tt::WorkExecutor::run_worker()>) at /usr/include/c++/9/bits/invoke.h:95
#70 0x00007fff86e35479 in std::thread::_Invoker<std::tuple<void (tt::WorkExecutor::*)(), tt::WorkExecutor*> >::_M_invoke<0ul, 1ul> (this=0x6bdf098) at /usr/include/c++/9/thread:244
#71 0x00007fff86e352ec in std::thread::_Invoker<std::tuple<void (tt::WorkExecutor::*)(), tt::WorkExecutor*> >::operator() (this=0x6bdf098) at /usr/include/c++/9/thread:251
#72 0x00007fff86e35164 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tt::WorkExecutor::*)(), tt::WorkExecutor*> > >::_M_run (this=0x6bdf090) at /usr/include/c++/9/thread:195
yieldthought commented 4 weeks ago

The buffer:

(gdb) p buffer
$1 = (const tt::tt_metal::Buffer &) @0x7ffc3c2c82b0: {
  _vptr.Buffer = 0x7fff870ddc70 <vtable for tt::tt_metal::Buffer+16>,
  device_ = 0x6c54370,
  size_ = 2048000,
  address_ = 81988384,
  page_size_ = 64000,
  buffer_type_ = tt::tt_metal::BufferType::DRAM,
  buffer_layout_ = tt::tt_metal::TensorMemoryLayout::INTERLEAVED,
  shard_parameters_ = std::optional<tt::tt_metal::ShardSpecBuffer> [no contained value]
}

Not strictly >64KB ^^

The buffer has 32 pages.

yieldthought commented 4 weeks ago

Hmmm.....

1530            bool write_partial_pages = padded_page_size > max_data_sizeB;
1531            uint32_t page_size_to_write = padded_page_size;
1532            uint32_t padded_buffer_size = buffer.num_pages() * padded_page_size;
1533            if (write_partial_pages) {
1534                TT_FATAL(buffer.num_pages() == 1, "TODO: add support for multi-paged buffer with page size > 64KB");
1535                uint32_t partial_size = dispatch_constants::BASE_PARTIAL_PAGE_SIZE;
1536                while (padded_buffer_size % partial_size != 0) {
1537                    partial_size += PCIE_ALIGNMENT;
1538                }
(gdb) p total_pages_to_write
$2 = 32
(gdb) p max_data_sizeB
$3 = 32704
(gdb) p padded_page_size
$4 = 64000
tt-asaigal commented 3 weeks ago

Wait, are you saying that running tensor.argmax on a tensor that exists on device to get an output on device performs synchronization with the host?

Hey Mark, apologies for the late response. argmax is implemented as a composite op in TTNN. This means that its a sequence of more primitive ops that get sequentially scheduled on the device. Please see composite_ops.cpp::_argmax for the implementation details. There should be no device <--> host synchronization here, although host overhead is significantly increased due to the number of ops being sent to device.

Wrt the assert you're seeing - I'm not entirely sure what's causing this. @pgkeller any ideas what this means, and why it may be happening? The buffer size and config are mentioned above.

P.S. argmax is currently broken on main and is being tracked at https://github.com/tenstorrent/tt-metal/issues/9093. Please revert the commit in this issue for testing/debug.