tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

Unable to handle large tensors #4004

Open arakhmati opened 7 months ago

arakhmati commented 7 months ago

Unable to move large tensors to device. In other cases, when large tensors are created by operations on device, we are unable to move them back to host.

```python
import torch
import ttnn

# assert_with_pcc comes from the repo's test utilities
from tests.ttnn.utils_for_testing import assert_with_pcc


def test_large_slicing(device):
    torch_a = torch.rand((1, 1, 42, 250880), dtype=torch.bfloat16)
    torch_output = torch_a[:, :, -1, :]
    a = ttnn.from_torch(torch_a)
    a = ttnn.to_device(a, device)
    tt_output = a[:, :, -1, :]
    tt_output = ttnn.from_device(tt_output)
    tt_output = ttnn.to_torch(tt_output)
    assert_with_pcc(torch_output, tt_output, 0.9999)
```

Moving a large tensor to host with `ttl_tensor.cpu()` causes:


```
Exception has occurred: RuntimeError
TT_ASSERT @ tt_metal/impl/dispatch/command_queue.cpp:317: padded_page_size <= consumer_cb_size
info:
Page is too large to fit in consumer buffer
backtrace:
 --- void tt::assert::tt_assert<char [44]>(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, char const*, char const (&) [44])
 --- tt::tt_metal::EnqueueReadBufferCommand::assemble_device_command(unsigned int)
 --- tt::tt_metal::EnqueueReadBufferCommand::process()
 --- tt::tt_metal::CommandQueue::enqueue_command(tt::tt_metal::Command&, bool)
 --- tt::tt_metal::CommandQueue::enqueue_read_buffer(tt::tt_metal::Buffer&, std::vector<unsigned int, std::allocator<unsigned int> >&, bool)
 --- tt::tt_metal::EnqueueReadBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer&, std::vector<unsigned int, std::allocator<unsigned int> >&, bool)
 --- std::vector<bfloat16, std::allocator<bfloat16> > tt::tt_metal::tensor_impl::read_data_from_device<bfloat16>(tt::tt_metal::Tensor const&, unsigned int)
 --- /home/ubuntu/git/tt-metal/tt_eager/tt_lib/_C.so(+0x925f65) [0x7f1665e60f65]
 --- std::_Function_handler<tt::tt_metal::Tensor (tt::tt_metal::Tensor const&), tt::tt_metal::Tensor (*)(tt::tt_metal::Tensor const&)>::_M_invoke(std::_Any_data const&, tt::tt_metal::Tensor const&)
 --- std::function<tt::tt_metal::Tensor (tt::tt_metal::Tensor const&)>::operator()(tt::tt_metal::Tensor const&) const
 --- tt::tt_metal::tensor_impl::to_host_wrapper(tt::tt_metal::Tensor const&)
 --- tt::tt_metal::Tensor::cpu() const
```

jvasilje commented 6 months ago

@arakhmati is this blocking performant Bloom?

arakhmati commented 6 months ago

@jvasilje It's not blocking Bloom for question answering, but it does block Bloom for causal LM, which we aren't measuring right now.

arakhmati commented 6 months ago

@DrJessop the error that we see in Bloom is:

```
E       RuntimeError: TT_ASSERT @ tt_metal/impl/dispatch/command_queue.cpp:755: read_buffer_command_size <= DeviceCommand::HUGE_PAGE_SIZE - CQ_START
E       info:
E       EnqueueReadBuffer command is too large
```

To reproduce:

DrJessop commented 6 months ago

Two things:

  1. The dispatch core is double-buffered, and we need to ensure that a page can fit in the ~450KB region in the dispatch core's SRAM. In row-major layout, recall that the page size is the row size * sizeof(data_type), whereas in tile layout the page size is shape-independent. A clean fix for this issue is not easy, but I can provide some ideas if this is something we know we want to support.
  2. We have a limitation that our data needs to fit inside a 1GB hugepage. If it does not, then we throw the error that Akhmed has seen (because the device command + data size > 1GB). A temporary fix would be to increase the hugepage size, but a proper fix would be to run multiple EnqueueReadBuffer commands, each reading out a subset of the desired output.

@davorchap @jvasilje I believe 2 to be much higher priority than 1, and I believe Moreh also wanted the ability to write/read given some offsets into/out of a buffer.
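For illustration, the page-size arithmetic behind point 1 can be sketched as follows. This is a hypothetical sketch, not tt-metal code: the 440320 B consumer CB size is the value reported later in this thread, bfloat16 is 2 bytes per element, and tiles are 32x32.

```python
# Sketch of the dispatch-core page-size constraint (illustrative constants).
CONSUMER_CB_SIZE = 440320  # bytes available per page in the dispatch core CB


def row_major_page_size(shape, elem_size):
    # Row-major layout: one page is one row, so page size scales with
    # the last dimension of the tensor.
    return shape[-1] * elem_size


def tile_page_size(elem_size, tile_dim=32):
    # Tile layout: one page is one 32x32 tile, independent of tensor shape.
    return tile_dim * tile_dim * elem_size


shape = (1, 1, 42, 250880)  # the failing tensor from this issue, in bfloat16
print(row_major_page_size(shape, 2))  # 501760 B -> exceeds the 440320 B CB
print(tile_page_size(2))              # 2048 B   -> fits comfortably
```

This matches the failure mode reported in the thread: the row-major page for a 250880-wide bfloat16 row is larger than the dispatch core's consumer buffer, while a tile-layout page would fit regardless of tensor shape.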

jvasilje commented 6 months ago

comment: this will be a blocker for Bloom in LLM mode, but not for Q&A Bloom variant.

davorchap commented 6 months ago

@DrJessop splitting a user buffer into multiple EnqueueReadBuffer commands under the hood -- is this the feature we need to unblock?
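One way to sketch that splitting idea: break the requested byte range into chunks that each fit under the per-command size cap, and issue one read per chunk at the corresponding offset. The helper name and cap value below are illustrative, not tt-metal's actual constants or API.

```python
# Hypothetical sketch: split one large read into several smaller
# EnqueueReadBuffer-style commands, each under a size cap.
MAX_READ_BYTES = 1 << 20  # illustrative per-command limit


def split_read(total_bytes, max_bytes=MAX_READ_BYTES):
    """Return (offset, length) pairs covering [0, total_bytes)."""
    chunks = []
    offset = 0
    while offset < total_bytes:
        length = min(max_bytes, total_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


# Each (offset, length) pair would become its own read command writing
# into the user buffer at that offset.
print(split_read(10, max_bytes=4))  # [(0, 4), (4, 4), (8, 2)]
```

This also lines up with the read/write-at-offset capability mentioned above as something Moreh wanted.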

DrJessop commented 6 months ago

@davorchap @abhullar-tt actually supports this in her completion queue PR.

davorchap commented 6 months ago

> @davorchap @abhullar-tt actually supports this in her completion queue PR.

@abhullar-tt this would be great, let us know if your changes make this pass

abhullar-tt commented 6 months ago

The test runs, but the expected text is not the same as the generated text:

```
>       assert expected_generated_text == generated_text
E       assert 'Hello, my dog is cute and sweet. He loves to play with me and' == 'Hello, my dog is cute.\nong song"\n\n"?"?'
E         + Hello, my dog is cute and sweet. He loves to play with me and
E         - Hello, my dog is cute.
E         - ong song"
E         - 
E         - "?"?
```
arakhmati commented 6 months ago

> The test runs, but the expected text is not the same as the generated text:
>
> ```
> >       assert expected_generated_text == generated_text
> E       assert 'Hello, my dog is cute and sweet. He loves to play with me and' == 'Hello, my dog is cute.\nong song"\n\n"?"?'
> ```

I added a standalone unit test for the large matmul (on the same branch):

pytest "tests/ttnn/unit_tests/test_matmul.py::test_matmul_with_large_n"
abhullar-tt commented 6 months ago
> pytest "tests/ttnn/unit_tests/test_matmul.py::test_matmul_with_large_n"

I rebased the branch and pushed earlier; I don't think the test was included. Sorry, do you mind pushing it again?

arakhmati commented 6 months ago
> pytest "tests/ttnn/unit_tests/test_matmul.py::test_matmul_with_large_n"
>
> I rebased the branch and pushed earlier; I don't think the test was included. Sorry, do you mind pushing it again?

It's on the branch. Can you run `git pull --rebase`? And/or do a hard reset.

abhullar-tt commented 6 months ago
> pytest "tests/ttnn/unit_tests/test_matmul.py::test_matmul_with_large_n"

Ah, this test fails because the page size (501760 B) is too large to fit into the dispatch core CB (440320 B).

jliangTT commented 3 months ago

@arakhmati, is there a current use case that is blocked by this?

arakhmati commented 3 months ago

As far as I know, this is only a problem in the ttnn implementation of the Bloom model for the CausalLM application. I don't think there are other use cases that we have encountered.

jliangTT commented 3 months ago

Given the limited use case, with Bloom being less of a priority, I think we can downgrade from P1 to P3. Do you think differently?

arakhmati commented 3 months ago

> Given the limited use case, with Bloom being less of a priority, I think we can downgrade from P1 to P3. Do you think differently?

Yes, we can downgrade to P3.