tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Int overflow in create_device_tensor #12096

Closed cglagovichTT closed 2 months ago

cglagovichTT commented 2 months ago

When running Llama3 prefill on TG with a sequence length > 32k, we see an int overflow when the buffer for the lm head output is allocated.

Issue

I see the problem in line_all_gather when the op allocates its output. It appears that tt-metal thinks my 4 GB tensor on device contains 0 bytes. In the buffer allocation code, some sizes are size_t and some are uint32_t, which is likely the issue. I was also able to reproduce it in a unit test on a single device.

Repro

branch cglagovich/12096

pytest -svv tests/ttnn/unit_tests/test_as_tensor.py::test_as_tensor
import pytest
import ttnn


@pytest.mark.parametrize("height", [2**15])
@pytest.mark.parametrize("width", [2**16])
def test_as_tensor(device, height, width):

    memory_config = ttnn.DRAM_MEMORY_CONFIG

    reshard_out = ttnn.allocate_tensor_on_device(
        ttnn.Shape([1, 1, height, width]),
        ttnn.bfloat16,
        ttnn.TILE_LAYOUT,
        device,
        memory_config,
    )

Error

tests/ttnn/unit_tests/test_as_tensor.py:23:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = FastOperation(python_fully_qualified_name='ttnn.allocate_tensor_on_device', function=<built-in method allocate_tensor_...function default_postprocess_golden_function_outputs at 0x7f684fd0a550>, is_cpp_operation=False, is_experimental=False), function_args = (ttnn.Shape([1, 1, 32768, 65536]), <DataType.BFLOAT16: 0>, <Layout.TILE: 1>, <ttnn._ttnn.device.Device object at 0x7f6...d70>, MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)), function_kwargs = {}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_FATAL @ ../tt_metal/impl/buffers/buffer.cpp:31: size != 0 and page_size != 0
E       info:
E       Buffer size and page size should be larger than 0 bytes!
E       backtrace:
E        --- tt::tt_metal::validate_buffer_size_and_page_size(unsigned long, unsigned long, tt::tt_metal::BufferType const&, tt::tt_metal::TensorMemoryLayout const&, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&)
E        --- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, std::__1::optional<bool>, bool)
E        --- /proj_sw/user_dev/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so(+0xeacbee) [0x7f6851c61bee]
E        --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&)
E        --- tt::tt_metal::create_device_tensor(tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
E        --- /proj_sw/user_dev/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so(+0xec0100) [0x7f6851c75100]
E        --- tt::tt_metal::allocate_tensor_on_device(ttnn::types::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
E        --- ttnn::operations::core::allocate_tensor_on_device(ttnn::types::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, std::__1::optional<tt::tt_metal::MemoryConfig> const&)
E        --- /proj_sw/user_dev/cglagovich/tt-metal/ttnn/ttnn/_ttnn.so(+0x1195fd9) [0x7f6851f4afd9]

When we hit the error in our model tests, this is the error message:

models/demos/tg/llama3_70b/tt/llama_common.py:12: in tt_all_reduce
    gathered_tensor = ttnn.line_all_gather(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = FastOperation(python_fully_qualified_name='ttnn.line_all_gather', function=<ttnn._ttnn.operations.ccl.line_all_gather_...uts=None, golden_function=None, postprocess_golden_function_outputs=None, is_cpp_operation=True, is_experimental=False)
function_args = (device_id:11
ttnn.Tensor([[[[ 0.57031, -0.04004,  ..., -1.10938,  0.26758],
               [-0.23047,  0.68359,  ...,...0,  ...,  0.00000,  0.00000]]]], shape=Shape([1, 1, 32768, 16384]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
, 0)
function_kwargs = {'cluster_axis': 1, 'device_mesh': <ttnn._ttnn.multi_device.DeviceMesh object at 0x7f8ec8734330>, 'num_links': 2}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_FATAL @ ../tt_metal/impl/buffers/buffer.cpp:31: size != 0 and page_size != 0
E       info:
E       Buffer size and page size should be larger than 0 bytes!
cglagovichTT commented 2 months ago

@ayerofieiev-tt could you please help me assign to the right person?

SeanNijjar commented 2 months ago

Hey @cglagovichTT, could you point out the place in the allocator path you were referring to that uses uint32_t? That's a little scary, since we're pretty close to that limit for device memory banks (especially if we ever decide to merge address ranges across banks for future hw archs).

cglagovichTT commented 2 months ago

tensor.cpp: I left comments in this code section where I believe we have the wrong integer types.

Tensor create_device_tensor(
    const Shape& shape, DataType data_type, Layout layout, Device* device, const MemoryConfig& memory_config) {
    ZoneScoped;
    GraphTracker::instance().track_function_start("tt::tt_metal::create_device_tensor", shape, data_type, layout, device, memory_config);
    if (memory_config.is_sharded()) {
        TT_ASSERT(memory_config.shard_spec.has_value());

        auto& shard_spec = memory_config.shard_spec.value();
        auto& shard_shape = shard_spec.shape;

        auto width = shape[-1];
        auto other_dims = 1; // <---- This is likely inferred as int, so it would overflow for large tensor volumes
        for (int i = 0; i < shape.rank() - 1; i++) {
            other_dims *= shape[i];
        }

        auto element_size = tensor_impl::element_size_bytes(data_type);
        auto page_shape = tensor_impl::get_sharded_page_shape(layout, data_type, shard_spec.shape);
        std::array<uint32_t, 2> tensor2d_size = {other_dims / page_shape[0], width / page_shape[1]};
        ShardSpecBuffer shard_spec_buffer(shard_spec, page_shape, tensor2d_size);
        uint32_t packed_size_in_bytes = // <--- size bytes is uint32_t
            tensor_impl::packed_buffer_size_bytes_wrapper(data_type, compute_buffer_size(shape, data_type));
        auto device_buffer = tensor_impl::allocate_buffer_on_device(
            packed_size_in_bytes, device, shape, data_type, layout, memory_config, shard_spec_buffer);

        auto output = Tensor(DeviceStorage{device_buffer}, shape, data_type, layout);
        output = tt::tt_metal::set_tensor_id(output);
        GraphTracker::instance().track_function_end(output);
        return output;
    } else {
        uint32_t packed_size_in_bytes =
            tensor_impl::packed_buffer_size_bytes_wrapper(data_type, compute_buffer_size(shape, data_type));
        auto device_buffer = tensor_impl::allocate_buffer_on_device(
            packed_size_in_bytes, device, shape, data_type, layout, memory_config);
        auto output = Tensor(DeviceStorage{device_buffer}, shape, data_type, layout);
        output = tt::tt_metal::set_tensor_id(output);
        GraphTracker::instance().track_function_end(output);
        return output;
    }
}

tensor_impl_wrapper.hpp: the wrapper takes the unpacked volume as a uint32_t and returns the packed size in bytes as a uint32_t.

inline uint32_t packed_buffer_size_bytes_wrapper(DataType dtype, uint32_t volume_unpacked_data) {
    return dispatch(dtype, AS_LAMBDA(packed_buffer_size_bytes), volume_unpacked_data);
}

Note that I tried addressing a few of my own comments in these areas where I see potential issues, but that did not make my test pass.

cglagovichTT commented 2 months ago

Tempted to assign this P0 priority because the workarounds in the model are not great - they make demo code more complex and less robust. Is anyone assigned to this issue?

abhullar-tt commented 2 months ago

@cglagovichTT I pushed a change to cglagovich/12096. Please let me know if this works for you, and then I'll open up a PR to merge it to main.

cglagovichTT commented 2 months ago

@abhullar-tt if it makes the unit test pass then that's a success! I'd say it's good to merge