Closed cglagovichTT closed 2 months ago
@ayerofieiev-tt could you please help me assign to the right person?
Hey @cglagovichTT could you point out where in the allocator path you saw uint32_t being used? That's a little scary, since we're pretty close to that limit for device memory banks (especially if we ever decide to merge address ranges across banks for future hw archs)
tensor.cpp
I left comments in this code section where I believe we're using the wrong data formats.
Tensor create_device_tensor(
    const Shape& shape, DataType data_type, Layout layout, Device* device, const MemoryConfig& memory_config) {
    ZoneScoped;
    GraphTracker::instance().track_function_start(
        "tt::tt_metal::create_device_tensor", shape, data_type, layout, device, memory_config);
    if (memory_config.is_sharded()) {
        TT_ASSERT(memory_config.shard_spec.has_value());
        auto& shard_spec = memory_config.shard_spec.value();
        auto& shard_shape = shard_spec.shape;

        auto width = shape[-1];
        auto other_dims = 1;  // <---- This is likely inferred as int, so it would overflow for large tensor volumes
        for (int i = 0; i < shape.rank() - 1; i++) {
            other_dims *= shape[i];
        }

        auto element_size = tensor_impl::element_size_bytes(data_type);
        auto page_shape = tensor_impl::get_sharded_page_shape(layout, data_type, shard_spec.shape);
        std::array<uint32_t, 2> tensor2d_size = {other_dims / page_shape[0], width / page_shape[1]};
        ShardSpecBuffer shard_spec_buffer(shard_spec, page_shape, tensor2d_size);
        uint32_t packed_size_in_bytes =  // <--- size in bytes is uint32_t
            tensor_impl::packed_buffer_size_bytes_wrapper(data_type, compute_buffer_size(shape, data_type));
        auto device_buffer = tensor_impl::allocate_buffer_on_device(
            packed_size_in_bytes, device, shape, data_type, layout, memory_config, shard_spec_buffer);
        auto output = Tensor(DeviceStorage{device_buffer}, shape, data_type, layout);
        output = tt::tt_metal::set_tensor_id(output);
        GraphTracker::instance().track_function_end(output);
        return output;
    } else {
        uint32_t packed_size_in_bytes =
            tensor_impl::packed_buffer_size_bytes_wrapper(data_type, compute_buffer_size(shape, data_type));
        auto device_buffer = tensor_impl::allocate_buffer_on_device(
            packed_size_in_bytes, device, shape, data_type, layout, memory_config);
        auto output = Tensor(DeviceStorage{device_buffer}, shape, data_type, layout);
        output = tt::tt_metal::set_tensor_id(output);
        GraphTracker::instance().track_function_end(output);
        return output;
    }
}
tensor_impl_wrapper.hpp
Takes the unpacked volume as a uint32_t and returns the size in bytes as a uint32_t.
inline uint32_t packed_buffer_size_bytes_wrapper(DataType dtype, uint32_t volume_unpacked_data) {
    return dispatch(dtype, AS_LAMBDA(packed_buffer_size_bytes), volume_unpacked_data);
}
Note that I tried addressing a few of my own comments in these areas where I see potential issues, but that did not make my test pass.
Tempted to assign this P0 priority because the workarounds in the model are not great - they make demo code more complex and less robust. Is anyone assigned to this issue?
@cglagovichTT I pushed a change to cglagovich/12096
please let me know if this works for you and then I'll open up a PR to merge this to main
@abhullar-tt if it makes the unit test pass then that's a success! I'd say it's good to merge
When running Llama3 prefill on TG with a sequence length > 32k, we see an int overflow when the buffer for the lm head output is allocated.
Issue
I see the problem in line_all_gather when the op allocates its output. It appears that tt-metal thinks my 4GB tensor on device contains 0 bytes. In the buffer allocation code, some sizes are size_t and some are uint32_t, which is likely the issue. I was also able to repro in a unit test on a single device.
Repro
Branch: cglagovich/12096
Error
When we see the error in our model tests, the following is the error message.