Open swimdi opened 3 weeks ago
swin_b also has a similar issue with aten.masked_fill.Scalar; the error message is:
self = FastOperation(python_fully_qualified_name='ttnn.to_device', function=<built-in method to_device of PyCapsule object at...function default_postprocess_golden_function_outputs at 0x7fe6d1be9d30>, is_cpp_operation=False, is_experimental=False)
function_args = (ttnn.Tensor([[[ 0.00000, 0.00000, ..., 0.00000, 0.00000],
[ 0.00000, 0.00000, ..., 0.00000, 0.0...64, 49, 49]), dtype=DataType::BFLOAT16, layout=Layout::ROW_MAJOR), <ttnn._ttnn.device.Device object at 0x7fe69fa929f0>)
function_kwargs = {'memory_config': MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)}
def __call__(self, *function_args, **function_kwargs):
> return self.function(*function_args, **function_kwargs)
E RuntimeError: TT_FATAL @ /tmp/build-via-sdist-n2_feo7u/metal_libs-0.53.0rc28+wormhole.b0/tt_metal/impl/buffers/buffer.cpp:52: page_size % sizeof(uint32_t) == 0
The steps to reproduce involve this input variation in to_tt_guard.py:
["Tensor<[64, 49, 49]> self = ?", "Tensor<[64, 49, 49]> mask = ?", "number value = -100.0"],
and this case in test_torchvision_image_classification.py:
[("swin_b", "Swin_B_Weights"), "eval"],
Run with:
pytest tests/models/torchvision/test_torchvision_image_classification.py
From the dump of the model IR, I think the cause is that we are trying to load a (1, 1, 19, 19) tensor to the device in ROW_MAJOR layout. FWIW, this is not supported in ttnn because each page needs to align to sizeof(uint32_t), and in ROW_MAJOR layout each row is one page, so we are trying to create a buffer with a page size of 19 * 2 = 38 bytes.
ttnn_from_torch_3 = ttnn_decorators_ttnn_from_torch(arg390_1, layout = ttnn_ROW_MAJOR_LAYOUT, dtype = ttnn_bfloat16); arg390_1 = None
ttnn_reshape_3 = ttnn_decorators_ttnn_reshape(ttnn_from_torch_3, (1, 1, 19)); ttnn_from_torch_3 = None
ttnn_from_device_1 = ttnn_decorators_ttnn_from_device(ttnn_reshape_3); ttnn_reshape_3 = None
ttnn_to_layout_1 = ttnn_decorators_ttnn_to_layout(ttnn_from_device_1, ttnn_ROW_MAJOR_LAYOUT); ttnn_from_device_1 = None
ttnn_reshape_4 = ttnn_decorators_ttnn_reshape(ttnn_to_layout_1, (1, 1, 1, 19)); ttnn_to_layout_1 = None
ttnn_to_torch_1 = ttnn_decorators_ttnn_to_torch(ttnn_reshape_4); ttnn_reshape_4 = None
expand_default_1 = torch.ops.aten.expand.default(ttnn_to_torch_1, [1, 1, 19, 19]); ttnn_to_torch_1 = None
rsub_scalar = torch.ops.aten.rsub.Scalar(expand_default_1, 1.0); expand_default_1 = None
_to_copy_default_2 = torch.ops.aten._to_copy.default(rsub_scalar, dtype = torch.bool)
ttnn_from_torch_4 = ttnn_decorators_ttnn_from_torch(_to_copy_default_2, layout = ttnn_ROW_MAJOR_LAYOUT, dtype = ttnn_bfloat16, device = ttnn_Specified_Device); _to_copy_default_2 = None
Note that I think all tensors before the last ttnn_from_torch_4 = ttnn_decorators_ttnn_from_torch call are on host rather than on device.
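The page-size arithmetic in this comment can be checked with a small sketch. It assumes, as described in this thread, that each innermost row of a ROW_MAJOR tensor is one page and that a page must be a multiple of sizeof(uint32_t) = 4 bytes. The helper names are hypothetical and are not part of ttnn.

```python
# Hypothetical helpers, not ttnn APIs: they only mirror the arithmetic in this thread.
UINT32_BYTES = 4    # buffer data is packed as uint32_t
BFLOAT16_BYTES = 2  # element size of bfloat16


def row_major_page_size(shape, element_size=BFLOAT16_BYTES):
    """Bytes in one page, assuming one page per innermost row."""
    return shape[-1] * element_size


def is_page_aligned(shape, element_size=BFLOAT16_BYTES):
    """True if the assumed page size is a multiple of sizeof(uint32_t)."""
    return row_major_page_size(shape, element_size) % UINT32_BYTES == 0


# The two failing shapes from this thread, plus an even-width shape for contrast.
for shape in [(1, 1, 19, 19), (64, 49, 49), (1, 1, 19, 20)]:
    print(shape, row_major_page_size(shape), is_page_aligned(shape))
```

Under these assumptions, both the XGLM shape (1, 1, 19, 19) and the swin_b shape (64, 49, 49) have an odd innermost dimension, so their bfloat16 rows are not 4-byte aligned.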
Here is a simple program to reproduce this issue:
import torch
import ttnn

def main(device):
    torch_tensor = torch.rand((1, 1, 19, 19), dtype=torch.bfloat16)
    input_tensor = ttnn.from_torch(torch_tensor, layout=ttnn.ROW_MAJOR_LAYOUT, device=device)
    print(input_tensor)

if __name__ == "__main__":
    device = ttnn.open_device(device_id=0)
    try:
        main(device)
    finally:
        ttnn.close_device(device)
It gave me the error:
RuntimeError: TT_FATAL @ ../ttnn/cpp/ttnn/tensor/layout/page_config.cpp:121: (widthAlignment % page_alignment) == 0
info:
Wrong custom Tensor Layout alignment Alignment([19, 19, 19, 19]). For Row Major layout with element size 2bytes the innermost dimension must align to 2. This is because Buffer data is packed as uint32_t (4 bytes).
Hi @ayerofieiev-tt, do you know if there is a proper way to create a device tensor with an odd innermost dim (e.g. (19, 19)) in row-major layout? I think we encounter this issue in multiple models. Currently I know we can create such tensors in tile layout, but in some places that might not be feasible or easy to do.
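One possible workaround, offered only as an untested sketch and not as a confirmed ttnn recommendation, would be to pad the innermost dimension on host until each row is a multiple of 4 bytes, and to slice the padding off afterwards. The helper below is hypothetical and only computes the padded width. Whether padding is acceptable depends on the op that consumes the tensor.

```python
def padded_inner_dim(width, element_size=2, alignment=4):
    """Smallest width >= `width` whose row size in bytes is a multiple of `alignment`.

    Hypothetical helper, not a ttnn API. Defaults assume bfloat16 elements
    (2 bytes) and the uint32_t (4 byte) packing mentioned in the error message.
    """
    padded = width
    while (padded * element_size) % alignment != 0:
        padded += 1
    return padded


print(padded_inner_dim(19))  # innermost dim of the (1, 1, 19, 19) tensor
print(padded_inner_dim(49))  # innermost dim of the (64, 49, 49) tensor
```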
When I debug the XGLM model test, it hits this error message. If I block the aten.masked_fill.Scalar input variation ["Tensor<[1, 1, 19, 19]> self = ?", "Tensor<[1, 1, 19, 19]> mask = ?", "number value = -3.3895313892515355e+38"], the error above is not shown, so I guess there is something wrong with the masked_fill lowering.
The steps to reproduce involve to_tt_guard.py and this command:
pytest tests/models/xglm/test_xglm.py
After this issue is resolved, please also remove the related blocklist in to_tt_guard.py.