tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Segfault when running matmul on a deallocated tensor #5377

Closed: yieldthought closed this issue 7 months ago

yieldthought commented 7 months ago

Describe the bug In operations::primary::MatMul::validate, these two checks are in the wrong order:

576     TT_FATAL(input_tensor_a.device() == input_tensor_b.device(), "Operands to matmul need to be on the same device!");
577     TT_FATAL(input_tensor_a.buffer() != nullptr and input_tensor_b.buffer() != nullptr, "Operands to matmul need to be allocated in buffers on device!");

Calling tensor.device() on a tensor whose buffer is null segfaults, so the same-device check on line 576 crashes before the null-buffer check on line 577 ever gets a chance to fire.

To Reproduce

import ttnn
import torch
device = ttnn.open(0)
x = ttnn.from_torch(torch.randn((4, 4), dtype=torch.bfloat16), layout=ttnn.Layout.TILE, device=device)
w = ttnn.from_torch(torch.randn((4, 4), dtype=torch.bfloat16), layout=ttnn.Layout.TILE, device=device)
x.value.deallocate()
x @ w

Note: you will need to reset the device after this crash.

Expected behavior A TT_FATAL error reporting that input tensor a has a null buffer (or a similarly descriptive message)

Screenshots

>>> x @ w
                     Op | INFO     | Finished Operation ttnn.reshape                                       in         2465545 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in         1042518 nanoseconds
Segmentation fault (core dumped)


Additional context This might affect other ops that use this assert order too.

yieldthought commented 7 months ago

Yep, several ops are affected; I'll fix them all:

$ grep -A1 -r 'TT_FATAL(input_tensor_a.device() == input_tensor_b.device()' *
tt_eager/tt_dnn/op_library/transformer_tms/transformer_tms.cpp:    TT_FATAL(input_tensor_a.device() == input_tensor_b.device(), "Operands to matmul need to be on the same device!");
tt_eager/tt_dnn/op_library/transformer_tms/transformer_tms.cpp-    TT_FATAL(input_tensor_a.buffer() != nullptr and input_tensor_b.buffer() != nullptr, "Operands to matmul need to be allocated in buffers on device!");
--
tt_eager/tt_dnn/op_library/transformer_tms/transformer_tms.cpp:    TT_FATAL(input_tensor_a.device() == input_tensor_b.device(), "Operands to matmul need to be on the same device!");
tt_eager/tt_dnn/op_library/transformer_tms/transformer_tms.cpp-    TT_FATAL(input_tensor_a.buffer() != nullptr and input_tensor_b.buffer() != nullptr, "Operands to matmul need to be allocated in buffers on device!");
--
tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:    TT_FATAL(input_tensor_a.device() == input_tensor_b.device(), "Operands to matmul need to be on the same device!");
tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp-    TT_FATAL(input_tensor_a.buffer() != nullptr and input_tensor_b.buffer() != nullptr, "Operands to matmul need to be allocated in buffers on device!");
--
tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp:    TT_FATAL(input_tensor_a.device() == input_tensor_b.device(), "Operands to matmul need to be on the same device!");
tt_eager/tt_dnn/op_library/bmm/bmm_op.cpp-    TT_FATAL(input_tensor_a.buffer() != nullptr and input_tensor_b.buffer() != nullptr, "Operands to matmul need to be allocated in buffers on device!");
--
tt_eager/tt_dnn/op_library/eltwise_binary/eltwise_binary_op.cpp:    TT_FATAL(input_tensor_a.device() == input_tensor_b.device(), "Operands to eltwise binary need to be on the same device!");
tt_eager/tt_dnn/op_library/eltwise_binary/eltwise_binary_op.cpp-    TT_FATAL(input_tensor_a.buffer() != nullptr and input_tensor_b.buffer() != nullptr, "Operands to eltwise binary need to be allocated in buffers on device!");
--
tt_eager/tt_dnn/op_library/bcast/bcast_op.cpp:    TT_FATAL(input_tensor_a.device() == input_tensor_b.device(), "Operands to bcast need to be on the same device!");
tt_eager/tt_dnn/op_library/bcast/bcast_op.cpp-    TT_FATAL(input_tensor_a.buffer() != nullptr and input_tensor_b.buffer() != nullptr, "Operands to bcast need to be allocated in buffers on device!");