tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

Metalium's preprocess_model does not work on my side #11799

Open zhongpanwu opened 3 weeks ago

zhongpanwu commented 3 weeks ago

Describe the bug: I am following the tutorial to run the ResNet basic block code from here.
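For reference, the model under test is the standard ResNet basic block; below is a condensed PyTorch sketch of what the tutorial builds (my own reconstruction, not the exact tutorial code), with shapes matching the `batch_size=8, input_h=56, input_w=56` seen in the log:

```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    # Standard ResNet basic block: two 3x3 convolutions with batch norm
    # and an identity skip connection (no downsampling in this variant).
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)


block = BasicBlock(64).eval()
x = torch.randn(8, 64, 56, 56)  # batch 8, 64 channels, 56x56 as in the log
print(block(x).shape)  # torch.Size([8, 64, 56, 56])
```

This torch module works on its own; the failure only appears once it is handed to ttnn's preprocess_model.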

However, the run always ends in a segfault after calling preprocess_model. Using the Python debugger, I traced the problem to the function _initialize_model_and_preprocess_parameters; I cannot debug further because running under the debugger drives the device's temperature up dramatically.

Sometimes the error instead points to math.hpp. Here is the output:

2024-08-22 17:01:51.959 | WARNING  | ttnn.model_preprocessing:_initialize_model_and_preprocess_parameters:505 - Putting the model in eval mode
2024-08-22 17:01:52.032 | DEBUG    | ttnn.operations.conv.tt_py_composite_conv:determine_parallel_config:208 - PARALLEL CONFIG :: True :: 64 :: 64 :: SlidingWindowOpParams(stride_h=1, stride_w=1, pad_h=1, pad_w=1, window_h=3, window_w=3, batch_size=8, input_h=56, input_w=56) :: {} -> 98 :: [12, 9] :: 8 :: 2
Traceback (most recent call last):
  File "tutorial_6.py", line 85, in <module>
    parameters = preprocess_model(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/model_preprocessing.py", line 710, in preprocess_model
    return from_torch(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/model_preprocessing.py", line 577, in from_torch
    model = _initialize_model_and_preprocess_parameters(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/model_preprocessing.py", line 509, in _initialize_model_and_preprocess_parameters
    parameters = convert_torch_model_to_ttnn_model(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/model_preprocessing.py", line 286, in convert_torch_model_to_ttnn_model
    child_parameters = convert_torch_model_to_ttnn_model(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/model_preprocessing.py", line 259, in convert_torch_model_to_ttnn_model
    default_preprocessor_parameters = default_preprocessor(model, name, ttnn_module_args)
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/model_preprocessing.py", line 195, in default_preprocessor
    parameters = preprocess_conv2d(model.weight, model.bias, ttnn_module_args)
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/model_preprocessing.py", line 56, in preprocess_conv2d
    conv = ttnn.Conv2d(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/operations/conv2d.py", line 126, in __init__
    self.conv = TTPyCompositeConv(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/operations/conv/tt_py_composite_conv.py", line 580, in __init__
    self.tt_py_untilize_with_halo_op = TTPyUntilizeWithHalo(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/operations/conv/tt_py_untilize_with_halo.py", line 41, in __init__
    self.set_op_configs(
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/operations/conv/tt_py_untilize_with_halo.py", line 213, in set_op_configs
    padding_config_tensor = gen_per_core_gather_data_uint16_tensor(padding_config)
  File "/home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/operations/conv/tt_py_untilize_with_halo.py", line 180, in gen_per_core_gather_data_uint16_tensor
    tt_tensor = tt_tensor.to(device, get_memory_config(shard_shape)) if device is not None else tt_tensor
RuntimeError: TT_FATAL @ ../tt_metal/common/math.hpp:16: b > 0
backtrace:
 --- tt::tt_metal::allocator::BankManager::allocate_buffer(unsigned int, unsigned int, bool, tt::umd::xy_pair, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::base_alloc(tt::tt_metal::AllocatorConfig const&, tt::tt_metal::allocator::BankManager&, unsigned long, unsigned long, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::allocator::allocate_buffer(tt::tt_metal::Allocator&, unsigned int, unsigned int, tt::tt_metal::BufferType const&, bool, std::__1::optional<unsigned int>)
 --- tt::tt_metal::EnqueueAllocateBufferImpl(tt::tt_metal::AllocBufferMetadata)
 --- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::EnqueueAllocateBuffer(tt::tt_metal::CommandQueue&, tt::tt_metal::Buffer*, bool, bool)
 --- tt::tt_metal::detail::AllocateBuffer(tt::tt_metal::Buffer*, bool)
 --- tt::tt_metal::Buffer::allocate()
 --- tt::tt_metal::Buffer::Buffer(tt::tt_metal::Device*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, std::__1::optional<bool>, bool)
 --- /home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/_ttnn.so(+0xc75068) [0x7f96f20d0068]
 --- tt::tt_metal::tensor_impl::detail::allocate_sharded_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::ShardSpecBuffer const&, tt::tt_metal::MemoryConfig const&)
 --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned int, tt::tt_metal::Device*, tt::tt_metal::Shape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&)
 --- /home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/_ttnn.so(_ZN2tt8tt_metal11tensor_impl25initialize_data_on_deviceItNS0_15borrowed_buffer6BufferEEENSt3__110shared_ptrINS0_6BufferEEERT0_IT_EPNS0_6DeviceERKNS0_5ShapeENS0_8DataTypeENS0_6LayoutERKNS0_12MemoryConfigERKNS5_8optionalINS0_15ShardSpecBufferEEENSN_INS5_17reference_wrapperINS0_12CommandQueueEEEEE+0x2a) [0x7f96f20d38ea]
 --- /home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/_ttnn.so(+0xc786df) [0x7f96f20d36df]
 --- /home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/_ttnn.so(_ZN2tt8tt_metal11tensor_impl16to_device_bufferItEENSt3__110shared_ptrINS0_6BufferEEERKNS3_7variantIJNS0_12OwnedStorageENS0_13DeviceStorageENS0_15BorrowedStorageENS0_22MultiDeviceHostStorageENS0_18MultiDeviceStorageEEEEPNS0_6DeviceERKNS0_5ShapeENS0_8DataTypeENS0_6LayoutERKNS0_12MemoryConfigERKNS3_8optionalINS0_15ShardSpecBufferEEENSQ_INS3_17reference_wrapperINS0_12CommandQueueEEEEE+0x15a) [0x7f96f202b51a]
 --- tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_device<unsigned short>(tt::tt_metal::Tensor const&, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&, std::__1::optional<std::__1::reference_wrapper<tt::tt_metal::CommandQueue>>)
 --- auto tt::tt_metal::tensor_impl::dispatch<auto tt::tt_metal::tensor_impl::to_device_wrapper<tt::tt_metal::Tensor&, tt::tt_metal::Device*&, tt::tt_metal::MemoryConfig const&, std::__1::nullopt_t const&>(tt::tt_metal::Tensor&, tt::tt_metal::Device*&, tt::tt_metal::MemoryConfig const&, std::__1::nullopt_t const&)::'lambda'<typename $T>(auto&&...), tt::tt_metal::Tensor&, tt::tt_metal::Device*&, tt::tt_metal::MemoryConfig const&, std::__1::nullopt_t const&>(tt::tt_metal::DataType, auto tt::tt_metal::tensor_impl::to_device_wrapper<tt::tt_metal::Tensor&, tt::tt_metal::Device*&, tt::tt_metal::MemoryConfig const&, std::__1::nullopt_t const&>(tt::tt_metal::Tensor&, tt::tt_metal::Device*&, tt::tt_metal::MemoryConfig const&, std::__1::nullopt_t const&)::'lambda'<typename $T>(auto&&...)&&, auto&&...)
 --- /home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/_ttnn.so(+0xc8f290) [0x7f96f20ea290]
 --- tt::tt_metal::Tensor::to(tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&) const
 --- /home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/_ttnn.so(+0x17b8e8c) [0x7f96f2c13e8c]
 --- /home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/_ttnn.so(+0x17b8d5e) [0x7f96f2c13d5e]
 --- /home/zhongpan/Desktop/zp/Tenstorrent/tt-metal/ttnn/ttnn/_ttnn.so(+0xf1445e) [0x7f96f236f45e]
 --- python(PyCFunction_Call+0x59) [0x5f58f9]
 --- python(_PyObject_MakeTpCall+0x29e) [0x5f64ce]
 --- python() [0x50b4b3]
 --- python(_PyEval_EvalFrameDefault+0x5777) [0x570b67]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python(_PyEval_EvalFrameDefault+0x725) [0x56bb15]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python(_PyEval_EvalFrameDefault+0x907) [0x56bcf7]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python() [0x59c010]
 --- python(_PyObject_MakeTpCall+0x1ff) [0x5f642f]
 --- python(_PyEval_EvalFrameDefault+0x62fd) [0x5716ed]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python() [0x59c010]
 --- python(_PyObject_MakeTpCall+0x1ff) [0x5f642f]
 --- python(_PyEval_EvalFrameDefault+0x62fd) [0x5716ed]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python() [0x59c010]
 --- python() [0x5a6ca7]
 --- python(PyObject_Call+0x25e) [0x5f526e]
 --- python(_PyEval_EvalFrameDefault+0x1f2d) [0x56d31d]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python(_PyEval_EvalFrameDefault+0x725) [0x56bb15]
 --- python(_PyFunction_Vectorcall+0x1b6) [0x5f5ca6]
 --- python(_PyEval_EvalFrameDefault+0x725) [0x56bb15]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python(_PyEval_EvalFrameDefault+0x1905) [0x56ccf5]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python(_PyEval_EvalFrameDefault+0x1905) [0x56ccf5]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python(_PyEval_EvalFrameDefault+0x1905) [0x56ccf5]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python(_PyEval_EvalFrameDefault+0x1905) [0x56ccf5]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(_PyFunction_Vectorcall+0x393) [0x5f5e83]
 --- python(_PyEval_EvalFrameDefault+0x1905) [0x56ccf5]
 --- python(_PyEval_EvalCodeWithName+0x26a) [0x569dfa]
 --- python(PyEval_EvalCode+0x27) [0x68ce77]
 --- python() [0x67e631]
 --- python() [0x67e6af]
 --- python() [0x67e751]
 --- python(PyRun_SimpleFileExFlags+0x197) [0x67f3e7]
 --- python(Py_RunMain+0x212) [0x6b67c2]
 --- python(Py_BytesMain+0x2d) [0x6b6b4d]
 --- /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f97235fb083]
 --- python(_start+0x2e) [0x5fa81e]

                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | Closing user mode device drivers

OS: Ubuntu 20.04, device: Grayskull e150, FW: 80.10.0.0
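For context, the `TT_FATAL @ ../tt_metal/common/math.hpp:16: b > 0` assertion reads like a guard on a ceiling-division (round-up) helper, which would fire if the allocator is handed a zero page or shard size. A minimal Python sketch of that kind of guard (hypothetical, not the actual tt-metal code):

```python
def div_up(a: int, b: int) -> int:
    """Ceiling division: smallest n such that n * b >= a."""
    # Analogous to the TT_FATAL "b > 0" check: a zero divisor here
    # would mean a zero-sized page/shard reached buffer allocation.
    if b <= 0:
        raise ValueError("b > 0")
    return (a + b - 1) // b


print(div_up(98, 32))  # 4
```

If that guess is right, the bug would be upstream of the assertion: some shard-shape or page-size computation in the conv/halo path producing 0 before allocation.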

Thanks in advance for any help.

zhongpanwu commented 3 weeks ago

Sorry, the segfault output is attached below:

2024-08-22 23:29:27.348 | WARNING  | ttnn.decorators:operation_decorator:792 - Should ttnn.Conv1d be migrated to C++?
                 Device | INFO     | Opening user mode device driver
2024-08-22 23:29:27.373 | INFO     | SiliconDriver   - Detected 1 PCI device : [0]
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1202 MHz
2024-08-22 23:29:28.914 | DEBUG    | ttnn:manage_config:90 - Set ttnn.CONFIG.enable_logging to False
2024-08-22 23:29:28.914 | DEBUG    | ttnn:manage_config:90 - Set ttnn.CONFIG.enable_comparison_mode to False
2024-08-22 23:29:28.915 | WARNING  | ttnn.model_preprocessing:from_torch:560 - ttnn: model cache can be enabled by passing model_name argument to preprocess_model[_parameters]
2024-08-22 23:29:28.915 | WARNING  | ttnn.model_preprocessing:_initialize_model_and_preprocess_parameters:505 - Putting the model in eval mode
2024-08-22 23:29:28.998 | DEBUG    | ttnn.operations.conv.tt_py_composite_conv:determine_parallel_config:208 - PARALLEL CONFIG :: True :: 64 :: 64 :: SlidingWindowOpParams(stride_h=1, stride_w=1, pad_h=1, pad_w=1, window_h=3, window_w=3, batch_size=8, input_h=56, input_w=56) :: {} -> 98 :: [12, 9] :: 8 :: 2
Segmentation fault (core dumped)

ayerofieiev-tt commented 3 weeks ago

@zhongpanwu, thank you for the report! We will check and follow up tomorrow.