tenstorrent / tt-metal

TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

GS watcher error with `bert_large_performant` #6487

Open TT-billteng opened 3 months ago

TT-billteng commented 3 months ago

TT_METAL_WATCHER=60 pytest models/experimental/bert_large_performant/unit_tests/test_bert_large_split_query_key_value_and_split_heads.py::test_split_query_key_value_and_split_heads -v

models.experimental.bert_large_performant.unit_tests.test_bert_large_split_query_key_value_and_split_heads:run_split_query_key_value_and_split_heads_test:52 - v: BufferType.L1 and DataType.BFLOAT8_B
2024-03-16T05:29:28.3021735Z               LLRuntime | INFO     | Watcher checking device 0
2024-03-16T05:29:28.3024248Z                  Always | INFO     | While running kernels:
2024-03-16T05:29:28.3026580Z                  Always | INFO     |  brisc : tt_eager/tt_dnn/op_library/transformer_tms/kernels/dataflow/writer_tm_tile_layout_create_qkv_heads.cpp
2024-03-16T05:29:28.3029402Z                  Always | INFO     |  ncrisc: tt_eager/tt_dnn/op_library/transformer_tms/kernels/dataflow/reader_tm_tile_layout_create_qkv_heads.cpp
2024-03-16T05:29:28.3031777Z                  Always | INFO     |  triscs: tt_eager/tt_dnn/kernels/compute/transpose_wh.cpp
2024-03-16T05:29:28.3033542Z                  Always | INFO     | Last waypoint: NWBD,W,W,W,W 
2024-03-16T05:29:28.3035058Z terminate called after throwing an instance of 'std::runtime_error'
2024-03-16T05:29:28.3036170Z   what():  TT_THROW @ tt_metal/impl/debug/watcher_server.cpp:291: tt::exception
2024-03-16T05:29:28.3037010Z info:
2024-03-16T05:29:28.3037703Z Watcher detected an assert: core {}, riscv {}, line {}. Current kernel: {}. {}
2024-03-16T05:29:28.3038577Z (x=1,y=1)
2024-03-16T05:29:28.3039747Z               LLRuntime | INFO     | Watcher stopped the device due to tripped assert.
2024-03-16T05:29:28.3044278Z                  Always | FATAL    | Watcher detected an assert: core (x=1,y=1), riscv brisc, line 195. Current kernel: tt_eager/tt_dnn/op_library/transformer_tms/kernels/dataflow/writer_tm_tile_layout_create_qkv_heads.cpp. Note that file name reporting is not yet implemented, and the reported line number for the assert may be from a different file.
2024-03-16T05:29:28.3047638Z brisc
2024-03-16T05:29:28.3047998Z 195
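
For triage, the fields in the FATAL line above (core, RISC-V processor, line number, kernel path) can be pulled out of a CI log mechanically. The following is a minimal sketch, assuming the message keeps the exact wording shown in this log. `parse_watcher_assert` and `ASSERT_RE` are hypothetical names introduced here for illustration and are not part of tt-metal.

```python
import re

# Hypothetical helper, not part of tt-metal. The pattern is written against
# the FATAL message wording seen in this issue's log and may need adjusting
# if the watcher message format changes.
ASSERT_RE = re.compile(
    r"Watcher detected an assert: core \(x=(?P<x>\d+),y=(?P<y>\d+)\), "
    r"riscv (?P<riscv>\w+), line (?P<line>\d+)\. "
    r"Current kernel: (?P<kernel>\S+)\.(?:\s|$)"
)


def parse_watcher_assert(log_line):
    """Return the assert fields from one log line, or None if it does not match."""
    match = ASSERT_RE.search(log_line)
    if match is None:
        return None
    return {
        "core": (int(match["x"]), int(match["y"])),
        "riscv": match["riscv"],
        "line": int(match["line"]),
        "kernel": match["kernel"],
    }


if __name__ == "__main__":
    fatal = (
        "Always | FATAL    | Watcher detected an assert: core (x=1,y=1), "
        "riscv brisc, line 195. Current kernel: tt_eager/tt_dnn/op_library/"
        "transformer_tms/kernels/dataflow/"
        "writer_tm_tile_layout_create_qkv_heads.cpp. Note that file name "
        "reporting is not yet implemented"
    )
    print(parse_watcher_assert(fatal))
```

The unformatted template line (`core {}, riscv {}, line {}`) does not match the pattern, so only the fully formatted FATAL message is picked up. As the log itself notes, the reported line number may refer to a different file than the kernel named in the message.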
TT-billteng commented 3 months ago

Hey @jliangTT, I'm not sure who should own this issue. Can you help?