tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Running Falcon7B and Mamba perf tests back-to-back triggers a hang #9405

Open esmalTT opened 3 months ago

esmalTT commented 3 months ago

Summary

Running Falcon7B and Mamba perf tests back-to-back causes a hang on bare metal CI machines. This issue is related to #8606, which I have closed since disabling the persistent kernel cache seems to be a workaround for this problem.

Steps to reproduce

It does not seem possible to replicate this issue on lab machines, but I have been able to reproduce it on machines that match the CI configuration.

  1. Get a bare-metal machine on the cloud VPN. I have tested this with WH 130.
  2. Build main in release mode. I have tested this with commit `abfc0172dc16c726f695803bc8379ca1c2eeef25`.
  3. Run the model perf pipeline with the following command: `./tests/scripts/run_tests.sh --tt-arch $ARCH_NAME --pipeline-type llm_javelin_models_performance_bare_metal`. This should hang on the Mamba test in a similar way to the pipeline linked above.

Further Investigation

Reproducing this issue requires the Falcon7B tests to run before the Mamba ones. Running them in the reverse order (Mamba before Falcon7B) with a fresh build does not seem to trigger the hang.

It appears that the hang is related to the persistent program cache, since disabling the persistent cache in the Mamba tests allows them to pass, even if Falcon7B ran previously.
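For reference, a minimal sketch of that workaround as a pytest fixture, assuming enable/disable helpers in `models.utility_functions` (the exact names and import path are assumptions, not a confirmed API):

```python
# Sketch only: disable the persistent kernel cache around a single test.
# The helper names below are assumed to mirror the cache toggles used by the
# model tests; adjust them to whatever the repo actually exposes.
import pytest

from models.utility_functions import (  # assumed import path
    disable_persistent_kernel_cache,
    enable_persistent_kernel_cache,
)


@pytest.fixture
def no_persistent_kernel_cache():
    """Turn the persistent kernel cache off for one test, then restore it."""
    disable_persistent_kernel_cache()
    yield
    enable_persistent_kernel_cache()


# Usage: add the fixture to the Mamba perf test signature, e.g.
# def test_mamba_perf(no_persistent_kernel_cache, device):
#     ...
```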

The hanging op is ShardedToInterleaved:

                     Op | DEBUG    | Started   C++ ttnn operation: ttnn::to_memory_config
                     Op | DEBUG    | Launching Operation: "ShardedToInterleaved" (device<Tensors>)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    grid_size = (x=8,y=7)
                     Op | DEBUG    |    sharded_op_type = ShardedOpType::ShardedToInterleaved
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
                     Op | DEBUG    |    output_dtype = DataType::BFLOAT8_B
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::WIDTH_SHARDED,buffer_type=BufferType::L1,shard_spec=tt::tt_metal::ShardSpec(grid={[(x=0,y=0) - (x=7,y=4)]},shape={32, 128},orientation=ShardOrientation::ROW_MAJOR,halo=0))),shape=ttnn.Shape([1, 1, 32, 5120]),dtype=DataType::BFLOAT8_B,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Program Hash: 16324976833888547997 (HIT)
                     Op | DEBUG    | Kernel info: writer_unary_sharded_blocks_interleaved_start_id/7476807050483506974/

Based on the logs, it seems that the cache entry for this particular op is populated during the Mamba execution, since there are no matching program hashes in the Falcon7B execution. I have tried to isolate the issue by running ShardedToInterleaved directly, but I have not been able to reproduce the hang that way.
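A standalone isolation attempt along those lines might look roughly like the sketch below, reconstructed from the logged attributes (width-sharded [1, 1, 32, 5120] BFLOAT8_B input in L1 with 32x128 shards on an 8x5 grid, converted to L1 interleaved). The ttnn helper names and signatures are assumptions and this is an illustration, not the exact test that was run:

```python
# Hypothetical standalone repro attempt, reconstructed from the debug log above.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Width-sharded memory config matching the shard spec from the log:
# grid (0,0)-(7,4) = 8x5 cores, per-core shard shape 32x128.
sharded_config = ttnn.create_sharded_memory_config(
    shape=(32, 128),
    core_grid=ttnn.CoreGrid(y=5, x=8),
    strategy=ttnn.ShardStrategy.WIDTH,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    use_height_and_width_as_shard_shape=True,
)

torch_input = torch.randn(1, 1, 32, 5120, dtype=torch.bfloat16)
sharded = ttnn.from_torch(
    torch_input,
    dtype=ttnn.bfloat8_b,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=sharded_config,
)

# This call maps onto the hanging ShardedToInterleaved program
# (ttnn::to_memory_config in the log).
interleaved = ttnn.to_memory_config(sharded, ttnn.L1_MEMORY_CONFIG)
result = ttnn.to_torch(interleaved)

ttnn.close_device(device)
```

Even with an isolated call like this, the hang does not reproduce, which points back at state carried over from the earlier Falcon7B run rather than the op itself.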

Watcher

The state of the device at the hang, obtained through Watcher, is as follows:

Device 0 worker core(x= 0,y= 0) phys(x= 1,y= 1):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 1,y= 0) phys(x= 2,y= 1):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 2,y= 0) phys(x= 3,y= 1):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 3,y= 0) phys(x= 4,y= 1):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 4,y= 0) phys(x= 6,y= 1):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 5,y= 0) phys(x= 7,y= 1):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 6,y= 0) phys(x= 8,y= 1):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 7,y= 0) phys(x= 9,y= 1):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 0,y= 1) phys(x= 1,y= 2):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 1,y= 1) phys(x= 2,y= 2):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 2,y= 1) phys(x= 3,y= 2):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 3,y= 1) phys(x= 4,y= 2):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 4,y= 1) phys(x= 6,y= 2):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 5,y= 1) phys(x= 7,y= 2):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 6,y= 1) phys(x= 8,y= 2):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 7,y= 1) phys(x= 9,y= 2):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 0,y= 2) phys(x= 1,y= 3):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 1,y= 2) phys(x= 2,y= 3):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 2,y= 2) phys(x= 3,y= 3):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 3,y= 2) phys(x= 4,y= 3):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 4,y= 2) phys(x= 6,y= 3):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 5,y= 2) phys(x= 7,y= 3):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 6,y= 2) phys(x= 8,y= 3):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 7,y= 2) phys(x= 9,y= 3):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 0,y= 3) phys(x= 1,y= 5):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 1,y= 3) phys(x= 2,y= 5):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 2,y= 3) phys(x= 3,y= 5):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 3,y= 3) phys(x= 4,y= 5):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 4,y= 3) phys(x= 6,y= 5):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 5,y= 3) phys(x= 7,y= 5):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 6,y= 3) phys(x= 8,y= 5):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 7,y= 3) phys(x= 9,y= 5):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 0,y= 4) phys(x= 1,y= 7):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 1,y= 4) phys(x= 2,y= 7):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 2,y= 4) phys(x= 3,y= 7):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 3,y= 4) phys(x= 4,y= 7):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 4,y= 4) phys(x= 6,y= 7):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 5,y= 4) phys(x= 7,y= 7):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 6,y= 4) phys(x= 8,y= 7):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 7,y= 4) phys(x= 9,y= 7):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 0,y= 5) phys(x= 1,y= 8):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 1,y= 5) phys(x= 2,y= 8):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 2,y= 5) phys(x= 3,y= 8):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 3,y= 5) phys(x= 4,y= 8):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 4,y= 5) phys(x= 6,y= 8):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 5,y= 5) phys(x= 7,y= 8):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 6,y= 5) phys(x= 8,y= 8):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 7,y= 5) phys(x= 9,y= 8):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 0,y= 6) phys(x= 1,y= 9):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 1,y= 6) phys(x= 2,y= 9):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 2,y= 6) phys(x= 3,y= 9):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 3,y= 6) phys(x= 4,y= 9):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 4,y= 6) phys(x= 6,y= 9):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 5,y= 6) phys(x= 7,y= 9):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 6,y= 6) phys(x= 8,y= 9):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 7,y= 6) phys(x= 9,y= 9):    R,   W,   W,   W,   W  rmsg:D1G|BNt smsg:DDDD k_ids:4|3|0
Device 0 worker core(x= 0,y= 7) phys(x= 1,y=10):  NTW, PWW,   W,   W,   W  rmsg:H1G|bNt smsg:GDDD k_ids:0|2|0
Device 0 worker core(x= 1,y= 7) phys(x= 2,y=10):  NTW, PSW,   W,   W,   W  rmsg:H1G|bNt smsg:GDDD k_ids:0|1|0
Device 0 worker core(x= 2,y= 7) phys(x= 3,y=10):   GW,   W,   W,   W,   W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 3,y= 7) phys(x= 4,y=10):   GW,   W,   W,   W,   W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 4,y= 7) phys(x= 6,y=10):   GW,   W,   W,   W,   W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 5,y= 7) phys(x= 7,y=10):   GW,   W,   W,   W,   W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 6,y= 7) phys(x= 8,y=10):   GW,   W,   W,   W,   W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 7,y= 7) phys(x= 9,y=10):   GW,   W,   W,   W,   W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
Device 0 ethnet core(x= 0,y= 4) phys(x= 7,y= 0):   GW,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y= 5) phys(x= 3,y= 0):   GW,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y= 6) phys(x= 6,y= 0):   GW,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y= 7) phys(x= 4,y= 0):   GW,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y= 8) phys(x= 9,y= 6):  RED,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y= 9) phys(x= 1,y= 6):  RED,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y=10) phys(x= 8,y= 6):   GW,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y=11) phys(x= 2,y= 6):   GW,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y=12) phys(x= 7,y= 6):   GW,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y=13) phys(x= 3,y= 6):   GW,   X,   X,   X,   X  k_id:0
Device 0 ethnet core(x= 0,y=14) phys(x= 6,y= 6):   GW,   X,   X,   X,   X  k_id:0
k_id[0]: blank
k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetch.cpp
k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatch.cpp
k_id[3]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[4]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
Dump #43 completed at 432.168s
esmalTT commented 3 months ago

FYI @skhorasganiTT

davorchap commented 3 months ago

Is the root cause here also the persistent kernel cache?

esmalTT commented 3 months ago

@davorchap That's what it seems like to me.