Summary
Running Falcon7B and Mamba perf tests back-to-back causes a hang on bare-metal CI machines. This issue is related to #8606, which I have closed since disabling the persistent kernel cache seems to work around this problem.
Steps to reproduce
I have not been able to replicate this issue on lab machines, but I can reproduce it on machines that match the CI configuration.
1. Get a bare-metal machine on the cloud VPN. I have tested this with WH 130.
2. Build main in release mode. I have tested this with abfc0172dc16c726f695803bc8379ca1c2eeef25.
3. Run the model perf pipeline with the following command: ./tests/scripts/run_tests.sh --tt-arch $ARCH_NAME --pipeline-type llm_javelin_models_performance_bare_metal. This should hang on the Mamba test in a similar way to the pipeline linked above.
Further Investigation
Reproducing this issue requires the Falcon7B tests to run before the Mamba ones; running them in the opposite order with a fresh build does not seem to trigger the hang.
It appears that the hang is related to the persistent program cache, since disabling the persistent cache in the Mamba tests allows them to pass, even if Falcon7B ran previously.
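For reference, a sketch of how this workaround can be wired into a pytest-based model test. The tt_lib.device binding names below are my assumptions based on the C++ API (tt::tt_metal::detail::EnablePersistentKernelCache / DisablePersistentKernelCache), not verbatim from the failing tests:

```python
# Sketch only: assumed Python bindings for the persistent kernel cache toggles.
import pytest
import tt_lib


@pytest.fixture
def no_persistent_kernel_cache():
    # Skip the on-disk compiled-kernel cache so every run recompiles its kernels.
    tt_lib.device.DisablePersistentKernelCache()
    yield
    tt_lib.device.EnablePersistentKernelCache()
```

Applying something along these lines to the Mamba perf tests is what allows them to pass even when Falcon7B runs first.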
The hanging op is ShardedToInterleaved:
Op | DEBUG | Started C++ ttnn operation: ttnn::to_memory_config
Op | DEBUG | Launching Operation: "ShardedToInterleaved" (device<Tensors>)
Op | DEBUG | Attributes:
Op | DEBUG | grid_size = (x=8,y=7)
Op | DEBUG | sharded_op_type = ShardedOpType::ShardedToInterleaved
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
Op | DEBUG | output_dtype = DataType::BFLOAT8_B
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::WIDTH_SHARDED,buffer_type=BufferType::L1,shard_spec=tt::tt_metal::ShardSpec(grid={[(x=0,y=0) - (x=7,y=4)]},shape={32, 128},orientation=ShardOrientation::ROW_MAJOR,halo=0))),shape=ttnn.Shape([1, 1, 32, 5120]),dtype=DataType::BFLOAT8_B,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Program Hash: 16324976833888547997 (HIT)
Op | DEBUG | Kernel info: writer_unary_sharded_blocks_interleaved_start_id/7476807050483506974/
Based on the logs, it seems that the cache entry for this particular op is populated during the Mamba execution, since there are no matching program hashes in the Falcon7B execution. I have tried to isolate the issue by invoking ShardedToInterleaved directly, but I haven't been able to reproduce the hang that way.
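For reference, my isolation attempt looked roughly like the sketch below: recreate the width-sharded BFLOAT8_B tensor from the log (32x128 shards across an 8x5 grid) and move it back to interleaved L1. The exact ttnn helper signatures are assumptions based on the public Python API:

```python
# Sketch of a direct ShardedToInterleaved repro attempt (did not hang for me).
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# 8x5 grid = 40 cores; 5120 / 40 = 128-wide shards, matching the log above.
sharded_config = ttnn.create_sharded_memory_config(
    shape=(1, 1, 32, 5120),
    core_grid=ttnn.CoreGrid(y=5, x=8),
    strategy=ttnn.ShardStrategy.WIDTH,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)

tensor = ttnn.from_torch(
    torch.randn(1, 1, 32, 5120),
    dtype=ttnn.bfloat8_b,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=sharded_config,
)

# The op that hangs in the Mamba run: ShardedToInterleaved via to_memory_config.
out = ttnn.to_memory_config(tensor, ttnn.L1_MEMORY_CONFIG)

ttnn.close_device(device)
```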
Watcher
The device state at the time of the hang, as captured by Watcher, is as follows: