Open pavlepopovic opened 4 months ago
fyi @pavlejosipovic @jvasilje
Update:
Narrowed down the problem to a 1d matmul with height-sharded input0, with fuse_batch=True and mcast_in0=False.
Here's the code that causes a hang/seg fault on an N300 machine:
```python
import pytest
import torch

import tt_lib as ttl
# helper import paths may differ depending on tt-metal version
from models.utility_functions import torch2tt_tensor, tt2torch_tensor


@pytest.mark.parametrize("num_cores", [64])
def test_problematic_matmul(device, num_cores):
    compute_grid_size = device.compute_with_storage_grid_size()
    if num_cores > (compute_grid_size.x * compute_grid_size.y):
        pytest.skip(f"Need {num_cores} cores to run this test but core grid is {compute_grid_size}")

    grid_size = (8, 8)
    in0_shape = [1, 1, 18176, 64]
    in1_shape = [1, 1, 64, 1024]
    torch_in0 = torch.randn(in0_shape).bfloat16().float()
    torch_in1 = torch.randn(in1_shape).bfloat16().float()

    dram_interleaved_memory_config = ttl.tensor.MemoryConfig(
        memory_layout=ttl.tensor.TensorMemoryLayout.INTERLEAVED,
        buffer_type=ttl.tensor.BufferType.DRAM,
    )
    height_sharded_memory_config = ttl.tensor.MemoryConfig(
        memory_layout=ttl.tensor.TensorMemoryLayout.HEIGHT_SHARDED, buffer_type=ttl.tensor.BufferType.L1
    )

    # 18176 rows = 568 tiles of height, spread over 64 cores -> 9 tiles per shard
    tiles_per_shard = 9
    mm_activations_height_shard_spec = [tiles_per_shard * 32, 2 * 32]
    in0_mem_config = ttl.tensor.MemoryConfig(
        ttl.tensor.TensorMemoryLayout.HEIGHT_SHARDED,
        ttl.tensor.BufferType.L1,
        ttl.tensor.ShardSpec(
            ttl.tensor.CoreRangeSet(
                {
                    ttl.tensor.CoreRange(
                        ttl.tensor.CoreCoord(0, 0),
                        ttl.tensor.CoreCoord(7, 7),
                    ),
                }
            ),
            mm_activations_height_shard_spec,
            ttl.tensor.ShardOrientation.ROW_MAJOR,
            False,
        ),
    )

    in0_tt = torch2tt_tensor(
        torch_in0,
        device,
        tt_memory_config=in0_mem_config,
        tt_dtype=ttl.tensor.DataType.BFLOAT16,
    )
    in1_tt = torch2tt_tensor(
        torch_in1,
        device,
        tt_memory_config=dram_interleaved_memory_config,
        tt_dtype=ttl.tensor.DataType.BFLOAT16,
    )

    program_config = ttl.operations.primary.MatmulMultiCoreReuseMultiCast1DProgramConfig(
        compute_with_storage_grid_size=grid_size,
        in0_block_w=2,
        per_core_M=tiles_per_shard,
        per_core_N=1024 // 32,
        out_subblock_h=1,
        out_subblock_w=1,
        fuse_batch=True,
        fused_activation=None,
        mcast_in0=False,
    )
    compute_kernel_config = ttl.tensor.WormholeComputeKernelConfig(
        math_fidelity=ttl.tensor.MathFidelity.HiFi4,
        math_approx_mode=True,
        fp32_dest_acc_en=False,
        packer_l1_acc=True,
    )

    tt_out = ttl.operations.primary.matmul(
        in0_tt,
        in1_tt,
        program_config=program_config,
        output_mem_config=dram_interleaved_memory_config,
        output_dtype=ttl.tensor.DataType.BFLOAT16,
        compute_kernel_config=compute_kernel_config,
    )
    out = tt2torch_tensor(tt_out)

    # output correctness is deliberately not checked; the point is the hang/seg fault
    passing = True
    assert passing
```
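For reference, the per-core parameters in the program config above follow directly from the tensor shapes; a quick sketch of the arithmetic (assuming the standard 32x32 tile size):

```python
import math

TILE = 32

# Shapes from the repro: in0 = [1, 1, 18176, 64], in1 = [1, 1, 64, 1024]
M, K, N = 18176, 64, 1024
num_cores = 64  # 8x8 grid

m_tiles = M // TILE                               # 568 tiles along the height
tiles_per_shard = math.ceil(m_tiles / num_cores)  # 9 -> per_core_M
shard_shape = [tiles_per_shard * TILE, K]         # [288, 64] -> mm_activations_height_shard_spec
per_core_N = N // TILE                            # 32
in0_block_w = K // TILE                           # 2

# Note: 64 cores * 288 rows = 18432 > 18176, so the last shards carry padding.
print(m_tiles, tiles_per_shard, shard_shape, per_core_N, in0_block_w)
```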
Commenting out the following made the test pass (incorrectly though, without seg faults/hangs), even when run 100k times in a loop:
So this MM seems to be the problem.
However, I tried to run the entire test (the first test attached in this file), and it still causes an irreparable seg fault even with this fix, though only once in ~30k runs (much less often), so it seems there is another problem here as well. Will continue investigating.
I ran single and 8 chip variants of both tests on a T3000. We're unable to reproduce the segfault on main. For the minimal test, what I see instead is a hang after either the first or second matmul for both variants.
For the larger test I see this message when running the second matmul: `RuntimeError: Read 0xffffffff from PCIE: you should reset the board.`
My machine gets bricked immediately after.
Given that we're running 8x8 matmuls in both these tests and that they pass if I remove them, this is strongly indicative of a di/dt issue.
SW-side workarounds to try are:
fyi @ttmtrajkovic, @davorchap and I think this could be another di/dt suspect.
Running the test with an 8x7 grid resolved the issue. Also, @ttmtrajkovic set the N300 card to 900 MHz and that made the issue go away for both UTs as well, so it seems likely to be di/dt related.
Another observation: the repro does not reproduce consistently on all machines. On bgd-lab-06 we couldn't get a repro, but on bgd-lab-07 the repro was instant, and both machines have the same spec with the same cards (nebula-x2, 2xN300).
Accidentally closed the issue; reopening it so that once we resolve this di/dt issue we can bring the grid/subblocks of these matmuls back up to 64. Currently, setting the number of cores to 57 makes the problems go away.
@pavlepopovic how is this going?
Waiting for di/dt resolutions before turning subblocks and grid_size back to their proper values. Right now the subblocks are all (1, 1), with a 57-core grid. di/dt is currently being investigated by a number of people, who are running experiments and iterating on firmware to try to mitigate the issue.
One of the checkpoints from #8349. Upon turning on optimised attention in falcon7B prefill, we discovered a segmentation fault when it is run with 1k or 2k sequence lengths. It occurs on single-device N300 and on 8-chip T3000; N150 does not reproduce the issue. It is possible to produce a unit test that contains the sequence of ops causing the issue (I2Spartial -> MM -> Softmax -> MM -> S2InterleavedPartial, run in multiple loops). When the segmentation fault occurs, warm reset does not work, and throws the following error:
Furthermore, `sudo reboot` also does not bring the chips back to a workable state, and this error is thrown upon doing so:

The only way known to me to bring the chips back to a workable state is the following (T3000 procedure):

`sudo shutdown now`
Attaching a file with the unit tests (rename to .py when running locally, as GitHub doesn't allow .py attachments): test_seg_fault.txt
There are 2 unit tests in that file:
The first unit test is an almost copy-pasted sequence of ops from falcon7B attention.
To run it, use the following commands:
Comments:
There is also a simplified unit test in this file. It contains the same ops as the original test, but a sync point via `ttlib.device.Synchronize()` is added after each op (with a print when the sync point begins and ends), and there are no loops. Here's how to run it:
```
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest test_seg_fault.py -k "test_min_repro and 8chips"
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest test_seg_fault.py -k "test_min_repro and 1chips"
```
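The sync-point instrumentation described above can be sketched as follows. This is a hypothetical illustration, not the actual test code: `run_with_sync_points` and the op/sync callables are made-up names, and in the real test the sync callable is `ttlib.device.Synchronize()`.

```python
def run_with_sync_points(ops, synchronize):
    """Run each op, then block on a device sync with begin/end prints,
    so the op that hangs can be pinpointed from the console output.

    ops: list of (name, zero-arg callable) pairs.
    synchronize: zero-arg callable that blocks until the device is idle.
    """
    results = []
    for name, op in ops:
        results.append(op())
        print(f"{name}: sync begin")
        synchronize()  # if this never returns, `name` is the hanging op
        print(f"{name}: sync end")
    return results
```

The value of the pattern is that the hang becomes deterministic and attributable: without the per-op sync, ops are queued asynchronously and the failure surfaces at an arbitrary later point.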
Comments:
Multi chip run:
Single chip run: **always** causes a hang, and always when the sync point for the 3rd op is called. Upon canceling the test during the hang, the machine goes into the same bad state as described above (not resettable except via power cycle).

The second test is IMO easier to debug with, as the seg fault always happens at a deterministic point.
Upon hitting a seg fault, inspecting a core dump reveals this stack trace (it doesn't make sense to me, as it contains closeDevice(); maybe that is some exception handler called as a reaction to something?)