tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Seg fault in falcon7b prefill optimised attention #8644

Open · pavlepopovic opened 4 months ago

pavlepopovic commented 4 months ago

This is one of the checkpoints from #8349. Upon turning on optimised attention for falcon7b prefill, we discovered a segmentation fault when it is run on 1k or 2k sequence lengths. It occurs on a single-device N300 and on an 8-chip T3000; N150 does not reproduce the issue. It is possible to produce a unit test that contains the sequence of ops causing the issue (I2Spartial -> MM -> Softmax -> MM -> S2InterleavedPartial, run in multiple loops). When the segmentation fault occurs, warm reset does not work and throws the following error:

 Starting pci link reset on WH devices at pci indices: 0, 1, 2, 3
Traceback (most recent call last):
  File "/opt/tt_metal_infra/provisioning/provisioning_env/bin/tt-smi", line 8, in <module>
    sys.exit(main())
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 672, in main
    pci_board_reset(args.reset, reinit=True)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi_backend.py", line 582, in pci_board_reset
    reset_wh_boards(reset_wh_pci_idx)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/resets/wh_resets.py", line 271, in reset_wh_boards
    reset_devices = WHChipReset().full_lds_reset(pci_interfaces=boards_to_reset)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_tools_common/wh_reset.py", line 69, in full_lds_reset
    chip.arc_msg(self.MSG_TYPE_ARC_STATE3, wait_for_done=True)
Exception: Read 0xffffffff from ARC scratch[6]: you should reset the board.

Furthermore, sudo reboot also does not bring the chips back to a workable state, and the following error is thrown when it is attempted:

thread '<unnamed>' panicked at crates/pyluwen/src/lib.rs:492:70:
called `Result::unwrap()` on an `Err` value: DeviceOpenFailed { id: 3, source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/opt/tt_metal_infra/provisioning/provisioning_env/bin/tt-smi", line 8, in <module>
    sys.exit(main())
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 672, in main
    pci_board_reset(args.reset, reinit=True)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi_backend.py", line 561, in pci_board_reset
    chip = PciChip(pci_interface=pci_idx)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: DeviceOpenFailed { id: 3, source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }

The only way known to me to bring the chips back to a workable state is the following procedure (T3000).

Attaching a file with unit tests (rename it to .py when running locally, as GitHub doesn't allow .py attachments): test_seg_fault.txt

There are two unit tests in that file:

The first unit test is an almost copy-pasted sequence of ops from falcon7b attention.

To run it, use the following commands:

Comments:

There is also a simplified unit test in this file. It contains the same ops as the original test, but a sync point via ttlib.device.Synchronize() is added after each op is called (with a print when the sync point begins and ends), and there are no loops. Here's how to run it:

Comments:

  1. Multi chip run:

    • This always causes a seg fault, and it always happens after the sync point that follows the first op.
  2. Single chip run:

    • This always causes a hang, and it always happens when the sync point for the 3rd op is called. Upon cancelling the test while it hangs, the machine goes into the same bad state as described above (not resettable except via power cycle).
    • This is reproducible on a single-chip N300 or on T3000.
    • Running with slow dispatch (on N300, or on T3K) does not make the hang go away.

The second test is IMO easier to debug with, as the seg fault always happens at a deterministic point.
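
For reference, the sync-point pattern described above boils down to something like the sketch below. This is an illustration of the idea, not the attached test; the run_synced helper, the print format, the "import tt_lib as ttl" alias, and the exact Synchronize signature are my assumptions.

import tt_lib as ttl

def run_synced(device, name, op, *args, **kwargs):
    # Run a single op, then force a device sync with prints around it,
    # so the console shows exactly which op the hang/seg fault happens on.
    out = op(*args, **kwargs)
    print(f"{name}: sync start")
    ttl.device.Synchronize(device)
    print(f"{name}: sync end")
    return out

Wrapping each op in the simplified test this way, with no loops, is what makes the failure point deterministic.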

Upon hitting a seg fault, inspecting a core dump reveals the stack trace below (it doesn't make sense to me, as it contains CloseDevices(); maybe that is some exception handler called as a reaction to something?):

program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f12994c2740 (LWP 3560))]
(gdb) bt
#0  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  <signal handler called>
#2  0x00007f1227d23ffa in memcpy_to_device (dest=0xe3418020, src=0x7ffdb26a5df0, num_bytes=48) at /home/ppopovic/tt-metal/tt_metal/third_party/umd/device/tt_silicon_driver.cpp:912
#3  0x00007f1227d24875 in write_block (dev=0x67aa100, byte_addr=3812720672, num_bytes=48, buffer_addr=0x7ffdb26a5df0 "", dma_buf_size=0) at /home/ppopovic/tt-metal/tt_metal/third_party/umd/device/tt_silicon_driver.cpp:1057
#4  0x00007f1227d358fa in tt_SiliconDevice::write_device_memory (this=0x6ef5570, mem_ptr=0x7ffdb26a5df0, size_in_bytes=48, target=..., address=98336, fallback_tlb="LARGE_WRITE_TLB") at /home/ppopovic/tt-metal/tt_metal/third_party/umd/device/tt_silicon_driver.cpp:2108
#5  0x00007f1227d4c501 in tt_SiliconDevice::write_to_device (this=0x6ef5570, mem_ptr=0x7ffdb26a5df0, size=48, core=..., addr=98336, fallback_tlb="LARGE_WRITE_TLB", send_epoch_cmd=false, last_send_epoch_cmd=true, ordered_with_prev_remote_write=false) at /home/ppopovic/tt-metal/tt_metal/third_party/umd/device/tt_silicon_driver.cpp:4326
#6  0x00007f1228467dd0 in tt::Cluster::write_core (this=0x7f12287e0ba0 <tt::Cluster::instance()::inst>, mem_ptr=0x7ffdb26a5df0, sz_in_bytes=48, core=..., addr=98336, small_access=false) at ../tt_metal/llrt/tt_cluster.cpp:382
#7  0x00007f122846bc3e in tt::Cluster::set_internal_routing_info_for_ethernet_cores (this=0x7f12287e0ba0 <tt::Cluster::instance()::inst>, enable_internal_routing=false) at ../tt_metal/llrt/tt_cluster.cpp:724
#8  0x00007f122862d622 in tt::tt_metal::detail::CloseDevices (devices=Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
std::map with 8 elements) at ../tt_metal/tt_metal.cpp:177
#9  0x00007f122a6f0cb4 in pybind11::detail::argument_loader<std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > > >::call_impl<void, void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), 0ul, pybind11::detail::void_type>(void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), std::integer_sequence<unsigned long, 0ul>, pybind11::detail::void_type&&) && (this=0x7ffdb26a5fc0, 
    f=@0x5ef8118: 0x7f122862d5f1 <tt::tt_metal::detail::CloseDevices(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >)>) at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/cast.h:1443
#10 0x00007f122a6bb754 in pybind11::detail::argument_loader<std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > > >::call<void, pybind11::detail::void_type, void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >)>(void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >)) && (this=0x7ffdb26a5fc0, 
    f=@0x5ef8118: 0x7f122862d5f1 <tt::tt_metal::detail::CloseDevices(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >)>) at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/cast.h:1417
#11 0x00007f122a67f7c8 in pybind11::cpp_function::initialize<void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), void, std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >, pybind11::name, pybind11::scope, pybind11::sibling, char [627]>(void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), void (*)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [627])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const (this=0x0, call=...) at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/pybind11.h:248
#12 0x00007f122a67fa56 in pybind11::cpp_function::initialize<void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), void, std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >, pybind11::name, pybind11::scope, pybind11::sibling, char [627]>(void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), void (*)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [627])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/pybind11.h:223
#13 0x00007f122a5ad3f6 in pybind11::cpp_function::dispatcher (self=0x7f122ad23d80, args_in=0x7f1222456340, kwargs_in=0x0) at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/pybind11.h:939
pavlepopovic commented 4 months ago

fyi @pavlejosipovic @jvasilje

pavlepopovic commented 4 months ago

Update: I've narrowed down the problem to a 1d matmul with height-sharded input0, with fuse_batch=True and mcast_in0=False. Here's the code that causes a hang/seg fault on an N300 machine:

# Imports assumed from this repo's usual test conventions (not in the original snippet):
import pytest
import torch

import tt_lib as ttl
from models.utility_functions import torch2tt_tensor, tt2torch_tensor


@pytest.mark.parametrize("num_cores", [64])
def test_problematic_matmul(device, num_cores):
    compute_grid_size = device.compute_with_storage_grid_size()
    if num_cores > (compute_grid_size.x * compute_grid_size.y):
        pytest.skip(f"Need {num_cores} cores to run this test but core grid is {compute_grid_size}")
    grid_size = (8, 8)

    in0_shape = [1, 1, 18176, 64]
    in1_shape = [1, 1, 64, 1024]

    torch_in0 = torch.randn(in0_shape).bfloat16().float()
    torch_in1 = torch.randn(in1_shape).bfloat16().float()

    dram_interleaved_memory_config = ttl.tensor.MemoryConfig(
        memory_layout=ttl.tensor.TensorMemoryLayout.INTERLEAVED,
        buffer_type=ttl.tensor.BufferType.DRAM,
    )

    height_sharded_memory_config = ttl.tensor.MemoryConfig(
        memory_layout=ttl.tensor.TensorMemoryLayout.HEIGHT_SHARDED, buffer_type=ttl.tensor.BufferType.L1
    )

    tiles_per_shard = 9
    mm_activations_height_shard_spec = [tiles_per_shard * 32, 2 * 32]

    in0_mem_config = ttl.tensor.MemoryConfig(
        ttl.tensor.TensorMemoryLayout.HEIGHT_SHARDED,
        ttl.tensor.BufferType.L1,
        ttl.tensor.ShardSpec(
            ttl.tensor.CoreRangeSet(
                {
                    ttl.tensor.CoreRange(
                        ttl.tensor.CoreCoord(0, 0),
                        ttl.tensor.CoreCoord(7, 7),
                    ),
                }
            ),
            mm_activations_height_shard_spec,
            ttl.tensor.ShardOrientation.ROW_MAJOR,
            False,
        ),
    )

    in0_tt = torch2tt_tensor(
        torch_in0,
        device,
        tt_memory_config=in0_mem_config,
        tt_dtype=ttl.tensor.DataType.BFLOAT16,
    )

    in1_tt = torch2tt_tensor(
        torch_in1,
        device,
        tt_memory_config=dram_interleaved_memory_config,
        tt_dtype=ttl.tensor.DataType.BFLOAT16,
    )

    program_config = ttl.operations.primary.MatmulMultiCoreReuseMultiCast1DProgramConfig(
        compute_with_storage_grid_size=grid_size,
        in0_block_w=2,
        per_core_M=tiles_per_shard,
        per_core_N=1024 // 32,
        out_subblock_h=1,
        out_subblock_w=1,
        fuse_batch=True,
        fused_activation=None,
        mcast_in0=False,
    )

    compute_kernel_config = ttl.tensor.WormholeComputeKernelConfig(
        math_fidelity=ttl.tensor.MathFidelity.HiFi4,
        math_approx_mode=True,
        fp32_dest_acc_en=False,
        packer_l1_acc=True,
    )

    tt_out = ttl.operations.primary.matmul(
        in0_tt,
        in1_tt,
        program_config=program_config,
        output_mem_config=dram_interleaved_memory_config,
        output_dtype=ttl.tensor.DataType.BFLOAT16,
        compute_kernel_config=compute_kernel_config,
    )

    out = tt2torch_tensor(tt_out)
    # No numerical check here; the test only exercises whether the op sequence
    # completes without hanging or seg faulting.
    passing = True
    assert passing
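
If this snippet is saved as its own pytest file inside the repo (so that the device fixture from the repo's conftest is picked up), it can be run with something like: pytest <path-to-file>::test_problematic_matmul. The path and fixture wiring here are assumptions based on the repo's usual test layout, not from the original comment.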

Commenting out the following made the test pass (incorrectly, of course, but without seg faults/hangs), even when run 100k times in a loop:

So this MM seems to be the problem.

However, I tried running the entire test (the first test attached in this file), and it still causes an irreparable seg fault even with this fix, though only about once in 30k runs (much less often), so it seems like there is another problem here as well. Will continue investigating.

tt-asaigal commented 4 months ago

I ran single-chip and 8-chip variants of both tests on a T3000. We were unable to reproduce the segfault on main. For the minimal test, what I see instead is a hang after either the first or second matmul for both variants.

For the larger test I see this message when running the second matmul: RuntimeError: Read 0xffffffff from PCIE: you should reset the board.

My machine gets bricked immediately after.

Given that we're running matmuls on an 8x8 core grid in both these tests, and that they pass if I remove those matmuls, this is strongly indicative of a di/dt issue.

SW-side workarounds to try (see the sketch after this list for what the first option implies for the repro test above):

  1. Reducing the grid size (let's start with 8x7).
  2. Reducing the sub-block size to make the MM slower.
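
For the first option, here is roughly what the arithmetic looks like against the repro test above. This is a sketch only; I'm assuming the shard height is simply recomputed for the smaller grid, not that this exact config was run.

import math

# Sketch: recompute the repro test's sharding parameters for an 8x7 grid.
grid_size = (8, 7)                                # was (8, 8)
num_cores = grid_size[0] * grid_size[1]           # 56 cores instead of 64
m_tiles = 18176 // 32                             # 568 row-tiles in in0
tiles_per_shard = math.ceil(m_tiles / num_cores)  # 11 row-tiles per core (was 9)
mm_activations_height_shard_spec = [tiles_per_shard * 32, 2 * 32]
# The shard grid's end CoreCoord becomes (7, 6), per_core_M in the program config
# becomes tiles_per_shard, and compute_with_storage_grid_size becomes (8, 7).
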
tt-asaigal commented 4 months ago

fyi @ttmtrajkovic, @davorchap and I think this could be another di/dt suspect.

pavlejosipovic commented 4 months ago

Running the test with an 8x7 grid resolved the issue. Also, @ttmtrajkovic set the N300 card to 900 MHz and that also made the issue go away for both UTs, so it seems likely to be di/dt related.

Another observation: the repro does not reproduce consistently on all machines. On bgd-lab-06 we couldn't get a repro, but on bgd-lab-07 the repro was instant, and both machines have the same spec with the same cards (nebula-x2, 2xN300).

pavlepopovic commented 4 months ago

Accidentally closed the issue; reopening it so that once this di/dt issue is resolved we can bring the grid/subblocks of these matmuls back up to 64 cores. Currently, setting the number of cores to 57 makes the problems go away.
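
For context on why 57 cores still works out shape-wise (my arithmetic, not from the thread): the activation has 18176 / 32 = 568 row-tiles, so 57 cores cover it with 10 tiles per shard, versus 9 tiles per shard on the full 64-core grid.

import math

m_tiles = 18176 // 32           # 568 row-tiles in the activation
print(math.ceil(m_tiles / 64))  # 9 tiles per shard on the 64-core grid (64 * 9 = 576 >= 568)
print(math.ceil(m_tiles / 57))  # 10 tiles per shard on the 57-core grid (57 * 10 = 570 >= 568)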

tt-rkim commented 3 months ago

@pavlepopovic how is this going?

pavlepopovic commented 3 months ago

Waiting for the di/dt resolutions before turning the subblocks and grid_size back to their proper values. Right now the subblocks are all (1, 1), with a 57-core grid. di/dt is currently being investigated by a number of people, who are running experiments and playing around with firmware to try to mitigate the issue.