tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Seg fault in falcon7b prefill optimised attention #8644

Open · pavlepopovic opened 4 months ago

pavlepopovic commented 4 months ago

This is one of the checkpoints from #8349. Upon turning on optimised attention for falcon7b prefill, we discovered a segmentation fault when it is run on 1k or 2k sequence lengths. It occurs on a single-device N300 and on an 8-chip T3000; N150 does not reproduce the issue. It is possible to produce a unit test that contains the sequence of ops causing the issue (I2Spartial -> MM -> Softmax -> MM -> S2InterleavedPartial, run in multiple loops). When the segmentation fault occurs, warm reset does not work and throws the following error:

 Starting pci link reset on WH devices at pci indices: 0, 1, 2, 3
Traceback (most recent call last):
  File "/opt/tt_metal_infra/provisioning/provisioning_env/bin/tt-smi", line 8, in <module>
    sys.exit(main())
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 672, in main
    pci_board_reset(args.reset, reinit=True)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi_backend.py", line 582, in pci_board_reset
    reset_wh_boards(reset_wh_pci_idx)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/resets/wh_resets.py", line 271, in reset_wh_boards
    reset_devices = WHChipReset().full_lds_reset(pci_interfaces=boards_to_reset)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_tools_common/wh_reset.py", line 69, in full_lds_reset
    chip.arc_msg(self.MSG_TYPE_ARC_STATE3, wait_for_done=True)
Exception: Read 0xffffffff from ARC scratch[6]: you should reset the board.

Furthermore, sudo reboot also does not bring the chips back to a workable state, and the following error is thrown when it is attempted:

thread '<unnamed>' panicked at crates/pyluwen/src/lib.rs:492:70:
called `Result::unwrap()` on an `Err` value: DeviceOpenFailed { id: 3, source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/opt/tt_metal_infra/provisioning/provisioning_env/bin/tt-smi", line 8, in <module>
    sys.exit(main())
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 672, in main
    pci_board_reset(args.reset, reinit=True)
  File "/opt/tt_metal_infra/provisioning/provisioning_env/lib/python3.8/site-packages/tt_smi/tt_smi_backend.py", line 561, in pci_board_reset
    chip = PciChip(pci_interface=pci_idx)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: DeviceOpenFailed { id: 3, source: Os { code: 2, kind: NotFound, message: "No such file or directory" } }

The only way known to me to bring the chips back to a workable state is the following procedure (T3000).

Attaching a file with unit tests (rename it to .py when running locally, as GitHub doesn't allow .py attachments): test_seg_fault.txt

There are two unit tests in that file:

The first unit test is an almost copy-pasted sequence of ops from falcon7b attention.

To run it, use the following commands:

Comments:

There is also a simplified unit test in this file. It contains the same ops as the original test, but a sync point via ttlib.device.Synchronize() is added after each op is called (with a print when the sync point begins and ends), and there are no loops. Here's how to run it:

Comments:

  1. Multi chip run:

    • This always causes a seg fault, and it always happens after the sync point that follows the first op.
  2. Single chip run:

    • This always causes a hang, and it always happens when the sync point for the 3rd op is called. Upon cancelling the test while it hangs, the machine goes into the same bad state as described above (not resettable except via power cycle).
    • This is reproducible on a single-chip N300 or on T3000.
    • Running with slow dispatch (on N300, or on T3K) does not make the hang go away.

The second test is IMO easier to debug with, as the seg fault always happens at a deterministic point.
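
For reference, the sync-point pattern described above boils down to something like the sketch below. This is an illustration of the idea, not the attached test; the run_synced helper, the print format, the "import tt_lib as ttl" alias, and the exact Synchronize signature are my assumptions.

import tt_lib as ttl

def run_synced(device, name, op, *args, **kwargs):
    # Run a single op, then force a device sync with prints around it,
    # so the console shows exactly which op the hang/seg fault happens on.
    out = op(*args, **kwargs)
    print(f"{name}: sync start")
    ttl.device.Synchronize(device)
    print(f"{name}: sync end")
    return out

Wrapping each op in the simplified test this way, with no loops, is what makes the failure point deterministic.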

Upon hitting a seg fault, inspecting a core dump reveals the stack trace below (it doesn't make sense to me, as it contains CloseDevices(); maybe that is some exception handler called as a reaction to something?):

program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f12994c2740 (LWP 3560))]
(gdb) bt
#0  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  <signal handler called>
#2  0x00007f1227d23ffa in memcpy_to_device (dest=0xe3418020, src=0x7ffdb26a5df0, num_bytes=48) at /home/ppopovic/tt-metal/tt_metal/third_party/umd/device/tt_silicon_driver.cpp:912
#3  0x00007f1227d24875 in write_block (dev=0x67aa100, byte_addr=3812720672, num_bytes=48, buffer_addr=0x7ffdb26a5df0 "", dma_buf_size=0) at /home/ppopovic/tt-metal/tt_metal/third_party/umd/device/tt_silicon_driver.cpp:1057
#4  0x00007f1227d358fa in tt_SiliconDevice::write_device_memory (this=0x6ef5570, mem_ptr=0x7ffdb26a5df0, size_in_bytes=48, target=..., address=98336, fallback_tlb="LARGE_WRITE_TLB") at /home/ppopovic/tt-metal/tt_metal/third_party/umd/device/tt_silicon_driver.cpp:2108
#5  0x00007f1227d4c501 in tt_SiliconDevice::write_to_device (this=0x6ef5570, mem_ptr=0x7ffdb26a5df0, size=48, core=..., addr=98336, fallback_tlb="LARGE_WRITE_TLB", send_epoch_cmd=false, last_send_epoch_cmd=true, ordered_with_prev_remote_write=false) at /home/ppopovic/tt-metal/tt_metal/third_party/umd/device/tt_silicon_driver.cpp:4326
#6  0x00007f1228467dd0 in tt::Cluster::write_core (this=0x7f12287e0ba0 <tt::Cluster::instance()::inst>, mem_ptr=0x7ffdb26a5df0, sz_in_bytes=48, core=..., addr=98336, small_access=false) at ../tt_metal/llrt/tt_cluster.cpp:382
#7  0x00007f122846bc3e in tt::Cluster::set_internal_routing_info_for_ethernet_cores (this=0x7f12287e0ba0 <tt::Cluster::instance()::inst>, enable_internal_routing=false) at ../tt_metal/llrt/tt_cluster.cpp:724
#8  0x00007f122862d622 in tt::tt_metal::detail::CloseDevices (devices=Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
std::map with 8 elements) at ../tt_metal/tt_metal.cpp:177
#9  0x00007f122a6f0cb4 in pybind11::detail::argument_loader<std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > > >::call_impl<void, void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), 0ul, pybind11::detail::void_type>(void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), std::integer_sequence<unsigned long, 0ul>, pybind11::detail::void_type&&) && (this=0x7ffdb26a5fc0, 
    f=@0x5ef8118: 0x7f122862d5f1 <tt::tt_metal::detail::CloseDevices(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >)>) at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/cast.h:1443
#10 0x00007f122a6bb754 in pybind11::detail::argument_loader<std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > > >::call<void, pybind11::detail::void_type, void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >)>(void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >)) && (this=0x7ffdb26a5fc0, 
    f=@0x5ef8118: 0x7f122862d5f1 <tt::tt_metal::detail::CloseDevices(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >)>) at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/cast.h:1417
#11 0x00007f122a67f7c8 in pybind11::cpp_function::initialize<void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), void, std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >, pybind11::name, pybind11::scope, pybind11::sibling, char [627]>(void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), void (*)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [627])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const (this=0x0, call=...) at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/pybind11.h:248
#12 0x00007f122a67fa56 in pybind11::cpp_function::initialize<void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), void, std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >, pybind11::name, pybind11::scope, pybind11::sibling, char [627]>(void (*&)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), void (*)(std::map<int, tt::tt_metal::Device*, std::less<int>, std::allocator<std::pair<int const, tt::tt_metal::Device*> > >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [627])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/pybind11.h:223
#13 0x00007f122a5ad3f6 in pybind11::cpp_function::dispatcher (self=0x7f122ad23d80, args_in=0x7f1222456340, kwargs_in=0x0) at /home/ppopovic/tt-metal/tt_metal/third_party/pybind11/include/pybind11/pybind11.h:939
pavlepopovic commented 4 months ago

fyi @pavlejosipovic @jvasilje

pavlepopovic commented 4 months ago

Update: I've narrowed down the problem to a 1d matmul with height-sharded input0, with fuse_batch=True and mcast_in0=False. Here's the code that causes a hang/seg fault on an N300 machine:

# Imports assumed from this repo's usual test conventions (not in the original snippet):
import pytest
import torch

import tt_lib as ttl
from models.utility_functions import torch2tt_tensor, tt2torch_tensor


@pytest.mark.parametrize("num_cores", [64])
def test_problematic_matmul(device, num_cores):
    compute_grid_size = device.compute_with_storage_grid_size()
    if num_cores > (compute_grid_size.x * compute_grid_size.y):
        pytest.skip(f"Need {num_cores} cores to run this test but core grid is {compute_grid_size}")
    grid_size = (8, 8)

    in0_shape = [1, 1, 18176, 64]
    in1_shape = [1, 1, 64, 1024]

    torch_in0 = torch.randn(in0_shape).bfloat16().float()
    torch_in1 = torch.randn(in1_shape).bfloat16().float()

    dram_interleaved_memory_config = ttl.tensor.MemoryConfig(
        memory_layout=ttl.tensor.TensorMemoryLayout.INTERLEAVED,
        buffer_type=ttl.tensor.BufferType.DRAM,
    )

    height_sharded_memory_config = ttl.tensor.MemoryConfig(
        memory_layout=ttl.tensor.TensorMemoryLayout.HEIGHT_SHARDED, buffer_type=ttl.tensor.BufferType.L1
    )

    tiles_per_shard = 9
    mm_activations_height_shard_spec = [tiles_per_shard * 32, 2 * 32]

    in0_mem_config = ttl.tensor.MemoryConfig(
        ttl.tensor.TensorMemoryLayout.HEIGHT_SHARDED,
        ttl.tensor.BufferType.L1,
        ttl.tensor.ShardSpec(
            ttl.tensor.CoreRangeSet(
                {
                    ttl.tensor.CoreRange(
                        ttl.tensor.CoreCoord(0, 0),
                        ttl.tensor.CoreCoord(7, 7),
                    ),
                }
            ),
            mm_activations_height_shard_spec,
            ttl.tensor.ShardOrientation.ROW_MAJOR,
            False,
        ),
    )

    in0_tt = torch2tt_tensor(
        torch_in0,
        device,
        tt_memory_config=in0_mem_config,
        tt_dtype=ttl.tensor.DataType.BFLOAT16,
    )

    in1_tt = torch2tt_tensor(
        torch_in1,
        device,
        tt_memory_config=dram_interleaved_memory_config,
        tt_dtype=ttl.tensor.DataType.BFLOAT16,
    )

    program_config = ttl.operations.primary.MatmulMultiCoreReuseMultiCast1DProgramConfig(
        compute_with_storage_grid_size=grid_size,
        in0_block_w=2,
        per_core_M=tiles_per_shard,
        per_core_N=1024 // 32,
        out_subblock_h=1,
        out_subblock_w=1,
        fuse_batch=True,
        fused_activation=None,
        mcast_in0=False,
    )

    compute_kernel_config = ttl.tensor.WormholeComputeKernelConfig(
        math_fidelity=ttl.tensor.MathFidelity.HiFi4,
        math_approx_mode=True,
        fp32_dest_acc_en=False,
        packer_l1_acc=True,
    )

    tt_out = ttl.operations.primary.matmul(
        in0_tt,
        in1_tt,
        program_config=program_config,
        output_mem_config=dram_interleaved_memory_config,
        output_dtype=ttl.tensor.DataType.BFLOAT16,
        compute_kernel_config=compute_kernel_config,
    )

    out = tt2torch_tensor(tt_out)
    # No numerical check here; the test only exercises whether the op sequence
    # completes without hanging or seg faulting.
    passing = True
    assert passing
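
If this snippet is saved as its own pytest file inside the repo (so that the device fixture from the repo's conftest is picked up), it can be run with something like: pytest <path-to-file>::test_problematic_matmul. The path and fixture wiring here are assumptions based on the repo's usual test layout, not from the original comment.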

Commenting out the following made the test pass (incorrectly, of course, but without seg faults/hangs), even when run 100k times in a loop:

So this MM seems to be the problem.

However, I tried running the entire test (the first test attached in this file), and it still causes an irreparable seg fault even with this fix, though only about once in 30k runs (much less often), so it seems like there is another problem here as well. Will continue investigating.

tt-asaigal commented 4 months ago

I ran single-chip and 8-chip variants of both tests on a T3000. We were unable to reproduce the segfault on main. For the minimal test, what I see instead is a hang after either the first or second matmul for both variants.

For the larger test I see this message when running the second matmul: RuntimeError: Read 0xffffffff from PCIE: you should reset the board.

My machine gets bricked immediately after.

Given that we're running matmuls on an 8x8 core grid in both these tests, and that they pass if I remove those matmuls, this is strongly indicative of a di/dt issue.

SW-side workarounds to try (see the sketch after this list for what the first option implies for the repro test above):

  1. Reducing the grid size (let's start with 8x7).
  2. Reducing the sub-block size to make the MM slower.
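
For the first option, here is roughly what the arithmetic looks like against the repro test above. This is a sketch only; I'm assuming the shard height is simply recomputed for the smaller grid, not that this exact config was run.

import math

# Sketch: recompute the repro test's sharding parameters for an 8x7 grid.
grid_size = (8, 7)                                # was (8, 8)
num_cores = grid_size[0] * grid_size[1]           # 56 cores instead of 64
m_tiles = 18176 // 32                             # 568 row-tiles in in0
tiles_per_shard = math.ceil(m_tiles / num_cores)  # 11 row-tiles per core (was 9)
mm_activations_height_shard_spec = [tiles_per_shard * 32, 2 * 32]
# The shard grid's end CoreCoord becomes (7, 6), per_core_M in the program config
# becomes tiles_per_shard, and compute_with_storage_grid_size becomes (8, 7).
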
tt-asaigal commented 4 months ago

fyi @ttmtrajkovic, @davorchap and I think this could be another di/dt suspect.

pavlejosipovic commented 4 months ago

Running the test with an 8x7 grid resolved the issue. Also, @ttmtrajkovic set the N300 card to 900 MHz and that also made the issue go away for both UTs, so it seems likely to be di/dt related.

Another observation: the repro does not reproduce consistently on all machines. On bgd-lab-06 we couldn't get a repro, but on bgd-lab-07 the repro was instant, and both machines have the same spec with the same cards (nebula-x2, 2xN300).

pavlepopovic commented 4 months ago

Accidentally closed the issue; reopening it so that once this di/dt issue is resolved we can bring the grid/subblocks of these matmuls back up to 64 cores. Currently, setting the number of cores to 57 makes the problems go away.
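
For context on why 57 cores still works out shape-wise (my arithmetic, not from the thread): the activation has 18176 / 32 = 568 row-tiles, so 57 cores cover it with 10 tiles per shard, versus 9 tiles per shard on the full 64-core grid.

import math

m_tiles = 18176 // 32           # 568 row-tiles in the activation
print(math.ceil(m_tiles / 64))  # 9 tiles per shard on the 64-core grid (64 * 9 = 576 >= 568)
print(math.ceil(m_tiles / 57))  # 10 tiles per shard on the 57-core grid (57 * 10 = 570 >= 568)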

tt-rkim commented 3 months ago

@pavlepopovic how is this going?

pavlepopovic commented 3 months ago

Waiting for the di/dt resolutions before turning the subblocks and grid_size back to their proper values. Right now the subblocks are all (1, 1), with a 57-core grid. di/dt is currently being investigated by a number of people, who are running experiments and playing around with firmware to try to mitigate the issue.