TRISCs running out of `.TEXT` space

mo-tenstorrent commented 4 months ago

Further optimize profiler code size to reduce the likelihood of this? on WH:

./tt_metal/tools/profiler/profile_this.py -c "pytest -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention.py::test_sdpa_tt[1-8-1-2048-128-k128-q128-bf16]"

Has TRISC1 run out of text space.

/home/mmemarian/tt-metal/tt_metal/third_party/sfpi/compiler/bin/../lib/gcc/riscv32-unknown-elf/10.2.0/../../../../riscv32-unknown-elf/bin/ld: /home/mmemarian/tt-metal/built/3072/kernels/sdpa/2957817348340386477/trisc1/trisc1.elf section `.text' will not fit in region `TRISC1_CODE'
/home/mmemarian/tt-metal/tt_metal/third_party/sfpi/compiler/bin/../lib/gcc/riscv32-unknown-elf/10.2.0/../../../../riscv32-unknown-elf/bin/ld: region `TRISC1_CODE' overflowed by 148 bytes
collect2: error: ld returned 1 exit status
                 Always | FATAL    | trisc1 build failed
                 Always | FATAL    | Failed to generate binaries for sdpa TT_THROW @ ../tt_metal/jit_build/build.cpp:391: tt::exception
info:
trisc1 build failed
backtrace:
 --- /home/mmemarian/tt-metal/build/lib/libtt_metal.so(+0xaa581) [0x7f4e15f75581]
 --- tt::tt_metal::JitBuildState::link(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
 --- tt::tt_metal::JitBuildState::build(tt::tt_metal::JitBuildSettings const*) const
 --- /home/mmemarian/tt-metal/build/lib/libtt_metal.so(+0xb0ff5) [0x7f4e15f7bff5]
 --- /home/mmemarian/tt-metal/build/lib/libtt_metal.so(+0xb0f72) [0x7f4e15f7bf72]
 --- /home/mmemarian/tt-metal/build/lib/libtt_metal.so(+0xb0ecf) [0x7f4e15f7becf]
 --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x114df) [0x7f4eba12b4df]
 --- /home/mmemarian/tt-metal/build/lib/libtt_metal.so(+0xb0b4c) [0x7f4e15f7bb4c]
 --- /home/mmemarian/tt-metal/build/lib/libtt_metal.so(+0xb5311) [0x7f4e15f80311]
 --- /home/mmemarian/tt-metal/build/lib/libtt_metal.so(+0xb3f09) [0x7f4e15f7ef09]
 --- /home/mmemarian/tt-metal/build/lib/libtt_metal.so(+0xb3298) [0x7f4e15f7e298]
 --- /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f4e6316ddf4]
 --- /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f4eba122609]

Note that test is skipped for profiler OOM and low PCC

The unit test:

TT_METAL_DEVICE_PROFILER=1 pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode_gqa.py::test_sdpa_decode[1-32-8-32768-128-grid_size0-True-kv_bfp8]

can be used to replicated the issue as well. Note that the test is skipped by default on main.

mo-tenstorrent commented 4 months ago

@cglagovichTT @kevinmiTT11 @yugaoTT

mo-tenstorrent commented 4 months ago

Reducing kernel size can be a quick fix

mtairum commented 2 months ago

Also seeing this on latest main 5ad4b2b6on Mixtral. On SDPA as well.

Weirdly for 4 configurations of Mixtral decode (with different KV-cache lengths) only one is giving the issue.

[wormhole_b0-True-32-150-0.085]
[wormhole_b0-True-128-150-0.085] <- Fails
[wormhole_b0-True-1024-150-0.085]
[wormhole_b0-True-2048-150-0.085]

To replicate:

python -m tracy -v -r -p -o mixtral_ttnn -m pytest models/demos/t3000/mixtral8x7b/tests/test_mixtral_perf.py::test_mixtral_model_perf[wormhole_b0-True-128-150-0.085]

If on CI machine, weights can be found in /mnt/MLPerf/tt_dnn-models/Mistral/Mixtral-8x7B-v0.1/.

Just export these 3 flags.

export DEFAULT_CKPT_DIR="/mnt/MLPerf/tt_dnn-models/Mistral/Mixtral-8x7B-v0.1/"
export DEFAULT_TOKENIZER_PATH="/mnt/MLPerf/tt_dnn-models/Mistral/Mixtral-8x7B-v0.1/"
export DEFAULT_CACHE_PATH="/mnt/MLPerf/tt_dnn-models/Mistral/Mixtral-8x7B-v0.1/"

cglagovichTT commented 2 months ago

Depending on the compile time args, the compiler will make different optimizations in loop unrolling etc. so the kernel size is fairly unpredictable

yieldthought commented 2 months ago

This is also blocking llama 3.1 8b n150 performance optimizations: python -m tracy -r -p -v -m pytest models/demos/wormhole/llama31_8b/tests/test_llama_perf.py fails for all 3 sizes (32, 128, 1024) with profiling enabled.

/localdev/moconnor/tt-metal/tt_metal/hw/firmware/src/trisck.cc:38:3: note: in expansion of macro 'DeviceZoneScopedMainChildN'
   38 |   DeviceZoneScopedMainChildN("TRISC-KERNEL");
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~
/localdev/moconnor/tt-metal/tt_metal/third_party/sfpi/compiler/bin/../lib/gcc/riscv32-unknown-elf/10.2.0/../../../../riscv32-unknown-elf/bin/ld: /localdev/moconnor/tt-metal/built/4096/kernels/sdpa_flash_decode/4042939882799244274/trisc0/trisc0.elf section `.text' will not fit in region `TRISC0_FIRMWARE_CODE'
/localdev/moconnor/tt-metal/tt_metal/third_party/sfpi/compiler/bin/../lib/gcc/riscv32-unknown-elf/10.2.0/../../../../riscv32-unknown-elf/bin/ld: region `TRISC0_FIRMWARE_CODE' overflowed by 148 bytes
collect2: error: ld returned 1 exit status

Boosted to P1, but this is a hair away from a P0 if we don't find a workarond imo.

mo-tenstorrent commented 2 months ago

Is the SPDA op itself being optimized or is it just used in the model and is blocking report generation for the model because of the overflow?

mtairum commented 2 months ago

@mo-tenstorrent Ideally both, but the latter should be fine.

In this case I think we can just go around the SDPA op, and avoid tracing it.

Using something like this should be enough, right?

from tracy import Profiler
profiler = Profiler()

profiler.enable()
function_under_test()
profiler.disable()

yieldthought commented 2 months ago

I mean in this case we want to profile the model but an important part of the model performance is the attention op...

On Wed, 28 Aug 2024 at 17:25, Miguel Tairum @.***> wrote:

@mo-tenstorrent https://github.com/mo-tenstorrent Ideally both, but the latter should be fine.

In this case I think we can just go around the SDPA op, and avoid tracing it.

Using something like this should be enough, right?

from tracy import Profiler profiler = Profiler()

profiler.enable() function_under_test() profiler.disable()

— Reply to this email directly, view it on GitHub https://github.com/tenstorrent/tt-metal/issues/9028#issuecomment-2315668831, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHHDXI3KBLJ64M277TK3X3ZTXTXJAVCNFSM6AAAAABITL66ZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJVGY3DQOBTGE . You are receiving this because you commented.Message ID: @.***>

mo-tenstorrent commented 2 months ago

I see, unfortunately as-is the enable/disable calls only work on the python side profiling. Device side profiling is on for the entire duration of the process.

Something along those lines need to be put together. As Pavle was suggesting, SPDA might be impossible to profile no matter how thin we make the profiler code. But the entire model should not be blocked to be profiled.

Let me see what is the cleanest/fastest way to this enable/disable for device side profiling.

mo-tenstorrent commented 2 months ago

diff --git a/tt_metal/hw/inc/wormhole/dev_mem_map.h b/tt_metal/hw/inc/wormhole/dev_mem_map.h
index a4b8489cf0..3649988f80 100644
--- a/tt_metal/hw/inc/wormhole/dev_mem_map.h
+++ b/tt_metal/hw/inc/wormhole/dev_mem_map.h
@@ -48,9 +48,9 @@
 #define MEM_BRISC_FIRMWARE_SIZE (10 * 1024)
 #define MEM_NCRISC_FIRMWARE_SIZE (16 * 1024)
 #define MEM_IERISC_FIRMWARE_SIZE (16 * 1024)
-#define MEM_TRISC0_FIRMWARE_SIZE (16 * 1024)
-#define MEM_TRISC1_FIRMWARE_SIZE (16 * 1024)
-#define MEM_TRISC2_FIRMWARE_SIZE (16 * 1024)
+#define MEM_TRISC0_FIRMWARE_SIZE (17 * 1024)
+#define MEM_TRISC1_FIRMWARE_SIZE (17 * 1024)
+#define MEM_TRISC2_FIRMWARE_SIZE (17 * 1024)
 #define MEM_ZEROS_SIZE 512

 #define MEM_BOOT_CODE_BASE 0

@mtairum @yieldthought Could you give the above diff a try. The failing unit test is passing with this increase. It is increasing trisc FW size by a 1KB. Certainly not a permanent solution.

mtairum commented 1 month ago

@mo-tenstorrent Tested on Mixtral.

It's passing, but at the very end, when importing data for device I get the error:

2024-08-29 10:04:51.023 | INFO     | tt_metal.tools.profiler.process_device_log:core_to_device_timeseries:360 - Importing Data For Device Number : 3

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mtairum/tt-metal-push/ttnn/tracy/__main__.py", line 166, in <module>
    main()
  File "/home/mtairum/tt-metal-push/ttnn/tracy/__main__.py", line 150, in main
    generate_report(options.output_folder, options.name_append, options.child_functions)
  File "/home/mtairum/tt-metal-push/ttnn/tracy/__init__.py", line 166, in generate_report
    process_ops(outFolder, nameAppend, True)
  File "/home/mtairum/tt-metal-push/tt_metal/tools/profiler/process_ops_logs.py", line 534, in process_ops
    deviceOps = append_device_data(ops, PROFILER_LOGS_DIR)
  File "/home/mtairum/tt-metal-push/tt_metal/tools/profiler/process_ops_logs.py", line 193, in append_device_data
    assert len(deviceOps[device]) == len(
AssertionError: Device data mismatch. Expected 53 but received 56 ops on device 6

yieldthought commented 1 month ago

The above patch works for llama 3.1 8b (models/demos/wormhole/llama31_8b/tests/test_llama_perf.py::test_llama_model_perf[32-6-0.185]) - thanks @mo-tenstorrent, I now have a workaround to get profiling traces while we look for a more permanent solution to enable profiling to work on main again.

mo-tenstorrent commented 1 month ago

@mtairum I am suspecting this issue is actually caused by something else. This model hadn't been profiled for a while due to the TRISC overflow issue right?

As per our conversation on slack, whenever you had a chance to send me the profiler log files, it would be great.

@yieldthought That is good to hear. Yeah I am working with the runtime team and part of their plan is to separate FW and Kernel locations. With that separation, we can guarantee generating op report regardless of how big the kernel for the OP gets.

mo-tenstorrent commented 1 month ago

@mtairum I just confirmed the issue you are seeing is different from the trisc overflow I will put in a fix tomorrow.

tenstorrent / tt-metal

TRISCs running out of `.TEXT` space #9028