tenstorrent / tt-mlir

Tenstorrent MLIR compiler
https://tenstorrent.github.io/tt-mlir/
Apache License 2.0

TTRT perf output has empty columns #552

Closed odjuricicTT closed 1 month ago

odjuricicTT commented 2 months ago

Running perf on the flatbuffer out.ttnn creates a CSV, ops_perf_results_2024_08_29_10_47_08.csv, which is missing a lot of columns. CORE COUNT and the DEVICE * DURATION columns are all empty.

@tapspatel Please take a look as without this we don't have any other way of measuring op execution time.
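A quick way to confirm which columns are blank in a generated report is sketched below. This is only an illustration using the standard csv module: the helper name empty_columns and the sample header names are assumptions, not part of ttrt, and real reports contain many more columns.

```python
import csv
import io


def empty_columns(csv_text):
    """Return the header names whose cells are blank in every data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    # Assume every column is empty, then drop any column that has
    # at least one non-blank value.
    empty = list(headers)
    for row in reader:
        empty = [h for h in empty if not (row.get(h) or "").strip()]
    return empty


# Made-up sample for illustration only (hypothetical column names).
sample = (
    "OP CODE,CORE COUNT,DEVICE FW DURATION [ns],HOST DURATION [ns]\n"
    "Subtract,,,1200\n"
)
print(empty_columns(sample))
```

To check a real report, read the file's text and pass it in, for example empty_columns(open(path).read()).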

odjuricicTT commented 2 months ago

@tapspatel This has become a blocker for Optimizer.

tapspatel commented 1 month ago

Taking a look at this issue today.

tapspatel commented 1 month ago

Dumping comments

Relevant files to look at:

third_party/tt-metal/src/tt-metal/tt_metal/tools/profiler/profiler.cpp
third_party/tt-metal/src/tt-metal/tt_metal/llrt/rtoptions.hpp
third_party/tt-metal/src/tt-metal/tt_metal/tools/profiler/kernel_profiler.hpp
third_party/tt-metal/src/tt-metal/tt_metal/jit_build/build.cpp
third_party/tt-metal/src/tt-metal/tt_metal/llrt/rtoptions.cpp

There is something that is corrupting L1, which is preventing the profiler from reading the correct results from device.

Two env flags need to be added: TT_METAL_CLEAR_L1=1 and TT_METAL_DEVICE_PROFILER=1. TT_METAL_DEVICE_PROFILER enables the device profiler. Set them in the current terminal, or in api.py as env_vars["TT_METAL_CLEAR_L1"] = "1" and env_vars["TT_METAL_DEVICE_PROFILER"] = "1".

The same settings can also be forced manually in the tt-metal sources:

third_party/tt-metal/src/tt-metal/tt_metal/llrt/rtoptions.cpp: set profiler_enabled = true
third_party/tt-metal/src/tt-metal/tt_metal/jit_build/build.cpp: set this->defines_ += "-DPROFILE_KERNEL=1 ";
third_party/tt-metal/src/tt-metal/tt_metal/tools/profiler/kernel_profiler.hpp: set #define PROFILE_KERNEL 1
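As a rough sketch of the env-var approach, the two flags can be added to a copy of the environment before a process is launched. The helper name profiler_env is hypothetical; only the two variable names and their values come from this thread.

```python
import os


def profiler_env(base=None):
    """Copy an environment mapping and add the two profiler flags."""
    env = dict(os.environ if base is None else base)
    env["TT_METAL_CLEAR_L1"] = "1"
    env["TT_METAL_DEVICE_PROFILER"] = "1"
    return env


# Start from an empty mapping here so the output shows only the two flags.
env_vars = profiler_env({})
print(env_vars)
```

The resulting mapping can be passed as the env argument of subprocess.run when launching ttrt from a script.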

The main place to look is readRiscProfilerResults in third_party/tt-metal/src/tt-metal/tt_metal/tools/profiler/profiler.cpp. This for loop

for (auto riscEndIndex : riscEndIndices) {

contains a dumpResultToFile call that needs to execute, but the else block that holds it is never reached. That is the problem: something is corrupting L1 before the results can be extracted.

You know you did this correctly if you see profile_log_device.csv getting generated in either /code/temp/tt-mlir/third_party/tt-metal/src/tt-metal/generated/profiler/.logs/ or /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/generated/profiler/.logs/
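The check can be scripted along these lines. The two directories and the file name are the ones mentioned above; the helper name find_device_log and the extra "non-empty" condition are assumptions added for illustration.

```python
from pathlib import Path

LOG_NAME = "profile_log_device.csv"
CANDIDATE_DIRS = [
    "/code/temp/tt-mlir/third_party/tt-metal/src/tt-metal/generated/profiler/.logs/",
    "/opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/generated/profiler/.logs/",
]


def find_device_log(dirs=CANDIDATE_DIRS, name=LOG_NAME):
    """Return the first existing, non-empty log file, or None."""
    for d in dirs:
        candidate = Path(d) / name
        # Assumption: an empty file is treated the same as a missing one.
        if candidate.is_file() and candidate.stat().st_size > 0:
            return candidate
    return None


# Prints the path of the log if one was generated, otherwise None.
print(find_device_log())
```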

tapspatel commented 1 month ago

Repro commands:

./build/bin/ttmlir-opt --ttir-load-system-desc="system-desc-path=/code/temp/tt-mlir/ttrt-artifacts/system_desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir 
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
TT_METAL_CLEAR_L1=1 TT_METAL_DEVICE_PROFILER=1 ttrt perf out.ttnn --save-artifacts --clean-artifacts --device

(you can make the changes listed in the above comment as well)

tapspatel commented 1 month ago

Slowing down our op execution with sleeps around the run call doesn't help:

std::this_thread::sleep_for(std::chrono::seconds(10));
run(op, device, liveTensors, tensorPool);
std::this_thread::sleep_for(std::chrono::seconds(10));

tapspatel commented 1 month ago

@nsmithtt There is some L1 corruption happening that's related to our code. All the build + files seem to be generated correctly.

tapspatel commented 1 month ago

Full repro commands:

source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
cmake --build build
cmake --build build -- ttrt

ttrt query --save-artifacts
export SYSTEM_DESC_PATH=/path/to/generate/system-desc.ttsys  # path produced by the previous step
./build/bin/ttmlir-opt --ttir-load-system-desc="path=/path/to/generate/system-desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir 
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn

TT_METAL_CLEAR_L1=1 TT_METAL_DEVICE_PROFILER=1 ttrt perf out.ttnn --save-artifacts --clean-artifacts --device

You can also add the following changes to your build:

in runtime/tools/python/ttrt/common/api.py

env_vars["TT_METAL_DEVICE_PROFILER"] = "1"
env_vars["TT_METAL_CLEAR_L1"] = "1"

in third_party/CMakeLists.txt, add these 2 lines in ExternalProject_Add( tt-metal

-DENABLE_TRACY=${TT_RUNTIME_ENABLE_PERF_TRACE}
-DTRACY_ENABLE=${TT_RUNTIME_ENABLE_PERF_TRACE}

Everything else can be modified in third_party/tt-metal/src/tt-metal files.

All the performance reports get dumped here: /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/generated/profiler/.logs/
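For repeated runs, the compile and perf steps of the repro above can be assembled in a small script. This sketch only builds and prints the command lines, using the flags exactly as written in this comment; the helper name repro_commands is hypothetical and nothing is executed.

```python
import shlex


def repro_commands(system_desc, mlir_in="test/ttmlir/Dialect/TTNN/simple_subtract.mlir"):
    """Return the three repro steps as argv lists (nothing is executed)."""
    return [
        [
            "./build/bin/ttmlir-opt",
            f"--ttir-load-system-desc=path={system_desc}",
            "--ttir-to-ttnn-backend-pipeline",
            mlir_in,
            "-o", "ttnn.mlir",
        ],
        ["./build/bin/ttmlir-translate", "--ttnn-to-flatbuffer", "ttnn.mlir", "-o", "out.ttnn"],
        ["ttrt", "perf", "out.ttnn", "--save-artifacts", "--clean-artifacts", "--device"],
    ]


# Print each step as a shell-quoted command line.
for argv in repro_commands("/path/to/generate/system-desc.ttsys"):
    print(shlex.join(argv))
```

Each argv list could be handed to subprocess.run, with TT_METAL_CLEAR_L1=1 and TT_METAL_DEVICE_PROFILER=1 set in the environment for the ttrt perf step.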

tapspatel commented 1 month ago

continuing work on this

tapspatel commented 1 month ago

After further investigation, we are missing this closing call: ::tt::tt_metal::detail::DumpDeviceProfileResults(&ttnn_device);

This call is made as part of the pytest conftest in metal. Adding it to runtime.cpp in mlir causes the device to hang, so there is another place where it's not cleaning up correctly.

tapspatel commented 1 month ago

Fixed in above PR

Example usage (remember to use --device flag)

ttrt perf /code/tt-mlir/build/test/ttmlir/Silicon/TTNN/Output/simple_matmul.mlir.tmp.ttnn --save-artifacts --clean-artifacts --device &> debug.log

nsmithtt commented 1 month ago

Example usage (remember to use --device flag)

Maybe we should flip the polarity here? Maybe let's have a --host-only flag because I think 90% of the time we'll be more interested in device perf.

odjuricicTT commented 1 month ago

Example usage (remember to use --device flag)

Maybe we should flip the polarity here? Maybe let's have a --host-only flag because I think 90% of the time we'll be more interested in device perf.

+1

tapspatel commented 1 month ago

agreed, changed