odjuricicTT closed this issue 1 month ago
@tapspatel This has become a blocker for Optimizer.
Taking a look at this issue today.
Dumping comments
Relevant files to look at:
third_party/tt-metal/src/tt-metal/tt_metal/tools/profiler/profiler.cpp
third_party/tt-metal/src/tt-metal/tt_metal/llrt/rtoptions.hpp
third_party/tt-metal/src/tt-metal/tt_metal/tools/profiler/kernel_profiler.hpp
third_party/tt-metal/src/tt-metal/tt_metal/jit_build/build.cpp
third_party/tt-metal/src/tt-metal/tt_metal/llrt/rtoptions.cpp
Something is corrupting L1, which prevents the profiler from reading the correct results back from the device.
Two env flags need to be set: TT_METAL_CLEAR_L1=1 and TT_METAL_DEVICE_PROFILER=1.
TT_METAL_DEVICE_PROFILER: this enables the device profiler. Set it in the current terminal, or in api.py via env_vars["TT_METAL_CLEAR_L1"] = "1" and env_vars["TT_METAL_DEVICE_PROFILER"] = "1". Alternatively, profiler_enabled = true can be set manually in third_party/tt-metal/src/tt-metal/tt_metal/llrt/rtoptions.cpp.
third_party/tt-metal/src/tt-metal/tt_metal/jit_build/build.cpp can have this->defines_ += "-DPROFILE_KERNEL=1 "; set manually, and third_party/tt-metal/src/tt-metal/tt_metal/tools/profiler/kernel_profiler.hpp can have #define PROFILE_KERNEL 1 set manually.
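A minimal sketch of how those variables might be set programmatically before launching ttrt. The env_vars dict name follows the api.py snippet quoted in this thread; the subprocess launch at the end is illustrative only:

```python
import os

# Build the environment for the ttrt subprocess. Both flags come from the
# comment above: TT_METAL_CLEAR_L1 clears L1 before the run, and
# TT_METAL_DEVICE_PROFILER enables the device profiler.
env_vars = dict(os.environ)
env_vars["TT_METAL_CLEAR_L1"] = "1"
env_vars["TT_METAL_DEVICE_PROFILER"] = "1"

# Illustrative only -- pass the augmented environment to the subprocess, e.g.:
# subprocess.run(["ttrt", "perf", "out.ttnn", "--device"], env=env_vars)
```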
The main place to look is third_party/tt-metal/src/tt-metal/tt_metal/tools/profiler/profiler.cpp, specifically readRiscProfilerResults. This for loop
for (auto riscEndIndex : riscEndIndices) {
contains a dumpResultToFile call (in the else block) that needs to execute, but it never does. That is the problem: something is corrupting L1 before the results can be extracted.
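To make the symptom concrete, here is a toy Python model (not the actual tt-metal C++ code) of the described control flow: the loop walks per-RISC end indices and only takes the dump branch when the control data read back from L1 is intact, so corrupted L1 means the dump never happens. All names and the marker value are illustrative:

```python
def read_risc_profiler_results(l1_buffer, risc_end_indices, valid_marker=0xBEEF):
    """Toy model: dump results only for RISCs whose L1 control word is intact."""
    dumped = []
    for risc_end_index in risc_end_indices:
        marker = l1_buffer.get(risc_end_index, 0)
        if marker != valid_marker:
            # Corrupted or cleared L1: skip this RISC, nothing is dumped.
            continue
        else:
            # In the real code, this is the branch where dumpResultToFile runs.
            dumped.append(risc_end_index)
    return dumped

# Intact L1: every RISC's results get dumped.
intact = {0: 0xBEEF, 1: 0xBEEF}
# Corrupted L1: the dump branch never executes, matching the observed bug.
corrupted = {0: 0x0000, 1: 0x0000}
```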
You know you did this correctly if you see profile_log_device.csv generated in either /code/temp/tt-mlir/third_party/tt-metal/src/tt-metal/generated/profiler/.logs/ or /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/generated/profiler/.logs/
Repro commands:
./build/bin/ttmlir-opt --ttir-load-system-desc="system-desc-path=/code/temp/tt-mlir/ttrt-artifacts/system_desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
TT_METAL_CLEAR_L1=1 TT_METAL_DEVICE_PROFILER=1 ttrt perf out.ttnn --save-artifacts --clean-artifacts --device
(you can make the changes listed in the above comment as well)
Slowing down our op execution doesn't help:
std::this_thread::sleep_for(std::chrono::seconds(10));
run(op, device, liveTensors, tensorPool);
std::this_thread::sleep_for(std::chrono::seconds(10));
@nsmithtt there is some L1 corruption happening that's related to our code. The build and generated files all look correct.
Full repro commands:
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
cmake --build build
cmake --build build -- ttrt
ttrt query --save-artifacts
export SYSTEM_DESC_PATH=/path/to/generate/system-desc.ttsys (from previous step)
./build/bin/ttmlir-opt --ttir-load-system-desc="path=/path/to/generate/system-desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
TT_METAL_CLEAR_L1=1 TT_METAL_DEVICE_PROFILER=1 ttrt perf out.ttnn --save-artifacts --clean-artifacts --device
You can apply the following changes to your build:
in runtime/tools/python/ttrt/common/api.py
env_vars["TT_METAL_DEVICE_PROFILER"] = "1"
env_vars["TT_METAL_CLEAR_L1"] = "1"
in third_party/CMakeLists.txt, add these two lines inside the ExternalProject_Add(tt-metal call:
-DENABLE_TRACY=${TT_RUNTIME_ENABLE_PERF_TRACE}
-DTRACY_ENABLE=${TT_RUNTIME_ENABLE_PERF_TRACE}
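A sketch of where the two lines above might sit in third_party/CMakeLists.txt; the surrounding arguments are elided, and whether they belong under CMAKE_ARGS depends on how the existing ExternalProject_Add call is structured:

```cmake
# Illustrative placement only -- other arguments elided.
ExternalProject_Add(tt-metal
  # ... existing arguments ...
  CMAKE_ARGS
    # ... existing cache args ...
    -DENABLE_TRACY=${TT_RUNTIME_ENABLE_PERF_TRACE}
    -DTRACY_ENABLE=${TT_RUNTIME_ENABLE_PERF_TRACE}
)
```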
Everything else can be modified in third_party/tt-metal/src/tt-metal files.
All the performance reports get dumped here: /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/generated/profiler/.logs/
continuing work on this
After further investigation, we are missing this closing call: ::tt::tt_metal::detail::DumpDeviceProfileResults(&ttnn_device);
This call is made as part of the pytest conftest in metal. Adding it into runtime.cpp in mlir causes the device to hang, so there is another place where it isn't cleaning up correctly.
Fixed in above PR
Example usage (remember to use --device flag)
ttrt perf /code/tt-mlir/build/test/ttmlir/Silicon/TTNN/Output/simple_matmul.mlir.tmp.ttnn --save-artifacts --clean-artifacts --device &> debug.log
> Example usage (remember to use --device flag)
Maybe we should flip the polarity here? Maybe let's have a --host-only
flag because I think 90% of the time we'll be more interested in device perf.
+1
agreed, changed
Running perf for the flatbuffer out.ttnn creates a csv ops_perf_results_2024_08_29_10_47_08.csv which is missing a lot of columns: CORE COUNT and the DEVICE * DURATION columns are all empty.
@tapspatel Please take a look as without this we don't have any other way of measuring op execution time.