tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Perf_Mon: `python -m tracy` script features #2040

Open mo-tenstorrent opened 1 year ago

mo-tenstorrent commented 1 year ago

Skewed timers on cores, caused by tensix reset, are one of the main causes of corrupted device profile data. We already detect that case. We should run a sample test at the beginning of profile_this and check whether skew is detected; if it is, we should error out and not proceed.
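A minimal sketch of the fail-fast flow proposed above, assuming the existing skew detection can be wrapped in a callable. The names `preflight_check`, `profile`, `run_sample`, `skew_detected`, `run_workload`, and `SkewedTimersError` are all hypothetical and are not taken from profile_this:

```python
from typing import Any, Callable


class SkewedTimersError(RuntimeError):
    """Raised when the sample run reports skewed core timers (hypothetical name)."""


def preflight_check(run_sample: Callable[[], Any],
                    skew_detected: Callable[[], bool]) -> None:
    """Run a short sample test, then abort if timer skew was reported.

    `run_sample` and `skew_detected` are placeholders for whatever the
    profiler already uses to run a test and to detect skew.
    """
    run_sample()
    if skew_detected():
        raise SkewedTimersError(
            "Skewed core timers detected in the sample run; "
            "device profile data would be unreliable."
        )


def profile(run_sample: Callable[[], Any],
            skew_detected: Callable[[], bool],
            run_workload: Callable[[], Any]) -> Any:
    """Profile the real workload only after the preflight check passes."""
    preflight_check(run_sample, skew_detected)
    return run_workload()
```

With this shape the real workload is never started when the sample run reports skew, so a corrupted device profile is not produced in the first place.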

Are all the lines needed to get the device duration column

Turn to this:

source build/python_env/bin/activate
./tt_metal/tools/profiler/profile_this.py -D -c "pytest tests/python_api_testing/unit_testing/test_resnet50_first_conv.py"

mo-tenstorrent commented 1 year ago

Add -d and -m for device-only and host-only runs.
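One way the two flags could be wired up, sketched with `argparse`. This is an illustration only: the thread does not say which argument parser profile_this uses, the mapping of -d to device-only and -m to host-only follows the order of the sentence above, and making the two flags mutually exclusive is an assumption. The `dest` names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with -c plus the proposed -d / -m mode flags."""
    parser = argparse.ArgumentParser(prog="profile_this.py")
    parser.add_argument("-c", dest="command", required=True,
                        help="test command to profile")
    # Assumption: a run is either device-only or host-only, never both.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("-d", dest="device_only", action="store_true",
                      help="collect device-side profile data only")
    mode.add_argument("-m", dest="host_only", action="store_true",
                      help="collect host-side profile data only")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command, args.device_only, args.host_only)
```

Passing both -d and -m makes `argparse` exit with a usage error; passing neither leaves both attributes `False`, which a caller could treat as the default combined run.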

mo-tenstorrent commented 1 year ago

profile_this_test.xlsx

profile_this_test.zip

The spreadsheet in the zip above shows that, both before and after the changes to profile_this, device-only and host-only runs produce durations within acceptable ranges.