tenstorrent / tt-metal

TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

Enable exporting some pipeline data for some overall analysis #9319

Open · tt-rkim opened 1 month ago

tt-rkim commented 1 month ago

Will be starting some initial verification and a PoC for exporting certain data values at the workflow (pipeline) level.

cc: @TT-billteng @dimitri-tenstorrent

using this to track data science issues

tt-rkim commented 1 month ago

For the API: we will need workflow_id and will likely use only attempt: 1 for the first data MVP. Luckily, we can do some batching, since we can list workflow runs by status and created time: https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28#list-workflow-runs-for-a-repository
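For reference, a minimal sketch of that batching approach using the documented `status` and `created` query parameters on that endpoint (owner/repo values and token handling here are placeholders, not our actual tooling):

```python
# Minimal sketch: list completed workflow runs created after a given date.
# Assumes a GITHUB_TOKEN env var; owner/repo are placeholders.
import os
import requests


def list_workflow_runs(owner, repo, status="completed", created=">=2024-06-01"):
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    params = {"status": status, "created": created, "per_page": 100}
    runs = []
    while url:
        resp = requests.get(url, headers=headers, params=params)
        resp.raise_for_status()
        runs.extend(resp.json()["workflow_runs"])
        url = resp.links.get("next", {}).get("url")  # follow pagination
        params = None  # the next link already carries the query params
    return runs
```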

tt-rkim commented 3 weeks ago

Relevant discussion on workflow_run triggers: https://github.com/orgs/community/discussions/128694

tt-rkim commented 3 weeks ago

@kyma-tt @vmilosevic @TT-billteng Webhook created for jobs only and sending to our infra, not for workflow runs (pipelines) yet. Everything is still in the PoC phase, so this won't get us all the data we want quite yet.
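For anyone wiring this up later, a rough sketch of what the receiving side of a workflow_job webhook can look like (Flask chosen only for illustration; the endpoint path, storage call, and signature/secret handling are placeholders, not our actual infra):

```python
# Rough sketch of a workflow_job webhook receiver (not our actual infra code).
# GitHub sends the event name in the X-GitHub-Event header and the job payload
# under the "workflow_job" key. Signature verification is omitted here.
from flask import Flask, request

app = Flask(__name__)


@app.route("/github/webhook", methods=["POST"])
def handle_webhook():
    event = request.headers.get("X-GitHub-Event")
    if event != "workflow_job":
        return "", 204  # ignore event types we don't care about yet

    payload = request.get_json(force=True)
    job = payload.get("workflow_job", {})
    record = {
        "job_id": job.get("id"),
        "run_id": job.get("run_id"),
        "name": job.get("name"),
        "status": job.get("status"),
        "conclusion": job.get("conclusion"),
    }
    # store_record(record)  # placeholder for whatever our infra does with it
    return "", 200
```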

tt-rkim commented 2 weeks ago

Should probably have tighter data schema enforcement with something like Pydantic, but no need for now. MVP mode.
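If/when we do want that, a sketch of what it could look like (field names are illustrative, not a settled schema):

```python
# Hypothetical sketch of schema enforcement with Pydantic; field names are
# illustrative only, not a settled schema.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class JobRecord(BaseModel):
    job_id: int
    run_id: int
    name: str
    status: str
    conclusion: Optional[str] = None
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None


# JobRecord(**payload) raises a ValidationError on missing or mistyped fields
# instead of silently letting bad data through.
```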

tt-rkim commented 2 weeks ago

#9591 is a follow-up issue for data fields that are less important but more complicated to get.

tt-rkim commented 2 weeks ago

Note that JUnit XML test times include both setup and teardown, per: https://docs.pytest.org/en/7.0.x/how-to/output.html#creating-junitxml-format-files

There is a way to record only the call time, but we'll keep the default for now.
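For the record, pytest has a junit_duration_report ini option for this (the default is total); switching to call-only durations would look roughly like:

```ini
# pytest.ini sketch: report only the test call phase duration in the JUnit XML,
# excluding setup/teardown. We're keeping the default ("total") for now.
[pytest]
junit_duration_report = call
```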

tt-rkim commented 2 weeks ago

Example JUnit xml: https://gist.github.com/tt-rkim/27cb642ea394903586b0b3e810fde52d

Some good stuff, but not a whole lot.
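For anyone poking at those files, a quick sketch of pulling test names and durations out of a JUnit XML like that one (standard library only; the file path is a placeholder):

```python
# Quick sketch: extract test case names and durations from a JUnit XML report.
# The file path is a placeholder.
import xml.etree.ElementTree as ET


def extract_testcases(junit_xml_path):
    tree = ET.parse(junit_xml_path)
    root = tree.getroot()
    # pytest emits <testsuites><testsuite><testcase .../></testsuite></testsuites>;
    # iter() handles either a <testsuites> or a bare <testsuite> root.
    for testcase in root.iter("testcase"):
        yield {
            "classname": testcase.get("classname"),
            "name": testcase.get("name"),
            "duration_s": float(testcase.get("time", 0.0)),
            "failed": testcase.find("failure") is not None,
            "skipped": testcase.find("skipped") is not None,
        }


if __name__ == "__main__":
    for tc in extract_testcases("report.xml"):
        print(tc)
```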

tt-rkim commented 1 week ago

One of the requirements is that we need to know which of the tests we ran are associated with which GitHub job. Because we get test data from JUnit XMLs, part of this work is: given a JUnit XML, we need to know which job it's associated with.

Because job names are not reliably unique across jobs, the only dependable way to identify a job is by its job ID. Therefore, we need to know which job ID is associated with which JUnit XML.

The JUnit XMLs are generated during job/test runtime. The problem is that there is no way, during job or test runtime, to associate the unique job ID of a GitHub job with anything running inside it. In other words, no matter what artifact we're trying to associate with the job, whether it's a JUnit XML or something else, there's no way to uniquely identify the job in which that artifact was generated.

So even though we want to upload JUnit XMLs as an artifact and consume them later for the data upload, we can't attach the job ID to them. There's therefore no way to know which job any given JUnit XML came from.

Other people have complained about this. I'm posting to https://github.com/orgs/community/discussions/8945 for help.
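For context on why this is painful: inside the job we only have things like GITHUB_RUN_ID (and GITHUB_JOB, which is the workflow-file key, not the numeric ID), so the best a post-hoc approach could do is list the run's jobs via the API and try to match on the display name, which is exactly the part that isn't reliable. A hypothetical sketch of that approach, just to show where it falls down (not what we're doing):

```python
# Hypothetical post-hoc matching sketch, NOT our approach: list a run's jobs and
# try to pair a JUnit XML with a job by its (non-unique) display name.
import os
import requests


def jobs_for_run(owner, repo, run_id):
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs/{run_id}/jobs"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    }
    resp = requests.get(url, headers=headers, params={"per_page": 100})
    resp.raise_for_status()
    return resp.json()["jobs"]


def guess_job_id(jobs, job_display_name):
    # Falls apart as soon as two jobs (e.g. matrix entries) share a display name.
    matches = [j["id"] for j in jobs if j["name"] == job_display_name]
    return matches[0] if len(matches) == 1 else None
```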

tt-rkim commented 1 day ago

https://github.com/tenstorrent/tt-metal/pull/9659/files is the PR that enabled benchmarking on the Falcon T3K demos.

tt-rkim commented 6 hours ago

As of today: