tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Split up device perf CI #7019

Open mo-tenstorrent opened 3 months ago

mo-tenstorrent commented 3 months ago

Following the conversation on Slack regarding device perf CI responsibility: to better distribute CI monitoring among the various model owners, the CI has to be split into multiple jobs.

Initial suggestion is to do a job per model type.

This is the CI https://github.com/tenstorrent-metal/tt-metal/actions/workflows/perf-device-models.yaml
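
A minimal sketch of what a per-model-type split could look like at the script level. The wrapper functions, the bucket argument, and the marker handling below are hypothetical (this is not the actual run_performance.sh interface); only the model paths are borrowed from the existing device-perf run.

```bash
# Hypothetical sketch: split the single monolithic device-perf run into
# per-model-type buckets so each CI job owns one bucket. Names are illustrative.
test_marker="${TEST_MARKER:?set the device-perf pytest marker}"

run_device_perf_llm_models() {
    env pytest models/demos/ttnn_falcon7b/tests -m "$test_marker"
}

run_device_perf_cnn_models() {
    env pytest models/demos/resnet/tests -m "$test_marker"
}

run_device_perf_other_models() {
    env pytest models/demos/metal_BERT_large_11/tests -m "$test_marker"
}

# Each CI job would invoke its own bucket, e.g. "./run_device_perf.sh llm"
case "${1:-}" in
    llm)   run_device_perf_llm_models ;;
    cnn)   run_device_perf_cnn_models ;;
    other) run_device_perf_other_models ;;
    *)     echo "usage: $0 {llm|cnn|other}" >&2; exit 1 ;;
esac
```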

tt-rkim commented 3 months ago

Closing https://github.com/tenstorrent-metal/tt-metal/issues/7021 in favour of this one

mo-tenstorrent commented 3 months ago

@jliangTT Did we decide on how we are going to break down this CI so that each job gets an owner?

CNN, LLM, Other might be too generalized. I could also split at the root level, where every line in the following becomes its own job, i.e.:

This will make the CI take longer, and it can get crowded really quickly.
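
For scale, here is a hedged illustration (not the original list referenced above, which is not preserved here) of what a root-level breakdown would mean, assuming one CI job per pytest target currently exercised by run_device_perf_models:

```bash
# Illustrative only: a root-level breakdown would mean roughly one CI job per
# pytest target currently covered by the device-perf run.
device_perf_targets=(
    "tests/ttnn/integration_tests/resnet/test_performance.py"
    "models/demos/resnet/tests"
    "models/demos/metal_BERT_large_11/tests"
    "models/demos/ttnn_falcon7b/tests"
    "models/demos/bert/tests"
    "models/demos/mistral7b/tests"
)

# One job per entry: the job count grows with every new model, which is where
# the "crowded real quick" concern comes from.
for target in "${device_perf_targets[@]}"; do
    echo "would schedule a separate device-perf CI job for: ${target}"
done
```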

tt-rkim commented 3 months ago

Is it too generalized? I thought that's what we decided on the landing page. Have people complained that these buckets are too large, or is that your own concern?

mo-tenstorrent commented 3 months ago

My only worry is that if we have failures at that level, owners can't immediately tell whether they are in charge of investigating the failure.

tt-rkim commented 3 months ago

Unless @jliangTT or @TT-billteng have different understandings of the process from me, I believe that's the point of "pipeline ownership". Their team owns the pipeline, full stop, and is responsible for finding out what's wrong. They're always welcome to ask others to fix their pipelines, including the infra team.

They may or may not be the cause of the failure. They may or may not have the skills or knowledge to find the root cause, or to fix it. Regardless, they are the ones accountable for ensuring that it's green again.

I also believe @uaydonat has this understanding of pipeline ownership.

jliangTT commented 3 months ago

I am trying to use this chart to reason about ownership - https://docs.google.com/spreadsheets/d/1px7wdl29yeCEQQ1rQGFCQR69lk6BdRQdeEMDG0t-w3M/edit#gid=1506109495

It is updated with the latest view. Let me know if anything is confusing.

TT-billteng commented 3 months ago

are we tracking LLMs on GS?

(image attached)

jliangTT commented 3 months ago

What is currently running pertaining to falcon/mistral/llama on GS? Is that basically a no-op?

tt-rkim commented 3 months ago

As of main yesterday, no on-device perf models run on WH. Only GS.

In tests/scripts/run_performance.sh, we see that the following models are run on device perf:

```bash
run_device_perf_models() {
...
    # explicitly skips wh b0 so probably broken on it
    env pytest "tests/ttnn/integration_tests/resnet/test_performance.py" -m $test_marker
    env pytest models/demos/resnet/tests -m $test_marker
    env pytest models/demos/metal_BERT_large_11/tests -m $test_marker
    env pytest models/demos/ttnn_falcon7b/tests -m $test_marker
    # not sure what exactly is diff b/w this and metal_BERT_large_11, maybe some ops / sizes
    env pytest models/demos/bert/tests -m $test_marker
    # this doesn't even run on GS, so it's a no-op. We should reach out to the test authors
    env pytest models/demos/mistral7b/tests -m $test_marker
...
}
```

So:

@mo-tenstorrent also made some small quality-of-life changes to upload the CSV perf results even if there's a perf check failure. I see the following models run on latest main:

(screenshot of models run on latest main, 2024-04-04)

You can download the artifact directly from the workflow run.
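
A minimal shell-level sketch of the "keep the perf CSVs even when the check fails" idea, assuming $test_marker is set as in the excerpt above; the CSV and artifact paths are placeholders, not the actual workflow steps.

```bash
# Illustrative only: run a device-perf target, remember its exit code, copy
# whatever perf CSVs were generated so CI can pick them up as an artifact,
# then propagate the original result so the job still fails when perf regresses.
env pytest models/demos/resnet/tests -m "$test_marker"
perf_exit_code=$?

mkdir -p perf_artifacts
cp -v generated/*.csv perf_artifacts/ 2>/dev/null || true   # placeholder CSV location

exit "$perf_exit_code"
```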

skhorasganiTT commented 3 months ago

Falcon7b ttnn is owned by @cfjchu

uaydonat commented 3 months ago

Hey guys, we need on-device perf models running on WH. We only optimize LLMs on WH. We would add falcon7b, mistral, and mamba to the WH device-perf pipeline. It would be great if it were named something like llm_javelin_wormhole_b0.
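
A sketch of what such a bucket could look like in run_performance.sh terms; the function name and the falcon7b/mamba test paths are assumptions, and only the model names come from the request above.

```bash
# Hypothetical WH (wormhole_b0) LLM device-perf bucket for a pipeline named
# along the lines of llm_javelin_wormhole_b0. Marked paths are illustrative;
# $test_marker is whatever the surrounding script already sets, as in the
# run_device_perf_models excerpt earlier in this thread.
run_device_perf_llm_models_wormhole_b0() {
    env pytest models/demos/falcon7b/tests -m "$test_marker"   # illustrative path
    env pytest models/demos/mistral7b/tests -m "$test_marker"
    env pytest models/demos/mamba/tests -m "$test_marker"      # illustrative path
}
```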