tenstorrent / tt-forge-fe

The TT-Forge FE is a graph compiler designed to optimize and transform computational graphs for deep learning models, enhancing their performance and efficiency.
https://docs.tenstorrent.com/tt-forge-fe/
Apache License 2.0
16 stars 2 forks

Acquire more CI lab machines for purpose of nightly runs #537

Open nvukobratTT opened 3 hours ago

nvukobratTT commented 3 hours ago

Summary

In order to add more CI tests, we need to deploy an additional set of lab machines.

Currently, there isn't a specific number of machines we need, but once we start adding nightly tests, let's figure out the required number.

Right now, a rough estimate is around 10 new N150s.

vmilosevic commented 2 hours ago

Is the intention to run these tests in 10 parallel batches?

There is a pytest extension to enable this https://pypi.org/project/pytest-split/

pip install pytest-split
pytest --splits 3 --group 1
pytest --splits 3 --group 2
pytest --splits 3 --group 3

This, in combination with a GH Actions matrix group_id: [1,2,3,4 ..10], can be enough to configure parallel execution. Looks like it can also store test durations for even splitting. We can think of options for preserving duration information between runs.
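To make the "even splitting" idea concrete, here is a rough sketch of how recorded durations could be used to balance groups. It is not pytest-split's actual implementation, just a simple greedy heuristic that assigns the longest test to the currently lightest group. The function name `split_by_duration`, the test names, and the timings are all made up for illustration.

```python
import heapq


def split_by_duration(durations, num_groups):
    """Greedily assign tests to groups so total runtime per group is roughly even.

    `durations` maps a test name to its last recorded runtime in seconds.
    Returns a list of `num_groups` lists of test names.
    """
    # Min-heap of (total_seconds_so_far, group_index): the lightest group is on top.
    heap = [(0.0, i) for i in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]

    # Longest tests first; ties broken by name so the result is deterministic.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, seconds in ordered:
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))
    return groups


# Hypothetical durations, e.g. loaded from a file saved by a previous nightly run.
durations = {
    "test_llama": 200.0,
    "test_resnet": 120.0,
    "test_bert": 90.0,
    "test_vit": 80.0,
    "test_mnist": 10.0,
    "test_add": 5.0,
}

groups = split_by_duration(durations, 3)
for group_id, tests in enumerate(groups, start=1):
    total = sum(durations[t] for t in tests)
    print(f"group {group_id}: {tests} ({total:.0f}s)")
```

Each matrix job would then run only its own group. With pytest-split itself this bookkeeping should not be needed by hand, since the plugin handles the grouping once a durations file is available. The open part is where to keep that file between runs.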

nvukobratTT commented 1 hour ago

> Is the intention to run these tests in 10 parallel batches?

Yup! I expect to get around 100 models by the end of the year, so having a few more machines in our CI will be beneficial :))

> There is a pytest extension to enable this https://pypi.org/project/pytest-split/

This sounds good! Just one question: does this run a few tests at a time on a single machine, or does it keep the logic of running a few tests across a few machines? E.g.

3 tests on 1 machine
or
3 tests on 3 machines