Balance our infra to give Forge Frontend more machines

staylorTT commented 2 weeks ago

Right now Forge FE has only 1 machine allocated to it. This is causing a bottleneck in their development.

@nvukobratTT How many additional machines would ease the constraint?

@tapspatel How many machines can we afford to allocate to the FE side of things?

I know we looked into pooling, which seemed non-trivial. However, I wonder if there is a way to make this easier in the future, or at least to understand how our infra utilization is going.

nvukobratTT commented 2 weeks ago

@nsmithtt @tapspatel During this sprint, we plan to focus mostly on E2E tests for higher robustness. Therefore, we'll need more than one silicon CI machine.

What are the plans for MLIR testing? Do you have any estimates of the required workload?

From the FFE perspective, it would be good to have:

2 build runners - 4 more people will join FFE in the next few weeks, so it would be good to have faster builds
2 more n150 runners - in the upcoming week or two, we're starting robust op testing (e.g. e2e tests covering many different variations and attributes). This is mostly to surface FFE, MLIR, and TTNN/Metal bugs sooner rather than later, and to plan patching accordingly (especially on the tt-nn side).
n300 - multi-chip is still WIP, so there is no need to have these on FFE yet

Given the above, it would be good to balance things out a bit. Do you think that is possible?

vmilosevic commented 1 week ago

After discussing this, we're going with a roughly even (51/49) split: each machine type is divided equally, and where a type has an odd number of machines, the extra one goes to forge-fe.

tt-mlir

builder-1 builder-2 n150-1 n150-2 n150-3 n300-1 n300-2

tt-forge-fe

builder-3 builder-4 builder-5 n150-4 n150-5 n150-6 n300-3 n300-4 n300-5
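
For reference, the split rule above can be sketched in a few lines of Python. This is only an illustration, not anything that runs in our CI: the function name `split_runners` and the `counts` dictionary are hypothetical, and the totals (5 builders, 6 n150, 5 n300) are simply counted from the two lists above.

```python
def split_runners(counts, favored="tt-forge-fe", other="tt-mlir"):
    """Split each machine type evenly; an odd leftover goes to `favored`."""
    allocation = {favored: {}, other: {}}
    for machine_type, total in counts.items():
        half, extra = divmod(total, 2)  # extra is 1 only for odd totals
        allocation[other][machine_type] = half
        allocation[favored][machine_type] = half + extra
    return allocation


# Totals counted from the runner lists in this comment (assumed complete).
counts = {"builder": 5, "n150": 6, "n300": 5}
allocation = split_runners(counts)

for project, machines in allocation.items():
    print(project, machines)
```

With these totals, tt-mlir ends up with 2 builders, 3 n150 and 2 n300, and tt-forge-fe with 3 of each, which matches the lists above.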

nvukobratTT commented 1 week ago

Thanks! Will those appear here once added?

vmilosevic commented 1 week ago

Yes. There are new VMs here as well, so I need to set up Docker, drivers, huge pages, etc. first. Once this is done, they will show up as active runners.

nvukobratTT commented 1 week ago

Sounds good! Thanks for pushing this further @vmilosevic 💪

tenstorrent / tt-mlir

Balance our infra to give Forge Frontend more machines #520