We currently run the .github/workflows/ci_windows_x64_msvc.yml workflow on a nightly schedule using standard GitHub-hosted runners (currently windows-2022 with 4 CPU cores, 16 GB of RAM, and 14 GB of SSD). Looking at the workflow history, this is taking around 4h30m each run, which is far too slow to run on pull_request or even push events.

We should add a build runner cluster with suitably large machines configured with caching layers so we can run this workflow more regularly - ideally on every commit (pull_request and push events).
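As a rough sketch of the shape this could take - the trigger change plus a caching layer on a larger runner - here is a hypothetical workflow fragment. The runner label, cache tool, paths, and keys below are placeholders, not decisions:

```yaml
# Hypothetical sketch only - not the current ci_windows_x64_msvc.yml.
name: CI - Windows x64 MSVC (large runners)

on:
  pull_request:
  push:
    branches: [main]

jobs:
  build:
    # Placeholder label; the real label depends on how the runner cluster is registered.
    runs-on: [self-hosted, windows-x64-64core]
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true
      # One possible caching layer: persist a compiler cache (ccache/sccache or similar)
      # across runs so rebuilds stay incremental.
      - uses: actions/cache@v4
        with:
          path: ${{ github.workspace }}/.cache/compiler-cache
          key: windows-msvc-cache-${{ github.sha }}
          restore-keys: |
            windows-msvc-cache-
      # ... configure + build + test steps go here ...
```

A self-hosted cluster also opens up options the sketch doesn't show, such as keeping persistent source checkouts and compiler caches on local SSD between runs, which would help with the slow-checkout problem noted in the details below.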
Details:
32 cores may be sufficient, but 64 or more would help. We should run some build time experiments to see.
Napkin math for budgeting: CI load is about 50-100 pull_request events per day, and jobs take 10-30 minutes (rough totals worked out below).
We used to use larger GitHub-hosted runners when we were on a GitHub enterprise plan. With 64 cores, these could build the project in 10-20 minutes, but they also took 4-10 minutes just to check out the repository (something like 4x longer than the "standard" runners for a network/disk limited task).
We can continue to use standard GitHub-hosted runners for smaller jobs like runtime builds/tests/releases, and in Python projects like iree-turbine that don't need to build the LLVM/MLIR-based compiler binaries.
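Working out the napkin math above (assumed ranges, not measurements): the low end is 50 jobs/day × 10 min/job = 500 minutes, or roughly 8 runner-hours per day; the high end is 100 jobs/day × 30 min/job = 3000 minutes, or about 50 runner-hours per day. Spread evenly, the high end keeps roughly two large machines busy around the clock, and peak-hour bursts, push-event jobs, and retries would need headroom on top of that.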
Other considerations: