onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0

Windows CI always builds llvm-project #2481

Open gongsu832 opened 1 year ago

gongsu832 commented 1 year ago

It appears that Windows CI is failing to find the cached MLIR artifact and therefore always rebuilds llvm-project. This makes the build take almost 4 hours to finish.

2023-09-05T18:03:37.0764952Z ##[section]Starting: Check for mlir artifact
2023-09-05T18:03:37.0873571Z ==============================================================================
2023-09-05T18:03:37.0873717Z Task         : PowerShell
2023-09-05T18:03:37.0873784Z Description  : Run a PowerShell script on Linux, macOS, or Windows
2023-09-05T18:03:37.0873895Z Version      : 2.226.2
2023-09-05T18:03:37.0873963Z Author       : Microsoft Corporation
2023-09-05T18:03:37.0874041Z Help         : https://docs.microsoft.com/azure/devops/pipelines/tasks/utility/powershell
2023-09-05T18:03:37.0874145Z ==============================================================================
2023-09-05T18:03:38.0697494Z Generating script.
2023-09-05T18:03:38.1116486Z ========================== Starting Command Output ===========================
2023-09-05T18:03:38.1356537Z ##[command]"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -Command ". 'D:\a\_temp\c6201e30-a2ee-4f59-abe5-3fd607debe63.ps1'"
2023-09-05T18:04:53.7329503Z Found build NotFound containing artifact MLIR_Windows_91088978d712cd7b33610c59f69d87d5a39e3113
2023-09-05T18:04:53.8304662Z ##[section]Finishing: Check for mlir artifact
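
The telling line in the log above is `Found build NotFound containing artifact ...`: the build ID slot holds the literal placeholder `NotFound`, so the lookup came back empty. Below is a minimal Python sketch for spotting this in a downloaded log. It assumes the message format shown above is stable; the shape of a successful line (a real build ID in place of `NotFound`) is also an assumption, and `parse_artifact_check` is a hypothetical helper, not part of the onnx-mlir CI scripts.

```python
import re

# Pattern for the status line printed by the "Check for mlir artifact" step,
# copied from the log above. The format is assumed, not documented.
_STATUS = re.compile(
    r"Found build (?P<build>\S+) containing artifact (?P<artifact>\S+)"
)


def parse_artifact_check(line):
    """Return (build_id, artifact_name), or None if this is not a status line.

    build_id is None when the step printed the placeholder "NotFound".
    """
    match = _STATUS.search(line)
    if match is None:
        return None
    build = match.group("build")
    return (None if build == "NotFound" else build, match.group("artifact"))


line = (
    "2023-09-05T18:04:53.7329503Z Found build NotFound containing artifact "
    "MLIR_Windows_91088978d712cd7b33610c59f69d87d5a39e3113"
)
build_id, artifact = parse_artifact_check(line)
if build_id is None:
    print(f"cache miss for {artifact}: llvm-project will be rebuilt")
else:
    print(f"cache hit: {artifact} found in build {build_id}")
```

Running this over the status lines of several recent pipeline runs would show whether the miss is persistent or only intermittent.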

gongsu832 commented 1 year ago

@MikeHolman will you or some other Microsoft folk be able to take a look at this? Thanks.

MikeHolman commented 1 year ago

@jcwchen, can you take a look at this?

jcwchen commented 1 year ago

It seems that this "not found" issue only happens occasionally? I did see that a few newer commits found the cached artifact.

gongsu832 commented 1 year ago

Yes, it does seem that the problem has gone away. Thanks for looking into it. We will keep an eye on it.

sorenlassen commented 1 year ago

@jcwchen @MikeHolman any chance you can add some cores to the Windows build bot? It's generally the slowest of the CI pipelines and slows down the pace at which we can merge PRs. Moreover, LLVM uplift PRs like #2504, which need to rebuild llvm-project, often time out because they can't finish within the 4h time limit (that PR has timed out twice already).

FWIW, the Windows build bot is much slower than my laptop, where I can build llvm-project and onnx-mlir in less than 1h, so I think it would become a lot faster with more cores.

jcwchen commented 1 year ago

> @jcwchen @MikeHolman any chance you can add some cores to the Windows build bot? it's generally the slowest of the CI pipelines and slows down the pace at which we can merge PRs and, moreover, LLVM uplift PRs like https://github.com/onnx/onnx-mlir/pull/2504 which need to rebuild llvm-project often time out because they can't get it done within the 4h time limit (that PR has timed out twice already)
>
> fwiw, the Windows build bot is much slower than my laptop, I can build llvm-project and onnx-mlir in less than 1h, so I think it would become a lot faster with more cores

It seems to me that the hardware settings for Microsoft-hosted agents are fixed. From the docs: "Microsoft-hosted agents that run Windows and Linux images are provisioned on Azure general purpose virtual machines with a 2 core CPU, 7 GB of RAM, and 14 GB of SSD disk space." (https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml#hardware). I also cannot find a way to increase the number of cores from my end.

For now, general PRs use the cached LLVM build, but as you mentioned, LLVM bump PRs need to build the newer LLVM from source, which takes much more time. A quick workaround would be to extend the timeout to a reasonable value here: https://github.com/onnx/onnx-mlir/blob/3b71a60079c6b1b2c8803c116ef710ffa315aaa3/.azure-pipelines/Windows-CI.yml#L14. In addition, perhaps we can try windows-2022 instead of windows-2019. Sometimes newer machines are just faster.

A permanent solution would be to run self-hosted Windows agents instead, but that would require more engineering cost and budget (for a more powerful CPU).

BTW, I have considered moving the Windows CI from Azure Pipelines to GitHub Actions, but Actions also provides only a 2-core CPU (x86_64), 7 GB of RAM, and 14 GB of SSD space for Windows: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources. So the same issue would probably happen there.

sorenlassen commented 1 year ago

thank you for looking into this

I applied your advice in PR #2511