Open gongsu832 opened 1 year ago
@MikeHolman will you or some other Microsoft folk be able to take a look at this? Thanks.
@jcwchen, can you take a look at this?
It seems this "not found" issue only happened occasionally? I did see that a few newer commits found the cached artifact.
Yes it does seem that the problem has gone away. Thanks for looking into it. We will keep an eye on it.
@jcwchen @MikeHolman any chance you can add some cores to the Windows build bot? It's generally the slowest of the CI pipelines and slows down the pace at which we can merge PRs. Moreover, LLVM uplift PRs like #2504, which need to rebuild llvm-project, often time out because they can't finish within the 4h time limit (that PR has timed out twice already).
fwiw, the Windows build bot is much slower than my laptop: I can build llvm-project and onnx-mlir in less than 1h, so I think it would become a lot faster with more cores.
It seems to me the hardware settings for Microsoft-hosted agents are fixed ("Microsoft-hosted agents that run Windows and Linux images are provisioned on Azure general purpose virtual machines with a 2 core CPU, 7 GB of RAM, and 14 GB of SSD disk space."): https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml#hardware, and I also cannot find a way to increase the cores from my end.
For now, general PRs will use the cached LLVM build, but as you mentioned, PRs for an LLVM bump need to build the newer LLVM from source, which takes much more time... A quick workaround would be to extend the timeout to a reasonable number here: https://github.com/onnx/onnx-mlir/blob/3b71a60079c6b1b2c8803c116ef710ffa315aaa3/.azure-pipelines/Windows-CI.yml#L14. In addition, perhaps we can try windows-2022 instead of windows-2019. Sometimes newer machines are just faster.
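As a rough sketch of that workaround, the job section could look something like the fragment below. This is not the actual content of Windows-CI.yml: the job name and the 360-minute value are placeholders I picked for illustration, and the right timeout depends on what limit applies to this project's hosted agents.

```yaml
jobs:
- job: Build_onnx_mlir_Windows   # hypothetical job name, not taken from the repo
  timeoutInMinutes: 360          # illustrative value; the current limit is about 4h
  pool:
    vmImage: 'windows-2022'      # instead of 'windows-2019'
```

Both keys are set at the job level, so the change should stay local to the Windows pipeline and not affect other CI jobs.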
A permanent solution would be to host self-hosted Windows agents instead (with a more powerful CPU), but that will require more engineering cost and budget.
BTW, I have considered moving Windows CI from Azure Pipelines to GitHub Actions, but Actions also provides only a 2-core CPU (x86_64), 7 GB of RAM, and 14 GB of SSD space for Windows: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources. So the same issue would probably happen there.
thank you for looking into this
I applied your advice in PR #2511
It appears that Windows CI is failing to find the cached artifact and therefore always rebuilds llvm-project. This makes the build take almost 4 hours to finish.