Optimize code coverage publishing in pipelines

makubacki commented 10 months ago

When a large number of pipelines run at once (e.g. a pytools pip update), the PublishCoverage job in Jobs/PrGate.yml contributes significantly to the overall time for the pipeline to finish.

This makes PRs take much longer to reach a finished status check state, and the overall job queue takes longer to drain. Essentially, the pipeline has to queue twice: once for all of the matrix jobs and again for the code coverage publishing job.

This issue tracks optimizing code coverage publishing to reduce the impact on overall pipeline execution time.

For example, a matrix will spawn N jobs. At the end of each matrix job, it could check whether all other matrix jobs (N-1) are complete, perhaps using the ADO REST API, and if so, attempt to publish code coverage directly from that job.
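A minimal sketch of that check, assuming the build timeline endpoint of the ADO REST API is used. The function names, the `matrix_job_names` parameter, and the way a job learns its own timeline record id are hypothetical and only for illustration; they are not part of mu_devops.

```python
import json
import urllib.request


def fetch_timeline(org, project, build_id, token):
    """Fetch the timeline of a build (assumed api-version 7.0).

    `token` is assumed to be a bearer token such as the pipeline's
    job access token.
    """
    url = (
        f"https://dev.azure.com/{org}/{project}"
        f"/_apis/build/builds/{build_id}/timeline?api-version=7.0"
    )
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def is_last_matrix_job(timeline, my_job_id, matrix_job_names):
    """Return True if every other matrix job (N-1) has completed.

    `my_job_id` is the timeline record id of the calling job, and
    `matrix_job_names` is the set of all N matrix job names
    (both hypothetical inputs supplied by the caller).
    """
    others = [
        record
        for record in timeline.get("records", [])
        if record.get("type") == "Job"
        and record.get("name") in matrix_job_names
        and record.get("id") != my_job_id
    ]
    # All N-1 sibling jobs must be present and completed.
    return len(others) == len(matrix_job_names) - 1 and all(
        record.get("state") == "completed" for record in others
    )
```

One caveat with this approach: the calling job is itself still in progress when it checks, so two jobs that finish at nearly the same time could each see the other as unfinished, and neither would publish. Some fallback would be needed for that case.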

Javagedes commented 10 months ago

@makubacki My initial impression is that this can decrease the wait time of a specific PR run if there is an existing backup of runners that need to execute, as the code coverage pipeline is queued to the back. However, changing the code coverage to publish for each pipeline will increase the overall backup of all runners, because it will add ~8 minutes to each pipeline, whereas the current implementation only adds ~9 minutes to a single pipeline. With only 30 runners, we could easily feel those effects.
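As a rough illustration of that trade-off, here is the arithmetic with the ~8 and ~9 minute figures from the comment. Reading "each pipeline" as each of the N matrix jobs is an assumption, and the job count in the example is hypothetical.

```python
def added_runner_minutes(num_jobs, per_job_minutes=8, single_job_minutes=9):
    """Compare total runner time added by the two approaches.

    Returns (minutes added when every job publishes,
             minutes added by one separate publishing job).
    The defaults are the approximate figures quoted in the comment.
    """
    publish_from_each_job = num_jobs * per_job_minutes
    publish_once = single_job_minutes
    return publish_from_each_job, publish_once


# Hypothetical matrix of 10 jobs.
per_job_total, single_total = added_runner_minutes(10)
print(per_job_total, single_total)  # prints "80 9"
```

Under these assumptions, publishing from every job costs roughly N times as much runner time as the single publishing job, which is what would lengthen the queue when runners are scarce.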

I think a better solution, if it exists (which it probably doesn't), is a configuration in azure devops such that these types of jobs (ones that have dependencies on jobs that have already run) are pushed to the front of the runner queue.

Another option would be to not require the CodeCoverage job to pass. That way we can quickly push through PRs that we know won't affect code coverage.

Additionally, PublishCodeCoverageResults@1 has a limitation in which only the first code coverage report added is used, so the reports do all need to be merged into one. I don't see that limitation with PublishCodeCoverageResults@2, but that does not necessarily mean it is gone. That is something I would need to investigate.

makubacki commented 10 months ago

> I think a better solution, if it exists (which it probably doesn't), is a configuration in azure devops such that these types of jobs (ones that have dependencies on jobs that have already run) are pushed to the front of the runner queue.

Agree. I'd also prefer to simply prioritize the job if possible.

Javagedes commented 10 months ago

They added a "Run this job next" button, which is manual, but there are no priority settings in the pipeline or configuration in Azure DevOps.

https://learn.microsoft.com/en-us/azure/devops/release-notes/2020/pipelines/sprint-175-update#run-this-job-next

The only solution someone mentioned is to have a separate agent pool with a runner or two in it, and then in the pipeline use the "demands" config to use that particular agent pool.

Javagedes commented 10 months ago

@makubacki My original implementation of the most recent code coverage changes did have the parsing and report manipulation in each of the individual matrix jobs, so if PublishCodeCoverageResults@2 supports merging reports, then it is an easy resolution. Otherwise, we may consider uploading to CodeCov instead, which does merge results. That would work for any public repos but not private ones.

Javagedes commented 10 months ago

Working on this: PublishCodeCoverageResults@2 does merge coverage reports. However, the way they are merged causes the report to no longer lump source files by INF (see Here).

An additional issue is that the containers don't have dotnet in them, so they cannot upload the coverage data, as the upload relies on dotnet.

microsoft / mu_devops

Optimize code coverage publishing in pipelines #267