Windows agents are soooooooooo slooooooooooooooooooow

daniel-beck commented 1 year ago

Service(s)

ci.jenkins.io

Summary

Looking through some successful builds in https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/activity I see wildly different build durations per platform:

https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7050/2/pipeline

Linux JDK11: ~ 1 hr 50 min
Linux JDK17: ~ 1 hr 40 min
Windows JDK11: ~ 5 hrs 50 min

https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7047/3/pipeline

Linux JDK11: ~ 1 hr 40 min
Linux JDK17: ~ 1 hr 35 min
Windows JDK11: ~ 5 hrs

https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7054/1/pipeline

Linux JDK11: ~ 2 hrs
Linux JDK17: ~ 1 hr 35 min
Windows JDK11: ~ 5 hrs 40 min

https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/master/4002/pipeline

Linux JDK11: ~ 1 hr 30 min
Linux JDK17: ~ 2 hrs 5 min
Windows JDK11: ~ 5 hrs 50 min

Waiting six hours (almost a work day) to get an incrementals deployment seems too long, especially when Linux builds are reliably done in two hours, sometimes less.
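
To put a number on the gap, the durations listed above can be turned into slowdown ratios. This is only a rough sketch: the minute values are the approximate figures copied from the four builds above, and the names `builds` and `windows_slowdown` are made up for illustration.

```python
# Approximate durations in minutes, copied from the four builds listed above.
builds = {
    "PR-7050/2": {"Linux JDK11": 110, "Linux JDK17": 100, "Windows JDK11": 350},
    "PR-7047/3": {"Linux JDK11": 100, "Linux JDK17": 95, "Windows JDK11": 300},
    "PR-7054/1": {"Linux JDK11": 120, "Linux JDK17": 95, "Windows JDK11": 340},
    "master/4002": {"Linux JDK11": 90, "Linux JDK17": 125, "Windows JDK11": 350},
}


def windows_slowdown(durations):
    """Ratio of the Windows duration to the slowest Linux duration."""
    slowest_linux = max(
        minutes for name, minutes in durations.items() if name.startswith("Linux")
    )
    return durations["Windows JDK11"] / slowest_linux


for build, durations in builds.items():
    ratio = windows_slowdown(durations)
    print(f"{build}: Windows takes {ratio:.1f}x as long as the slowest Linux branch")
```

Comparing against the slowest Linux branch is the conservative choice, since the overall build cannot finish before that branch does anyway.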

Reproduction steps

No response

NotMyFault commented 1 year ago

Waiting six hours (almost a work day) to get an incrementals deployment

It feels like getting incrementals is a matter of luck if you build at peak times. I can't recall how many times builds had to be restarted recently because Windows machines were either disconnected or killed.

basil commented 1 year ago

I think the Windows builds have been very slow since I re-enabled them in jenkinsci/jenkins#6024. I am not sure if throwing more hardware resources at the problem would necessarily improve performance. I seem to recall a Jira ticket being open about one possible cause: the creation and subsequent deletion of a fresh Jenkins home directory for each test (which involves extracting many .jpi files that each contain many tiny files) is far slower on Windows than it is on Unix-like systems. I think that ticket mentioned the idea of implementing plugin class loading without unzipping each plugin .jpi file using something like Tomcat's unpackWARs=false, i.e. https://github.com/apache/tomcat/blob/5190f92b5e8288cde5c0f4a9814b46166e6447bb/java/org/apache/catalina/webresources/JarWarResourceSet.java. That might be possible in theory, but it would be some amount of work to implement, and based on the comments on this page the result might be just trading one performance problem for another. I doubt there is an easy or practical option in the short to medium term.
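
One way to check whether this is really the bottleneck would be to time the extract-then-delete pattern in isolation on a Windows agent and on a Linux agent. The following is a minimal sketch under stated assumptions: it is not the Jenkins test harness, the zip it builds is only a stand-in for a .jpi file, and the names `build_archive` and `time_extract_and_delete` are hypothetical.

```python
import io
import shutil
import tempfile
import time
import zipfile
from pathlib import Path


def build_archive(num_files, payload=b"x" * 64):
    """Build an in-memory zip with many tiny entries (a stand-in for a .jpi)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for i in range(num_files):
            archive.writestr(f"WEB-INF/classes/pkg{i % 50}/File{i}.class", payload)
    buffer.seek(0)
    return buffer


def time_extract_and_delete(num_files=2000):
    """Return (extract_seconds, delete_seconds) for one fresh directory."""
    archive_bytes = build_archive(num_files)
    # Assumption: a temporary directory plays the role of a fresh Jenkins home.
    home = Path(tempfile.mkdtemp(prefix="fake-jenkins-home-"))
    start = time.perf_counter()
    with zipfile.ZipFile(archive_bytes) as archive:
        archive.extractall(home / "plugins" / "example")
    extracted = time.perf_counter()
    shutil.rmtree(home)
    deleted = time.perf_counter()
    return extracted - start, deleted - extracted


if __name__ == "__main__":
    extract_s, delete_s = time_extract_and_delete()
    print(f"extract: {extract_s:.3f}s, delete: {delete_s:.3f}s")
```

Running the same script on both platforms with the same `num_files` would give a crude per-test cost to compare; it says nothing about class loading or anything else the real tests do.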

Flakiness could be mitigated in the short term by adding e.g. retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) to the Jenkinsfile as was done in buildPlugin() and the BOM Jenkinsfile, though this merely hides the problem rather than fixing the root cause. Being one to prefer fixing the root cause, I would not oppose such a change (nor did I oppose it for buildPlugin() and the BOM Jenkinsfile) but I have not gone out of my way to implement it.
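
The intent of such a retry, which is to retry only when the agent is lost and not when tests genuinely fail, can be sketched outside Jenkins. The Python below is a loose illustration of that behavior, not the Pipeline implementation; `AgentLost` and `retry_on_infra_failure` are hypothetical names.

```python
class AgentLost(Exception):
    """Hypothetical stand-in for an infrastructure failure (agent disconnected or killed)."""


def retry_on_infra_failure(task, count=3):
    """Run task up to `count` times, retrying only when the agent was lost.

    Any other exception (for example a real test failure) propagates
    immediately, so genuine failures are not retried.
    """
    for attempt in range(1, count + 1):
        try:
            return task()
        except AgentLost:
            if attempt == count:
                raise


attempts = []


def flaky_build():
    """Fails twice with a lost agent, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise AgentLost("agent disconnected")
    return "SUCCESS"


print(retry_on_infra_failure(flaky_build))  # prints "SUCCESS" on the third attempt
```

As noted, this masks the disconnects rather than explaining them, and each retry of a multi-hour Windows branch costs another full run.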

Perhaps we ought to declare that we are not getting much value from Windows testing and reduce its scope to just those tests in the org.jvnet.hudson.test.SmokeTest group by adding -Psmoke-test to the core Jenkinsfile on Windows. While lowering test coverage, that would improve test runtime and cut costs, and if we do not feel the value is high it may be a decent tradeoff.

daniel-beck commented 1 year ago

Perhaps we ought to declare that we are not getting much value from Windows testing and reduce its scope to just those tests in the org.jvnet.hudson.test.SmokeTest group by adding -Psmoke-test to the core Jenkinsfile on Windows. While lowering test coverage, that would improve test runtime and cut costs, and if we do not feel the value is high it may be a decent tradeoff.

Another alternative might be to do the incrementals deployment once one (or both) of the Linux builds have passed, so we don't wait for the slower Windows build to finish. While we'd want Windows coverage before merging, I would expect it to be rare that we actively have to wait for builds to finish, whereas waiting for an incrementals deployment is probably more common. Of course, the use case of integrated core + plugin PRs also isn't that common…

Might need a careful look at incrementals validation to see whether this is even doable.

jtnord commented 1 year ago

Random thought: I believe that on Windows Server, disk write caching is disabled by default (it is enabled by default on client OSes). If we are using ephemeral machines and it is disabled, enabling it may well help a bit. (I know Jenkins startup is slower on Windows than on Linux for the same hardware, but not normally by the factor observed in this ticket.)

There are two options: write caching and write-cache buffer flushing. The latter option may also help (but depending on the drive it could hinder).

basil commented 1 year ago

@lemeurherve added this to the not-actionable-by-infra-team milestone 7 days ago

Is this really the case? I think the Jenkinsfile changes I suggested above are actionable by the infrastructure team.

dduportal commented 1 year ago

@lemeurherve added this to the not-actionable-by-infra-team milestone 7 days ago

Is this really the case? I think the Jenkinsfile changes I suggested above are actionable by the infrastructure team.

The infra team tends to avoid changing the Jenkinsfile of the Jenkins Core project so as not to interfere with the contribution processes, as it might impact people in knowledge areas that we do not have. This is why we added this milestone: to mark this issue and watch it, but without really knowing what to do with it.

Your suggestion still seems actionable: if I understand correctly, the scope is to make sure that failed Windows test suites are retried until the root cause is identified. Is my understanding correct?

dduportal commented 1 year ago

Random thought: I believe that on Windows Server, disk write caching is disabled by default (it is enabled by default on client OSes). If we are using ephemeral machines and it is disabled, enabling it may well help a bit. (I know Jenkins startup is slower on Windows than on Linux for the same hardware, but not normally by the factor observed in this ticket.)

There are two options: write caching and write-cache buffer flushing. The latter option may also help (but depending on the drive it could hinder).

Interesting. If I understand correctly, this would be a Windows setting? Since we customize the VM images, that should be easy to do in https://github.com/jenkins-infra/packer-images/blob/main/provisioning/windows-provision.ps1 ?

Or is it a cloud-related setting to configure in the VM definition (e.g. in the EC2 and Azure-VM plugin setups)?

basil commented 1 year ago

The infra team tends to avoid changing the Jenkinsfile of the Jenkins Core project

https://github.com/jenkins-infra/pipeline-library/commits?author=dduportal ?

dduportal commented 1 year ago

The infra team tends to avoid changing the Jenkinsfile of the Jenkins Core project

https://github.com/jenkins-infra/pipeline-library/commits?author=dduportal ?

I'm not sure I understand, could you clarify?

I'm not saying that the infra team is not going to take care of that. I'm saying that we (the jenkins-infra team) tend to avoid touching things that we do not understand when it can impact others, unless of course we have an idea of the scope (and it matches our skills and knowledge).

So I'm asking for clarification: I'm not as skilled as you or other contributors, so I need help to understand what has to be done if you want me or the team to do it.

basil commented 1 year ago

I do not see a substantial difference between working on pipeline-library, which is effectively a set of Jenkinsfiles for plugins and other repositories, and the Jenkinsfile of a particular repository. If your team does not want to do the work, please move this ticket to an issue tracking component used by the development team.

basil commented 1 year ago

I have removed this issue from the not-directly-actionable-by-infra-team milestone. This issue is directly actionable by the infrastructure team as in the last paragraph of https://github.com/jenkins-infra/helpdesk/issues/3117#issuecomment-1235907272.

dduportal commented 1 year ago

@basil I think I understand what you are saying, but please, can you let the infrastructure team manage their milestones, as it helps us to track our work in a consensual way.

For info, we do the milestone changes during the weekly meeting (which did not happen this week due to devopsworld). Based on the inputs you gave + James' inputs, the infra team was going to reconsider and see what should be done.

basil commented 1 year ago

As I wrote previously, if the infrastructure team does not consent to doing this work, please move this ticket to an issue tracking component used by the development team.

dduportal commented 1 year ago

As I wrote previously, if the infrastructure team does not consent to doing this work, please move this ticket to an issue tracking component used by the development team.

I've never implied that.

We are happy to take this task. We'll plan it at our next infra meeting and work on it when we are able to.

basil commented 1 year ago

Still need to determine whether write caching is enabled or disabled and enable it if necessary.

dduportal commented 1 year ago

Quick update: https://github.com/jenkins-infra/jenkins-infra/pull/2635 changes the type of disk used by the VM instances (NOT Windows container!) from HDD to premium SSD.

It could be interesting to check the difference once deployed.

daniel-beck commented 1 year ago

@dduportal Should a ci.j.io build from today show this change? https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7669/1/pipeline/146 still took 5 hrs to build on Windows, compared to 1.5 hrs on Linux.

dduportal commented 1 year ago

@dduportal Should a ci.j.io build from today show this change? https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7669/1/pipeline/146 still took 5 hrs to build on Windows, compared to 1.5 hrs on Linux.

The builds on ci.jenkins.io using a Windows VM agent are expected to see an improvement, but this has to be confirmed.
The job you linked uses the label maven-17-windows, which is a Windows container (running in ACI). These container agents were not in the scope of the change above (using premium SSD). We are going to look at the ACI container resource to see if we can specify an improved disk for these. The alternative is migrating this workload from ACI to a Kubernetes cluster that we manage, with high-end SSDs (and a Windows machine pool).

timja commented 1 year ago

Another example that's a bit simpler than core: https://ci.jenkins.io/blue/organizations/jenkins/Plugins%2Fpipeline-graph-view-plugin/detail/main/143/pipeline/69

Over 3x slower on Windows

smerle33 commented 1 year ago

Another example that's a bit simpler than core: https://ci.jenkins.io/blue/organizations/jenkins/Plugins%2Fpipeline-graph-view-plugin/detail/main/143/pipeline/69

Over 3x slower on Windows

It looks like it is also running on an ACI container; for now we have only improved the Windows VMs. We will have a look at those ACI containers soon.
