daniel-beck opened this issue 1 year ago
Waiting six hours (almost a work day) to get an incrementals deployment
It feels like getting an incrementals deployment is a lucky shot if you build at peak times. I can't recall the number of times builds had to be restarted recently because Windows machines were either disconnected or killed.
I think the Windows builds have been very slow since I re-enabled them in jenkinsci/jenkins#6024. I am not sure that throwing more hardware resources at the problem would necessarily improve performance. I seem to recall an open Jira ticket about one possible cause: the creation and subsequent deletion of a fresh Jenkins home directory for each test (which involves extracting many .jpi files that each contain many tiny files) is far slower on Windows than it is on Unix-like systems. I think that ticket mentioned the idea of implementing plugin class loading without unzipping each plugin .jpi file, using something like Tomcat's unpackWARs=false, i.e. https://github.com/apache/tomcat/blob/5190f92b5e8288cde5c0f4a9814b46166e6447bb/java/org/apache/catalina/webresources/JarWarResourceSet.java. That might be possible in theory, but it would be some amount of work to implement, and based on the comments on that ticket the result might just trade one performance problem for another. I doubt there is an easy or practical option in the short to medium term.
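To illustrate the idea (this is a minimal, hypothetical sketch, not Jenkins' actual plugin-loading code): a stock `URLClassLoader` can already serve classes and resources straight out of an archive, which is the essence of the `unpackWARs=false` approach — no per-file extraction, so no Windows small-file I/O penalty:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class JarWithoutUnpack {
    // Builds a tiny stand-in for a .jpi archive and reads a resource from it
    // without ever extracting the archive to a directory on disk.
    public static String readFirstLine() throws Exception {
        Path jar = Files.createTempFile("demo-plugin", ".jpi");
        try {
            try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
                out.putNextEntry(new JarEntry("index.txt"));
                out.write("hello from inside the archive".getBytes("UTF-8"));
                out.closeEntry();
            }
            // URLClassLoader reads directly from the jar; nothing is unpacked.
            try (URLClassLoader cl = new URLClassLoader(new URL[]{jar.toUri().toURL()}, null);
                 BufferedReader r = new BufferedReader(
                         new InputStreamReader(cl.getResourceAsStream("index.txt"), "UTF-8"))) {
                return r.readLine();
            }
        } finally {
            Files.deleteIfExists(jar);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readFirstLine());
    }
}
```

The hard part for Jenkins would not be reading from the archive, but everything that currently assumes an exploded plugin directory — which is presumably why the ticket considered it a significant amount of work.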
Flakiness could be mitigated in the short term by adding e.g. retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) to the Jenkinsfile, as was done in buildPlugin() and the BOM Jenkinsfile, though this merely hides the problem rather than fixing the root cause. Being one to prefer fixing the root cause, I would not oppose such a change (nor did I oppose it for buildPlugin() and the BOM Jenkinsfile), but I have not gone out of my way to implement it.
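As a sketch of what that retry wrapper looks like in a Jenkinsfile (the node label and build step here are placeholders, not the actual core Jenkinsfile contents):

```groovy
// Retry the whole agent allocation up to 3 times, but only for
// infrastructure-style failures (Kubernetes agent loss, non-resumable step),
// so genuine test failures are still reported on the first attempt.
retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) {
    node('windows-amd64') {          // placeholder label
        checkout scm
        bat 'mvn -B -ntp verify'     // placeholder build step
    }
}
```

The conditions matter: a bare retry(3) would also rerun legitimate test failures, tripling the cost of a red build.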
Perhaps we ought to declare that we are not getting much value from Windows testing and reduce its scope to just those tests in the org.jvnet.hudson.test.SmokeTest group by adding -Psmoke-test to the core Jenkinsfile on Windows. While this lowers test coverage, it would improve test runtime and cut costs, and if we do not feel the value is high, it may be a decent tradeoff.
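For illustration only (the real core Jenkinsfile is structured differently), restricting the Maven profile by platform inside a pipeline step might look like:

```groovy
// Hypothetical sketch: run the full suite on Linux, but only the
// org.jvnet.hudson.test.SmokeTest group on Windows via -Psmoke-test.
def mavenArgs = '-B -ntp verify'
if (isUnix()) {
    sh "mvn ${mavenArgs}"
} else {
    bat "mvn ${mavenArgs} -Psmoke-test"
}
```

isUnix() is a standard pipeline step, so the same script body can be shared across the platform axes of the build matrix.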
Another alternative might be to do the incrementals deployment once one (or both) of the Linux builds has passed, so we don't wait for the slower Windows build to finish. While we'd want Windows coverage before merging, I would expect actively having to wait for builds to finish to be rare, whereas waiting for an incrementals deployment is probably more common. Of course, the use case of integrated core + plugin PRs also isn't that common…
Might need a careful look at incrementals validation to see whether this is even doable.
Random thought: I believe that on Windows Server, disk write caching is disabled by default (it is enabled by default on client OSes). If we are using ephemeral machines and it is disabled, enabling it may well help a bit. (I know Jenkins startup is slower on Windows than on Linux for the same hardware, but not normally by the factor observed in this ticket.)
There are two options: write caching, and write-cache buffer flushing. The latter option may also help (though depending on the drive it could hinder).
@lemeurherve added this to the not-actionable-by-infra-team milestone 7 days ago
Is this really the case? I think the Jenkinsfile changes I suggested above are actionable by the infrastructure team.
The infra team tends to avoid changing the Jenkinsfile of the Jenkins core project so as not to interfere with the contribution process, as it might impact people in knowledge areas that we do not have. This is why we added this milestone: to mark this issue and watch it, without really knowing what to do with it.
Your suggestion still seems actionable: if I understand correctly, the scope is to make sure that failed Windows test suites are retried until the root cause is identified. Is my understanding correct?
> Random thought: I believe that on Windows Server, disk write caching is disabled by default (it is enabled by default on client OSes). If we are using ephemeral machines and it is disabled, enabling it may well help a bit.
> There are two options: write caching, and write-cache buffer flushing. The latter option may also help (though depending on the drive it could hinder).
Interesting. If I understand correctly, this would be a Windows setting? Since we customize the VM images, that should be easy to do in https://github.com/jenkins-infra/packer-images/blob/main/provisioning/windows-provision.ps1?
Or is it a cloud-related setting in the VM definition (e.g. in the EC2 and Azure VM plugin setups)?
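If it does turn out to be an in-guest Windows setting, a hedged sketch for the provisioning script might use the Storage module cmdlets (untested on the CI images; cmdlet availability and behavior depend on the Windows SKU and disk driver, so verify before adopting):

```powershell
# Sketch only - check the current cache policy of each physical disk first:
Get-PhysicalDisk | Get-StorageAdvancedProperty

# Mark disk 0 as power-protected so Windows skips write-cache buffer flushing.
# Acceptable on ephemeral CI agents, where losing in-flight data on power
# failure does not matter; do NOT do this on machines holding durable state.
Set-StorageAdvancedProperty -DiskNumber 0 -IsPowerProtected $true
```

The write-caching toggle itself may instead live in the device policy (Device Manager / registry) depending on the SKU, which is why inspecting the advanced properties first is worthwhile.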
> The infra team tends to avoid changing the Jenkinsfile of the Jenkins Core project

https://github.com/jenkins-infra/pipeline-library/commits?author=dduportal ?
I'm not sure I understand; could you clarify?
I'm not saying that the infra team is not going to take care of that. I'm saying that we (the jenkins-infra team) tend to avoid touching things that we do not understand when it can impact others, unless of course we have an idea of the scope (and it meets our skills and knowledge).
So I'm asking for clarification: I'm not as skilled as you or other contributors, so I need help understanding what has to be done if you want me or the team to do it.
I do not see a substantial difference between working on pipeline-library, which is effectively a set of Jenkinsfiles for plugins and other repositories, and the Jenkinsfile of a particular repository. If your team does not want to do the work, please move this ticket to an issue tracking component used by the development team.
I have removed this issue from the not-directly-actionable-by-infra-team milestone. This issue is directly actionable by the infrastructure team as in the last paragraph of https://github.com/jenkins-infra/helpdesk/issues/3117#issuecomment-1235907272.
@basil I think I understand what you are saying, but please let the infrastructure team manage its milestones, as that helps us track our work by consensus.
For info, we make milestone changes during the weekly meeting (which did not happen this week due to DevOps World). Based on your input and James' input, the infra team was going to reconsider and see what should be done.
As I wrote previously, if the infrastructure team does not consent to doing this work, please move this ticket to an issue tracking component used by the development team.
> As I wrote previously, if the infrastructure team does not consent to doing this work, please move this ticket to an issue tracking component used by the development team.
I've never implied that.
We are happy to take this task; we'll plan it at our next infra meeting and work on it when we are able to.
Still need to determine whether write caching is enabled or disabled and enable it if necessary.
Quick update: https://github.com/jenkins-infra/jenkins-infra/pull/2635 changes the type of disk used by the VM instances (NOT Windows container!) from HDD to premium SSD.
It could be interesting to check the difference once deployed.
@dduportal Should a ci.j.io build from today show this change? https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7669/1/pipeline/146 still took 5 hrs to build on Windows, compared to 1.5 hrs on Linux.
> @dduportal Should a ci.j.io build from today show this change? https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7669/1/pipeline/146 still took 5 hrs to build on Windows, compared to 1.5 hrs on Linux.
maven-17-windows is a Windows container (running in ACI). These container agents were not in the scope of the change above (using premium SSD). We are going to look at the ACI container resource to see if we can specify an improved disk for these ones. An alternative is migrating this workload from ACI to a Kubernetes cluster that we manage, with high-end SSDs (and a Windows machine pool).
Another example that's a bit simpler than core: https://ci.jenkins.io/blue/organizations/jenkins/Plugins%2Fpipeline-graph-view-plugin/detail/main/143/pipeline/69
Over 3x slower on Windows
It looks like it is also running on an ACI container; for now we have only improved the Windows VMs. We will have a look at those ACI agents soon.
Service(s)
ci.jenkins.io
Summary
Looking through some successful builds in https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/activity I see wildly different build durations per platform:
https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7050/2/pipeline
https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7047/3/pipeline
https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7054/1/pipeline
https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/master/4002/pipeline
Waiting six hours (almost a work day) to get an incrementals deployment seems too long, especially when Linux builds are reliably done in two hours, sometimes less.
Reproduction steps
No response