jenkinsci / openstack-cloud-plugin

Provision nodes from OpenStack on demand
https://plugins.jenkins.io/openstack-cloud
MIT License

Fix #56: Introduce backoff for provisioning #365

Closed · scottmarlow closed this 1 year ago

scottmarlow commented 1 year ago

Perhaps this might help with https://github.com/jenkinsci/openstack-cloud-plugin/issues/56: on provisioning failure we do exponential backoff until we reach a maximum delay (currently hard coded to 5 minutes, but that could be changed to whatever delay we prefer, or driven by configuration settings).

We will initially try connecting again after failure following a schedule like:

- 2 second sleep before trying again
- 4 second sleep before trying again
- 8 second sleep before trying again
- 16 second sleep before trying again
- 32 second sleep before trying again
- 64 second sleep before trying again
- ...

...until the sleep reaches 5 minutes or more, at which point we keep sleeping that long after each failure until we reach the timeout.

The advantage of this approach is that Jenkins uses less of its machine's CPU, since the backoff delay/sleep time between attempts keeps increasing.

  1. The jenkins.plugins.openstack.agentProvisioningBackoffOnFailureInitialSeconds parameter overrides the initial sleep (default: 2 seconds) performed after the first provisioning failure. Note that a successful provisioning followed by another failure resets the sequence: the exponential backoff starts again from this initial value.
  2. The jenkins.plugins.openstack.agentProvisioningBackOffLimit parameter overrides the maximum sleep delay (default: 5 minutes) after provisioning failures.
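
For illustration only, here is a minimal sketch of the delay sequencing those two properties would drive; this is not the actual patch, and `tryProvision` plus the timeout bookkeeping are hypothetical stand-ins:

```java
import java.util.concurrent.TimeUnit;

public class BackoffSketch {
    // Property names taken from this PR; defaults: 2s initial sleep, 5min (300s) cap.
    static final long INITIAL_SECONDS = Long.getLong(
            "jenkins.plugins.openstack.agentProvisioningBackoffOnFailureInitialSeconds", 2);
    static final long LIMIT_SECONDS = Long.getLong(
            "jenkins.plugins.openstack.agentProvisioningBackOffLimit", 300);

    static void provisionWithBackoff(long timeoutSeconds) throws InterruptedException {
        long delay = INITIAL_SECONDS;
        long slept = 0;
        while (!tryProvision()) {                       // hypothetical provisioning call
            if (slept >= timeoutSeconds) {
                throw new IllegalStateException("provisioning timed out");
            }
            TimeUnit.SECONDS.sleep(delay);              // 2s, 4s, 8s, ... between attempts
            slept += delay;
            delay = Math.min(delay * 2, LIMIT_SECONDS); // double, but cap at the limit
        }
        // A later failure after a success would start again from INITIAL_SECONDS.
    }

    static boolean tryProvision() {
        return false; // stand-in for the real OpenStack provisioning attempt
    }
}
```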

Related (currently open) pull requests:

https://github.com/jenkinsci/openstack-cloud-plugin/pull/351 introduces a provisioning delay option, "jenkins.plugins.openstack.agentProvisioningDelaySeconds", which, when passed to the Jenkins JVM, delays the deployment request for each individual VM by a given number of seconds. While my preference is to add exponential backoff via https://github.com/jenkinsci/openstack-cloud-plugin/pull/365, there may also be specific use cases where jenkins.plugins.openstack.agentProvisioningDelaySeconds helps more (e.g. to reduce the overall amount of CPU used for provisioning), so I think both are needed.

Thanks for your time and review.

Testing done

Passed unit tests only.

### Submitter checklist
- [x] Make sure you are opening from a **topic/feature/bugfix branch** (right side) and not your main branch!
- [x] Ensure that the pull request title represents the desired changelog entry
- [x] Please describe what you did
- [x] Link to relevant issues in GitHub or Jira
- [x] Link to relevant pull requests, esp. upstream and downstream changes
- [ ] Ensure you have provided tests that demonstrate the feature works or fixes the issue

winklerm commented 1 year ago

@scottmarlow Awesome, thanks a lot for implementing this!

Ideally, I think we should be able to turn on/off and configure the behaviour from Jenkins UI/casc.yaml but I guess the system properties are a good start. LGTM, but I do not feel knowledgeable enough to approve - let's see what the maintainers say.

scottmarlow commented 1 year ago

> @scottmarlow Awesome, thanks a lot for implementing this!
>
> Ideally, I think we should be able to turn on/off and configure the behaviour from Jenkins UI/casc.yaml but I guess the system properties are a good start. LGTM, but I do not feel knowledgeable enough to approve - let's see what the maintainers say.

Thanks for the feedback @winklerm!

So, I think there are three possible options to consider for the UI/casc.yaml configuration that you are suggesting:

  1. Use exponential backoff vs constant delay after provisioning failure.
  2. Initial delay after provisioning failure (configuration item could be shared between exponential backoff + constant delay.)
  3. Max delay after provisioning failure (configuration item could be shared between exponential backoff + constant delay.)

I'm not really sure the current constant delay after provisioning failure is something we should keep, but I'm not against it. If we do present the choice, that complicates use of the plugin, but that can be addressed with documentation changes. My preference would be to switch completely to exponential backoff, but I could make changes 1-3 if requested (or whatever variation we determine is best.)

olivergondza commented 1 year ago

Thanks for your contribution, @scottmarlow!

Please note that the place where you have introduced the throttling is the wait for the provisioned VM to come online (JNLP to connect, SSH to open port 22)*, not the provisioning itself. So the effect will be that it takes Jenkins longer to notice the node is up, but the provisioning rate would not change. This needs to be moved elsewhere - JCloudsSlaveTemplate#provisionSlave() should not start with a dynamically determined delay either; I would find such a contract quite surprising. Let's keep it doing what it says.

https://github.com/jenkinsci/openstack-cloud-plugin/blob/42bf768cf26ae4c0e9fc57353d1cad30b3dc6d0c/plugin/src/main/java/jenkins/plugins/openstack/compute/JCloudsCloud.java#L294 is a better place to throttle provisioning spikes. And instead of introducing a delay, I would go with capping the number of provisioning attempts to start in one go (excessWorkload = Math.min(excessWorkload, 10)). The scheduling rate is something Jenkins is responsible for, and the contract between the Jenkins node provisioner and a cloud implementation is asynchronous; waiting in this context would rock the boat for all other Jenkins clouds. Provisioning less than Jenkins suggested would cause Jenkins to ask for the rest a little later, effectively implementing throttling.
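
A minimal sketch of what that capping could look like (the method signature is the standard Jenkins `Cloud#provision` contract, but the body is simplified and the constant 10 is the value suggested above, not a tuned number):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import hudson.model.Label;
import hudson.slaves.NodeProvisioner;

// Simplified sketch of JCloudsCloud#provision(); the real method resolves
// templates and schedules asynchronous provisioning, which is elided here.
public Collection<NodeProvisioner.PlannedNode> provision(Label label, int excessWorkload) {
    // Start at most 10 provisioning attempts per round. Jenkins' node
    // provisioner will ask again shortly for any workload still unsatisfied,
    // so capping here throttles spikes without blocking the async contract.
    excessWorkload = Math.min(excessWorkload, 10);

    List<NodeProvisioner.PlannedNode> plannedNodes = new ArrayList<>();
    while (excessWorkload > 0) {
        // ... create one PlannedNode backed by an asynchronous Future ...
        excessWorkload--; // simplified: assume one executor per node
    }
    return plannedNodes;
}
```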


*) I agree that a 2-second polling period while waiting for provisioning to complete might be too short. It has been a long time since I used to see my VMs up in <10s :cry:. For my use-cases (package install in cloud-config), it is more like 60s+. I would be ok relaxing this to 5-10 seconds, and personally would not see the benefit of bothering with exponential backoff (after all, we have not demonstrated that waiting for the VM is a bottleneck of any kind). Though if we do use exponential backoff, I would set the max waiting time (proposed 5min) below the anticipated provisioning times, so as not to delay detecting completion too much.

The proposed scheme would check 40s after start, then at 72s, and after that the delay would be <1 minute. This would add tens of seconds of delay before my nodes appear in Jenkins for an average provisioning attempt.

scottmarlow commented 1 year ago

> https://github.com/jenkinsci/openstack-cloud-plugin/blob/42bf768cf26ae4c0e9fc57353d1cad30b3dc6d0c/plugin/src/main/java/jenkins/plugins/openstack/compute/JCloudsCloud.java#L294 is a better place to throttle provisioning spikes. And instead of introducing delay, I would go with capping the max provisioning attempts to start in one go (excessWorkload = Math.min(excessWorkload, 10)).

https://github.com/scottmarlow/openstack-cloud-plugin/tree/JCloudsCloud_throttle_provisioning_spikes has that change.

scottmarlow commented 1 year ago

My specific use case is a migration to a new Jenkins: I am keeping my old Jenkins up in parallel for a period of time, and as such I am over-consuming my OpenStack environment, which shows up as (repeated) provisioning failures in the Jenkins log:

Caused: jenkins.plugins.openstack.compute.internal.Openstack$ActionFailed: Quota exceeded for cores, instances, ram: Requested 1, 1, 4096, but already used 272, 268, 1097728 of 240, 240, 983040 cores, instances, ram
        at jenkins.plugins.openstack.compute.internal.Openstack.bootAndWaitActive(Openstack.java:600)
        at jenkins.plugins.openstack.compute.JCloudsSlaveTemplate.provisionServer(JCloudsSlaveTemplate.java:347)
        at jenkins.plugins.openstack.compute.JCloudsSlaveTemplate.provisionSlave(JCloudsSlaveTemplate.java:217)
        at jenkins.plugins.openstack.compute.JCloudsCloud$NodeCallable.call(JCloudsCloud.java:332)
        at jenkins.plugins.openstack.compute.JCloudsCloud$NodeCallable.call(JCloudsCloud.java:319)
        at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
        at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

We can say "don't do things like ^ that cause excessive CPU use", but I think the backoff would have dealt with this more gracefully, consuming less Jenkins (server) CPU.

I'll add the other change to increase the provisioning poll from 2 seconds to 5-10 seconds, which does reduce the cost of hitting the Quota exceeded failure, but not by as much as adding the backoff (in my opinion).

Another possible (user) workaround could be to double my quota to better handle the migration from the old Jenkins to the new one, but that comes at a (money) cost of needing more OpenStack resources for a possibly short-term migration period (in my case it may be months, I'm not sure yet).

scottmarlow commented 1 year ago

> I agree that 2-second polling period while waiting for provisioning to complete might be too short. It has been a long time since I used to see my VMs up in <10s :cry:. For my use-cases (package install in cloud-config), it is more like 60s+. I would be ok to relax this to 5-10 seconds

https://github.com/scottmarlow/openstack-cloud-plugin/tree/JCloudsCloud_throttle_provisioning_spikes has the change to 6 seconds.

scottmarlow commented 1 year ago

> My specific use case is a migration to a new Jenkins: I am keeping my old Jenkins up in parallel for a period of time, and as such I am over-consuming my OpenStack environment, which shows up as (repeated) provisioning failures in the Jenkins log:
>
> Caused: jenkins.plugins.openstack.compute.internal.Openstack$ActionFailed: Quota exceeded for cores, instances, ram: Requested 1, 1, 4096, but already used 272, 268, 1097728 of 240, 240, 983040 cores, instances, ram
>         at jenkins.plugins.openstack.compute.internal.Openstack.bootAndWaitActive(Openstack.java:600)
>         at jenkins.plugins.openstack.compute.JCloudsSlaveTemplate.provisionServer(JCloudsSlaveTemplate.java:347)
>         at jenkins.plugins.openstack.compute.JCloudsSlaveTemplate.provisionSlave(JCloudsSlaveTemplate.java:217)
>         at jenkins.plugins.openstack.compute.JCloudsCloud$NodeCallable.call(JCloudsCloud.java:332)
>         at jenkins.plugins.openstack.compute.JCloudsCloud$NodeCallable.call(JCloudsCloud.java:319)
>         at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
>         at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:829)

Could it make sense to check for certain failures, like Quota exceeded, and back off up to a max of the anticipated provisioning time? It is possible that in my case the Quota exceeded failure will go away, or it might not. If it doesn't go away, I think the failures might continue to repeat for many hours.
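
For example (a hypothetical sketch, not part of this PR; it matches on the message text since the quota failure surfaces as a generic `Openstack$ActionFailed`):

```java
// Hypothetical helper: grow the retry delay only for quota-style failures,
// capping it at the anticipated provisioning time so completion detection
// is not delayed much for ordinary provisioning attempts.
static long nextDelaySeconds(Throwable failure, long currentDelaySeconds,
                             long anticipatedProvisioningSeconds) {
    String message = failure.getMessage();
    if (message != null && message.contains("Quota exceeded")) {
        return Math.min(currentDelaySeconds * 2, anticipatedProvisioningSeconds);
    }
    return currentDelaySeconds; // other failures keep the existing pacing
}
```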

olivergondza commented 1 year ago

Superseded by #368.