jenkinsci / build-timeout-plugin

Jenkins build-timeout plugin
https://plugins.jenkins.io/build-timeout/
31 stars 80 forks source link

Introduce Global Time Out #81

Closed sghill closed 2 years ago

sghill commented 4 years ago

This change introduces a global build time out that re-uses the existing strategies and operations. The current functionality requires users to opt-in and update the configuration of each build.

Goal

I'd like to use a global timeout to ensure builds don't run longer than the set time. Occasionally builds hang, and in large deployments it's easy for this to go unnoticed. A hanging build that goes unnoticed occupies an executor indefinitely, reducing the capacity of the system and potentially allowing the queue for a particular job to build up.

Implementation

On each build:

  1. Find a configured global time out
  2. Schedule the configured time out operations on a dedicated ScheduledExecutorService with one thread
  3. Once complete, cancel the task

Implemented with a hudson.model.listeners.RunListener so it can be applied to every build. The existing implementation uses a hudson.tasks.BuildWrapper, which requires a checkbox on each build.

Screenshots

The configuration screen looks like this: system-configuration

When a build fails due to a global timeout, it is noted in the build's log: build-log

Extensive logging is available on the FINE level: logs

sghill commented 4 years ago

@oleg-nenashev thanks for the encouragement of a global build timeout. I'd appreciate your thoughts on this implementation if you get a chance.

krisstern commented 2 years ago

Hi @sghill Thanks for this PR! However, could you please fix the sole failing test for the Windows build before I can do a thorough review of it? Please note also that some PR's have recently been merged so please be careful to rebase before pushing again if you have not already done so.

sghill commented 2 years ago

Hi @krisstern, thanks for the ping. I rebased the PR and the tests are all passing.

krisstern commented 2 years ago

Thanks @sghill for the rebase! Let me start the reviewing process soon.

krisstern commented 2 years ago

HI @sghill My apologies for having to ask you to rebase one more time, as I have just merged a PR to switch Java support from version 8 to 11. Without this change I cannot check out then try out your new feature, as I was experiencing some setup issues previously. Many thanks for this!

sghill commented 2 years ago

Hi @krisstern, I've rebased and updated to Mockito 4 but I'm getting some odd errors from the windows build. Is it possible there is an issue with the Java install on this agent?

[2022-02-18T18:54:49.915Z] [ERROR] C:\Jenkins\workspace\ugins_build-timeout-plugin_PR-81\src\main\java\hudson\plugins\build_timeout\global\Lifecycle.java:31: Undefined reference: java.util.function.Supplier

Separately, I wanted to share that I'd be happy with other implementations as well. I tried to make this PR testable and fit well within the plugin by supporting existing strategies, but I only plan to use a fixed 24 hour timeout on our Jenkins controllers. If there is a simpler way to achieve that, I'm all for it.

krisstern commented 2 years ago

Hi @sghill I think it may have to do with the signature currently set for the animal-sniffer-maven-plugin to be 17, as there are currently no available signature for Java 11: https://github.com/jenkinsci/build-timeout-plugin/blob/65e8275199c976ea85521425ab26550f26488ff8/pom.xml#L67-L85

So it may have been introduced by me inadvertently while trying for a quick fix that worked previously. So I would recommend you do try and tweak the POM a bit according to https://github.com/mojohaus/animal-sniffer/issues/62#issuecomment-459021078. I think the error we are encountering is similar to the one for Java 8 discussed on Stack Overflow at https://stackoverflow.com/q/45997103/9959070.

sghill commented 2 years ago

@krisstern The build is green.

My read of this comment is the animal-sniffer-plugin can be removed because the maven-compiler-plugin now covers the same featureset when the release flag is set (it already is):

Apart from that you can use the release configuration in maven-compiler-plugin with JDK9+ to have exactly what animal sniffer offers and that's the reason why there are no JDK9+ signatures.

Unfortunately removing the plugin block from the pom doesn't work, so it seems the plugin is requiring 11 but disallowing any new APIs after 8. It'd be nice to keep the minimum version at 8 if that's the case.

krisstern commented 2 years ago

Yeah, I can understand. On top of the aforementioned comment there is also https://github.com/mojohaus/animal-sniffer/issues/62#issuecomment-459021086, so maybe I could take another look after this PR is reviewed and processed to see what to do with animal-sniffer next. My hunch is sooner or later we will need to get rid of it one way or the other with some config.

krisstern commented 2 years ago

Hi @sghill I am in the process of reviewing the code in this PR. So far so good, I cannot help but noticed the following default time limit of 3 minutes or 180 seconds for the Absolute and No Activity timeout strategies accordingly:

Screenshot 2022-02-27 at 12 03 48 PM Screenshot 2022-02-27 at 12 04 20 PM

I think the timeout settings of the other strategies look okay. But would like to know if we could increase the default to something slightly longer like 5 minutes or 300 seconds for these two strategies only, respectively? Or did you have anything particular in mind when setting the default to 3 minutes effectively to both these strategies. Three minutes seem a bit short for me, but of course it depends on what we are trying to build as well. Say previously in your example you had been using 4 minutes for the Absolute strategy...

krisstern commented 2 years ago

I have reviewed the code, everything looks okay

krisstern commented 2 years ago

@oleg-nenashev If you have some time, please review this PR so I could process it promptly.

sghill commented 2 years ago

if we could increase the default to something slightly longer like 5 minutes or 300 seconds for these two strategies only, respectively? Or did you have anything particular in mind when setting the default to 3 minutes effectively to both these strategies.

Hi @krisstern, definitely didn't intend for a 3 minute default. I think that's coming from here? I imagine very few users would use the default, so perhaps it's best to clear it out if possible? I intend to set to 24 hours on our instances.

krisstern commented 2 years ago

Hi @sghill Or alternatively we could overwrite the timeout duration for the strategy say for the Absolute Timeout Strategy to something else here... I noticed the timeout settings of the Deadline Timeout Strategy, the Elastic Timeout Strategy, and also the Likely Stuck Timeout Strategy seem more agreeable.

So maybe we could look into tweaking the default timeout settings somewhere like the following two places for the Absolute Timeout Strategy class file? I will need time to look into the impact of modifying the BuildTimeoutWrapper.MINIMUM_TIMEOUT_MILLISECONDS more thoroughly later. It may be needed to set the lower bound of the timeout limits still.

https://github.com/jenkinsci/build-timeout-plugin/blob/65e8275199c976ea85521425ab26550f26488ff8/src/main/java/hudson/plugins/build_timeout/impl/AbsoluteTimeOutStrategy.java#L32

https://github.com/jenkinsci/build-timeout-plugin/blob/65e8275199c976ea85521425ab26550f26488ff8/src/main/java/hudson/plugins/build_timeout/impl/AbsoluteTimeOutStrategy.java#L43

krisstern commented 2 years ago

Or we could leave things as they are... And let the user choose a timeout value manually different from the default one.

krisstern commented 2 years ago

@sghill Some conflicts have been introduced because of some urgent PR that needed to be closed. Would you like to rebase your PR so I could have it merge as soon as possible? My apologies for the confusion.

sghill commented 2 years ago

Hi @krisstern. I have rebased, but noticed a couple issues I'd like your input on before merging.

  1. I'd like to disable the global timeout strategy by unchecking the box. Today if I uncheck and reload the page, it's checked again. Is there a recommended way to do this?
  2. when I mvn hpi:run, I'm seeing an error that looks like it may expect a ldap dependency?
WARNING h.ExtensionFinder$GuiceFinder$FaultTolerantScope$1#error: Failed to instantiate Key[type=jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl, annotation=[none]]; skipping this component
java.lang.ClassNotFoundException: org.acegisecurity.userdetails.ldap.LdapUserDetails
    at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1402)
    at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1357)
    at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1112)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
Caused: java.lang.NoClassDefFoundError: org/acegisecurity/userdetails/ldap/LdapUserDetails
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at hudson.model.Descriptor.<init>(Descriptor.java:304)
    at jenkins.security.plugins.ldap.LDAPGroupMembershipStrategyDescriptor.<init>(LDAPGroupMembershipStrategyDescriptor.java:32)
    at jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl.<init>(FromUserRecordLDAPGroupMembershipStrategy.java:94)
    at jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl$$FastClassByGuice$$491260477.GUICE$TRAMPOLINE(<generated>)
    at jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl$$FastClassByGuice$$491260477.apply(<generated>)
    at com.google.inject.internal.DefaultConstructionProxyFactory$FastClassProxy.newInstance(DefaultConstructionProxyFactory.java:82)
    at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:114)
    at com.google.inject.internal.ConstructorInjector.access$000(ConstructorInjector.java:33)
    at com.google.inject.internal.ConstructorInjector$1.call(ConstructorInjector.java:98)
    at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:109)
    at hudson.ExtensionFinder$GuiceFinder$SezpozModule.onProvision(ExtensionFinder.java:568)
    at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:117)
    at com.google.inject.internal.ProvisionListenerStackCallback.provision(ProvisionListenerStackCallback.java:66)
    at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:93)
    at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:296)
    at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
Caused: com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) [Guice/ErrorInjectingConstructor]: NoClassDefFoundError: org/acegisecurity/userdetails/ldap/LdapUserDetails
  at FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl.<init>(FromUserRecordLDAPGroupMembershipStrategy.java:94)

Learn more:
  https://github.com/google/guice/wiki/ERROR_INJECTING_CONSTRUCTOR

1 error

======================
Full classname legend:
======================
FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl: "jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl"
========================
End of classname legend:
========================
krisstern commented 2 years ago

Hi @sghill Thanks for the prompt response!

Let me answer the checkbox question first. I think to disable it you can modify https://github.com/jenkinsci/build-timeout-plugin/blob/c7809e44736d693b8ae8990fa688f10eef8db492/src/main/resources/hudson/plugins/build_timeout/global/GlobalTimeOutConfiguration/config.jelly#L7

I think it is better to get rid of the option checked="${instance.enabled}" so that by default the options is disabled, like the example they have for OptionalBlock at https://wiki.jenkins.io/display/JENKINS/Jelly+form+controls.

krisstern commented 2 years ago

The build seems to have passed all CI/CD tests on GitHub though, so should be okay. Could be something with your local setup.

krisstern commented 2 years ago

BTW, have you tried running mvn clean before you run mvn hpi:run, and did it make a difference? I will check out this branch and test locally later over the next few days just to make sure everything is okay.

sghill commented 2 years ago

Hi @krisstern,

I think it is better to get rid of the option checked="${instance.enabled}" so that by default the options is disabled

I'm seeing it disabled by default, are you seeing different behavior? When I remove the checked attribute it shows as unchecked on page reload, even if there is a global timeout set. The timeout is active too.

I think it should automatically show on page load as checked if configured, but when unchecked the configuration is cleared. I've updated the code to do this -- if the submitted form object is null, we now manually null out the class properties before saving.

have you tried running mvn clean before you run mvn hpi:run, and did it make a difference?

I tried mvn clean and the stacktrace still appeared. Manually removing the ldap plugin from work/plugins did solve it.

New Logs

I've added toString implementations on all the strategies for use with logging. I don't think this should be a problem, but wanted to call it out since it's new. Here are some examples of the logs it will show now:

On load:

INFO    h.p.b.g.GlobalTimeOutConfiguration#load: global timeout not set
INFO    h.p.b.g.GlobalTimeOutConfiguration#load: global timeout loaded as AbsoluteTimeOutStrategy[timeoutMinutes='1440'] with operations: AbortOperation

On update:

INFO    h.p.b.g.GlobalTimeOutConfiguration#configure: global timeout updated to AbsoluteTimeOutStrategy[timeoutMinutes='300'] with operations: AbortOperation, WriteDescriptionOperation
INFO    h.p.b.g.GlobalTimeOutConfiguration#configure: global timeout updated to LikelyStuckTimeOutStrategy[preferred='10 x estimated duration', fallback='24 hours'] with operations: AbortOperation

On clearing the 'Enable' checkbox:

INFO    h.p.b.g.GlobalTimeOutConfiguration#configure: global timeout has been cleared
krisstern commented 2 years ago

Hi @sghill My apologies I think I may have misunderstood your question regarding the checkbox earlier. So I have tested your latest branch locally on my computer and everything works fine.

I have approved the PR and will merge it over the next few days. Thanks for your contribution!

sghill commented 2 years ago

Thanks @krisstern! Is there a release coming up soon? Also happy to try out a candidate and report back.

sghill commented 2 years ago

Hi @krisstern, can I help create a new release? Looks like the last one was May 2020.

krisstern commented 2 years ago

Hi @sghill. Yup, I agree it is time for me to cut a new release. Let me study up the docs and get back to you shortly. I had missed your prevous message hence the late reply.

krisstern commented 2 years ago

Hi @sghill I have been trying to get the plugin code working locally on my new laptop over the last few days. However, I have been running into some problems getting the tests working again. Seems like once I have fixed up the POM file after updating to the latest weekly jenkins version, the hudson.Extension.optional() method from java.lang.* is gone due to some dependency issues.

As a result, I may need more time to get the code running again before I can cut a release.

BTW, would you like to help me to debug this issue? I could send you the new pom.xml file I have for you to see if there is anything that can be done with it to make the code run again.

sghill commented 2 years ago

Hi @krisstern, sure I can take a look. Can you push up another branch with the changes?

krisstern commented 2 years ago

That's great @sghill! The branch with the new pom.xml is at https://github.com/krisstern/build-timeout-plugin/tree/preparing-for-new-release-2022

krisstern commented 2 years ago

Hi @sghill New release draft version 1.21 is ready for testing: https://github.com/jenkinsci/build-timeout-plugin/releases/tag/untagged-aea95edaabab5f0616d0

sghill commented 2 years ago

Nice, thanks @krisstern! We have pulled in v1.21. This should be rolling out over the next couple weeks. I'll let you know if anything comes up.

Current plan is to configure like this in a custom initialization plugin:

import hudson.Extension;
import hudson.init.Initializer;
import hudson.plugins.build_timeout.BuildTimeOutOperation;
import hudson.plugins.build_timeout.global.GlobalTimeOutConfiguration;
import hudson.plugins.build_timeout.impl.AbsoluteTimeOutStrategy;
import hudson.plugins.build_timeout.operations.AbortOperation;
import hudson.plugins.build_timeout.operations.WriteDescriptionOperation;
import lombok.extern.slf4j.Slf4j;

import javax.inject.Inject;
import java.time.Duration;
import java.util.LinkedList;
import java.util.List;

import static hudson.init.InitMilestone.SYSTEM_CONFIG_ADAPTED;

@Slf4j
@Extension
public class GlobalBuildTimeoutConfigInit {
    private static final Duration GLOBAL_TIMEOUT = Duration.ofHours(30);
    private GlobalTimeOutConfiguration configuration;

    @Inject
    void setConfiguration(GlobalTimeOutConfiguration configuration) {
        this.configuration = configuration;
    }

    @Initializer(after = SYSTEM_CONFIG_ADAPTED)
    public void init() {
        long hours = GLOBAL_TIMEOUT.toHours();
        String description = String.format("Global timeout of %d hours elapsed", hours);
        configuration.setStrategy(new AbsoluteTimeOutStrategy(String.valueOf(GLOBAL_TIMEOUT.toMinutes())));
        List<BuildTimeOutOperation> operations = new LinkedList<>();
        operations.add(new AbortOperation());
        operations.add(new WriteDescriptionOperation(description));
        configuration.setOperations(operations);
        configuration.save();
        log.info("Builds will abort after {} hours and set description to '{}'", hours, description);
    }
}
sghill commented 2 years ago

Following up on this -- plugin has been rolled out and is working. We've timed out 3 runaway builds so far 🎉

Thanks so much @krisstern!

krisstern commented 2 years ago

You are welcome @sghill