timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-65896] Builds mysteriously disappearing during run #1555

Open timja opened 3 years ago

timja commented 3 years ago

I'm primarily looking for advice how to troubleshoot this issue.

We have a Jenkins instance (latest version (core + ~150 plugins). Everything works fine except there is a job, cron-triggered daily. Sometimes the actual build of this job just "disappears": the build is not visible in the UI, the build directory on the disk has the "beginning" of the build log ending in the middle of the build, and build.xml with the build metadata is also completely missing. Jenkins log doesn't contain anything related.

How can I go on  investigating what causes this? Anything I could enable in logging to provide more information?

 


Originally reported by pistahh, imported from: Builds mysteriously disappearing during run
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 2 years ago

mikenau_intuit:

We experienced the same issue on one of our Jenkins controllers on March 1, 2022 between 9:05am and 10:09am PT.

Our Jenkins environment is

Jenkins Version: 2.289.3
Jenkins Runtime: Jenkins Controller is running in a Kubernetes Cluster
Impacted Build Agents: All impacted build agents are running on AWS Windows Instances that are managed by the https://github.com/jenkinsci/ec2-plugin (we have an internal version of this plugin that is based off of version 1.57)

We had 7 separate builds that were triggered but in each case the build was not visible in the Jenkins UI. When going to JENKINS_HOME the directory for each of the builds were present, but each was missing the build.xml file and the log file was incomplete and truncated. Each of these builds were still running at the time - we were able to confirm this using the build logs in Splunk (we use the Jenkins Splunk plugin). Each of these 7 builds were running on different Windows build agents and each ran between 9:05am and 10:09am PT. 

Here's what the build dir looks like for one of these builds

[root]# pwd
JENKINS_HOME/jobs/21-Release/builds/147

[root]# ls -l
total 516
-rw-r--r--. 1 root root      0 Mar  1 17:05 changelog.xml
-rw-r--r--. 1 root root 518875 Mar  1 18:10 log
drwxr-xr-x. 2 root root   4096 Mar  1 17:05 timestamper

[root]# tail log
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:diagnostics-maven-plugin: checking for updates from Intuit-Releases
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:efiling-maven-plugin: checking for updates from central
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:efiling-maven-plugin: checking for updates from Intuit-Releases
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:efilingxml-maven-plugin: checking for updates from central
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:efilingxml-maven-plugin: checking for updates from Intuit-Releases
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:formscalc-maven-plugin: checking for updates from central
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:formscalc-maven-plugin: checking for updates from Intuit-Releases
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:help-maven-plugin: checking for updates from central
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:help-maven-plugin: checking for updates from Intuit-Releases
[INFO] artifact com.intuit.ctg.formset.ty21.plugins:import-maven-plugin: checking for updates from central

We are not able to find any errors related to these windows instances in any of the Jenkins logs or in AWS CloudTrail during this time - everything looks normal from that perspective. 

We tried to perform both a hard and soft restart to see if the Jenkins UI would show these 7 builds, but no luck. They were still not visible. 

After these 7 builds completed, the associated windows build agents were able to perform other builds successfully and these subsequent builds showed up in the Jenkins UI like normal. 

The number of builds running on this Jenkins controller is consistent with what we typically see - about 100 builds started per hour

 

We run 1000's of windows builds a day on this Jenkins and others that are essentially the same and have never seen any of these issues before. 

Hoping someone has an idea on what might have happened and how we can prevent it from happening in the future. Happy to add more info if needed (just let me know what additional info would be helpful)

timja commented 2 years ago

mikenau_intuit:

After further analysis, it looks like this issue with builds not being present in the UI has been happening for a while in our infrastructure. When this happens, the build folder is not removed from disk by the build discarders we have configured. This allows us to see over time how often this happens by writing a bash script to look for build folders that is missing a build.xml file. We currently run about 200 Jenkins controllers and I spot checked a few of them builds without a build.xml file on quite a few of them. Here's an example of what that looks like

[root@ip-10-47-49-57 ibp]# ll ../jobs/20-FDI-win-CI/builds
total 68
drwxr-xr-x. 3 root root 4096 Dec 12  2020 1346
drwxr-xr-x. 2 root root 4096 May 29  2021 2544
drwxr-xr-x. 2 root root 4096 Feb 25 14:53 3341
drwxr-xr-x. 2 root root 4096 Feb 26 14:52 3342
drwxr-xr-x. 2 root root 4096 Feb 27 20:02 3343
drwxr-xr-x. 2 root root 4096 Feb 28 14:51 3344
drwxr-xr-x. 2 root root 4096 Mar  1 14:51 3345
drwxr-xr-x. 2 root root 4096 Mar  2 14:52 3346
drwxr-xr-x. 2 root root 4096 Mar  3 14:52 3347
drwxr-xr-x. 2 root root 4096 Mar  4 14:51 3348
drwxr-xr-x. 2 root root 4096 Mar  5 14:51 3349
drwxr-xr-x. 2 root root 4096 Mar  6 14:01 3350
drwxr-xr-x. 2 root root 4096 Mar  7 14:51 3351
drwxr-xr-x. 2 root root 4096 Mar  8 14:52 3352
drwxr-xr-x. 2 root root 4096 Mar  9 14:50 3353
drwxr-xr-x. 2 root root 4096 Mar 10 14:52 3354
-rw-r--r--. 1 root root    0 May 18  2020 legacyIds
-rw-r--r--. 1 root root  135 Mar 10 14:52 permalinks

[root@ip-10-47-49-57 ibp]# ll ../jobs/20-FDI-win-CI/builds/1346
total 792
-rw-r--r--. 1 root root     12 Dec 12  2020 changelog.xml
-rw-r--r--. 1 root root 798720 Dec 12  2020 log
drwxr-xr-x. 2 root root   4096 Dec 12  2020 timestamper

[root@ip-10-47-49-57 ibp]# ll ../jobs/20-FDI-win-CI/builds/2544
total 368
-rw-r--r--. 1 root root   1181 May 29  2021 changelog.xml
-rw-r--r--. 1 root root 367492 May 29  2021 log

Notice the timestamps on builds 1346 and 2544 are significantly older. Each of these build directories is missing a build.xml file. We are not able to see these builds in the Jenkins UI. 

We are currently running Jenkins 2.289.3, but we're running previous LTS versions over the past couple of years when this has happened as well. I attached the list of plugins we have installed and the system properties to this ticket. 

I ran the following bash script run against JENKINS_HOME to identify build directories that are missing a build.xml file (replace JENKINS_HOME with the path to your specific JENKINS_HOME). Note this script has not been sufficiently tested and will not handle all scenarios (i.e. paths with spaces or special characters in the name). Running it should not do any harm as it's not adding/editing/deleting anything, but you have been warned

#!/bin/bash

jobs_dir=JENKINS_HOME/jobs

# Find all directories names "builds"
build_dirs=$(find ${jobs_dir} -type d -name "builds" -print)

# Loop through each of the "builds" directories found above
for build_dir in ${build_dirs}; do
  #echo build_dir: ${build_dir}  

  # Look for subdirectories under the "builds" dir that does not have a build.xml file
  build_dirs_without_build_xml=$(find ${build_dir} -maxdepth 1 -mindepth 1 -type d '!' -exec test -e "{}/build.xml" ';' -print)  

  # echo the directory name and it's last modified time
  for build_dir_without_build_xml in ${build_dirs_without_build_xml}; do
    dir_last_modified_time=$(stat -c '%y' ${build_dir_without_build_xml})
    echo "${build_dir_without_build_xml} (${dir_last_modified_time})"
  done
done

I'm surprised there isn't more noise around this issue? Maybe others aren't aware it's happening? We only just realized it is happening and we've been running a large Jenkins fleet for years. 

timja commented 2 years ago

mikenau_intuit:

The following groovy script can be run in the Jenkins script console to find builds that are missing a build.xml file and might be missing from the Jenkins UI

import java.util.logging.Logger

def build_lookback_days = 7
Date build_lookback_date = (new Date() - build_lookback_days)

//def jobs_dir_path = "./_output/jobs"
def jobs_dir_path = "/var/jenkins_home/jobs"

void logMessage(String message) {
    println(message)
    Logger logger = Logger.getLogger("")
    logger.info("builds_missing_build_xml " + message)
}

jobs_dir = new File(jobs_dir_path)

logMessage("jobs_dir_path: " + jobs_dir_path)
logMessage("build_lookback_days: " + build_lookback_days)
logMessage("build_lookback_date: " + build_lookback_date.format("yyyy-MM-dd HH:mm:ss"))

// Recurse the directories in the jobs_dir looking for
jobs_dir.eachDirRecurse {
    // Is this a builds dir?
    if (it.getName() == "builds") {

// Iterate over all the sub directories and
// check if it contains a build.xml file
builds_dir = new File(it.toString())
builds_dir.eachDir {
    File build_xml_file = new File(it.toString() + "/build.xml")
    if (!build_xml_file.exists()) {
Date buildLastModifiedDate = new Date(it.lastModified());

// Is the build in our build_lookback window
if (buildLastModifiedDate > build_lookback_date) {
    logMessage("*** " + it + " (" + buildLastModifiedDate.format("yyyy-MM-dd HH:mm:ss") + ")")
}
else {
    //logMessage("*** IGNORED *** " + it + " (" + buildLastModifiedDate.format("yyyy-MM-dd HH:mm:ss") + ")")
}
    }
}
    }
}

The output is logged to both the Jenkins log and the script console window. Here is a sample of the output from the Jenkins script console

jobs_dir_path: /var/jenkins_home/jobs
build_lookback_days: 21
build_lookback_date: 2022-02-21 22:27:04
*** /var/jenkins_home/jobs/21-MDI-mac-ROB/builds/9 (2022-03-01 18:09:56)
*** /var/jenkins_home/jobs/21-HII-win-CI-GIT/builds/1189 (2022-03-01 18:06:22)
*** /var/jenkins_home/jobs/21-MDI-web-ROB/builds/9 (2022-03-01 18:09:56)
*** /var/jenkins_home/jobs/21-Release/builds/147 (2022-03-01 20:00:12)
*** /var/jenkins_home/jobs/21-GAI-win-ROB/builds/3 (2022-03-01 18:08:20)
*** /var/jenkins_home/jobs/21-MDI-win-ROB/builds/9 (2022-03-01 18:09:56)
*** /var/jenkins_home/jobs/21-BuildManager/builds/6993 (2022-03-01 18:10:02)