Open timja opened 3 years ago
We experienced the same issue on one of our Jenkins controllers on March 1, 2022 between 9:05am and 10:09am PT.
Our Jenkins environment is
Jenkins Version: 2.289.3
Jenkins Runtime: Jenkins Controller is running in a Kubernetes Cluster
Impacted Build Agents: All impacted build agents are running on AWS Windows Instances that are managed by the https://github.com/jenkinsci/ec2-plugin (we have an internal version of this plugin that is based off of version 1.57)
We had 7 separate builds that were triggered but in each case the build was not visible in the Jenkins UI. When going to JENKINS_HOME the directory for each of the builds were present, but each was missing the build.xml file and the log file was incomplete and truncated. Each of these builds were still running at the time - we were able to confirm this using the build logs in Splunk (we use the Jenkins Splunk plugin). Each of these 7 builds were running on different Windows build agents and each ran between 9:05am and 10:09am PT.
Here's what the build dir looks like for one of these builds
[root]# pwd JENKINS_HOME/jobs/21-Release/builds/147 [root]# ls -l total 516 -rw-r--r--. 1 root root 0 Mar 1 17:05 changelog.xml -rw-r--r--. 1 root root 518875 Mar 1 18:10 log drwxr-xr-x. 2 root root 4096 Mar 1 17:05 timestamper [root]# tail log [INFO] artifact com.intuit.ctg.formset.ty21.plugins:diagnostics-maven-plugin: checking for updates from Intuit-Releases [INFO] artifact com.intuit.ctg.formset.ty21.plugins:efiling-maven-plugin: checking for updates from central [INFO] artifact com.intuit.ctg.formset.ty21.plugins:efiling-maven-plugin: checking for updates from Intuit-Releases [INFO] artifact com.intuit.ctg.formset.ty21.plugins:efilingxml-maven-plugin: checking for updates from central [INFO] artifact com.intuit.ctg.formset.ty21.plugins:efilingxml-maven-plugin: checking for updates from Intuit-Releases [INFO] artifact com.intuit.ctg.formset.ty21.plugins:formscalc-maven-plugin: checking for updates from central [INFO] artifact com.intuit.ctg.formset.ty21.plugins:formscalc-maven-plugin: checking for updates from Intuit-Releases [INFO] artifact com.intuit.ctg.formset.ty21.plugins:help-maven-plugin: checking for updates from central [INFO] artifact com.intuit.ctg.formset.ty21.plugins:help-maven-plugin: checking for updates from Intuit-Releases [INFO] artifact com.intuit.ctg.formset.ty21.plugins:import-maven-plugin: checking for updates from central
We are not able to find any errors related to these windows instances in any of the Jenkins logs or in AWS CloudTrail during this time - everything looks normal from that perspective.
We tried to perform both a hard and soft restart to see if the Jenkins UI would show these 7 builds, but no luck. They were still not visible.
After these 7 builds completed, the associated windows build agents were able to perform other builds successfully and these subsequent builds showed up in the Jenkins UI like normal.
The number of builds running on this Jenkins controller is consistent with what we typically see - about 100 builds started per hour
We run 1000's of windows builds a day on this Jenkins and others that are essentially the same and have never seen any of these issues before.
Hoping someone has an idea on what might have happened and how we can prevent it from happening in the future. Happy to add more info if needed (just let me know what additional info would be helpful)
After further analysis, it looks like this issue with builds not being present in the UI has been happening for a while in our infrastructure. When this happens, the build folder is not removed from disk by the build discarders we have configured. This allows us to see over time how often this happens by writing a bash script to look for build folders that is missing a build.xml file. We currently run about 200 Jenkins controllers and I spot checked a few of them builds without a build.xml file on quite a few of them. Here's an example of what that looks like
[root@ip-10-47-49-57 ibp]# ll ../jobs/20-FDI-win-CI/builds total 68 drwxr-xr-x. 3 root root 4096 Dec 12 2020 1346 drwxr-xr-x. 2 root root 4096 May 29 2021 2544 drwxr-xr-x. 2 root root 4096 Feb 25 14:53 3341 drwxr-xr-x. 2 root root 4096 Feb 26 14:52 3342 drwxr-xr-x. 2 root root 4096 Feb 27 20:02 3343 drwxr-xr-x. 2 root root 4096 Feb 28 14:51 3344 drwxr-xr-x. 2 root root 4096 Mar 1 14:51 3345 drwxr-xr-x. 2 root root 4096 Mar 2 14:52 3346 drwxr-xr-x. 2 root root 4096 Mar 3 14:52 3347 drwxr-xr-x. 2 root root 4096 Mar 4 14:51 3348 drwxr-xr-x. 2 root root 4096 Mar 5 14:51 3349 drwxr-xr-x. 2 root root 4096 Mar 6 14:01 3350 drwxr-xr-x. 2 root root 4096 Mar 7 14:51 3351 drwxr-xr-x. 2 root root 4096 Mar 8 14:52 3352 drwxr-xr-x. 2 root root 4096 Mar 9 14:50 3353 drwxr-xr-x. 2 root root 4096 Mar 10 14:52 3354 -rw-r--r--. 1 root root 0 May 18 2020 legacyIds -rw-r--r--. 1 root root 135 Mar 10 14:52 permalinks [root@ip-10-47-49-57 ibp]# ll ../jobs/20-FDI-win-CI/builds/1346 total 792 -rw-r--r--. 1 root root 12 Dec 12 2020 changelog.xml -rw-r--r--. 1 root root 798720 Dec 12 2020 log drwxr-xr-x. 2 root root 4096 Dec 12 2020 timestamper [root@ip-10-47-49-57 ibp]# ll ../jobs/20-FDI-win-CI/builds/2544 total 368 -rw-r--r--. 1 root root 1181 May 29 2021 changelog.xml -rw-r--r--. 1 root root 367492 May 29 2021 log
Notice the timestamps on builds 1346 and 2544 are significantly older. Each of these build directories is missing a build.xml file. We are not able to see these builds in the Jenkins UI.
We are currently running Jenkins 2.289.3, but we're running previous LTS versions over the past couple of years when this has happened as well. I attached the list of plugins we have installed and the system properties to this ticket.
I ran the following bash script run against JENKINS_HOME to identify build directories that are missing a build.xml file (replace JENKINS_HOME with the path to your specific JENKINS_HOME). Note this script has not been sufficiently tested and will not handle all scenarios (i.e. paths with spaces or special characters in the name). Running it should not do any harm as it's not adding/editing/deleting anything, but you have been warned
#!/bin/bash jobs_dir=JENKINS_HOME/jobs # Find all directories names "builds" build_dirs=$(find ${jobs_dir} -type d -name "builds" -print) # Loop through each of the "builds" directories found above for build_dir in ${build_dirs}; do #echo build_dir: ${build_dir} # Look for subdirectories under the "builds" dir that does not have a build.xml file build_dirs_without_build_xml=$(find ${build_dir} -maxdepth 1 -mindepth 1 -type d '!' -exec test -e "{}/build.xml" ';' -print) # echo the directory name and it's last modified time for build_dir_without_build_xml in ${build_dirs_without_build_xml}; do dir_last_modified_time=$(stat -c '%y' ${build_dir_without_build_xml}) echo "${build_dir_without_build_xml} (${dir_last_modified_time})" done done
I'm surprised there isn't more noise around this issue? Maybe others aren't aware it's happening? We only just realized it is happening and we've been running a large Jenkins fleet for years.
The following groovy script can be run in the Jenkins script console to find builds that are missing a build.xml file and might be missing from the Jenkins UI
import java.util.logging.Logger def build_lookback_days = 7 Date build_lookback_date = (new Date() - build_lookback_days) //def jobs_dir_path = "./_output/jobs" def jobs_dir_path = "/var/jenkins_home/jobs" void logMessage(String message) { println(message) Logger logger = Logger.getLogger("") logger.info("builds_missing_build_xml " + message) } jobs_dir = new File(jobs_dir_path) logMessage("jobs_dir_path: " + jobs_dir_path) logMessage("build_lookback_days: " + build_lookback_days) logMessage("build_lookback_date: " + build_lookback_date.format("yyyy-MM-dd HH:mm:ss")) // Recurse the directories in the jobs_dir looking for jobs_dir.eachDirRecurse { // Is this a builds dir? if (it.getName() == "builds") { // Iterate over all the sub directories and // check if it contains a build.xml file builds_dir = new File(it.toString()) builds_dir.eachDir { File build_xml_file = new File(it.toString() + "/build.xml") if (!build_xml_file.exists()) { Date buildLastModifiedDate = new Date(it.lastModified()); // Is the build in our build_lookback window if (buildLastModifiedDate > build_lookback_date) { logMessage("*** " + it + " (" + buildLastModifiedDate.format("yyyy-MM-dd HH:mm:ss") + ")") } else { //logMessage("*** IGNORED *** " + it + " (" + buildLastModifiedDate.format("yyyy-MM-dd HH:mm:ss") + ")") } } } } }
The output is logged to both the Jenkins log and the script console window. Here is a sample of the output from the Jenkins script console
jobs_dir_path: /var/jenkins_home/jobs build_lookback_days: 21 build_lookback_date: 2022-02-21 22:27:04 *** /var/jenkins_home/jobs/21-MDI-mac-ROB/builds/9 (2022-03-01 18:09:56) *** /var/jenkins_home/jobs/21-HII-win-CI-GIT/builds/1189 (2022-03-01 18:06:22) *** /var/jenkins_home/jobs/21-MDI-web-ROB/builds/9 (2022-03-01 18:09:56) *** /var/jenkins_home/jobs/21-Release/builds/147 (2022-03-01 20:00:12) *** /var/jenkins_home/jobs/21-GAI-win-ROB/builds/3 (2022-03-01 18:08:20) *** /var/jenkins_home/jobs/21-MDI-win-ROB/builds/9 (2022-03-01 18:09:56) *** /var/jenkins_home/jobs/21-BuildManager/builds/6993 (2022-03-01 18:10:02)
I'm primarily looking for advice how to troubleshoot this issue.
We have a Jenkins instance (latest version (core + ~150 plugins). Everything works fine except there is a job, cron-triggered daily. Sometimes the actual build of this job just "disappears": the build is not visible in the UI, the build directory on the disk has the "beginning" of the build log ending in the middle of the build, and build.xml with the build metadata is also completely missing. Jenkins log doesn't contain anything related.
How can I go on investigating what causes this? Anything I could enable in logging to provide more information?
Originally reported by pistahh, imported from: Builds mysteriously disappearing during run