Open timja opened 7 years ago
To clarify: The "ERROR: null" does not happen when the job is submitted to Jenkins - the jobs turn up in the queue as expected and the connection is still there. Only when a job is started by Jenkins does the "Error: null" happen.
I think I've found the problem and I now have a workaround:
In src/main/java/hudson/cli/BuildCommand.java around line 181 there is a comment stating:
// read output in a retry loop, by default try only once
// writeWholeLogTo may fail with FileNotFound
// exception on a slow/busy machine, if it takes
// longish to create the log file
and this still seems to be the issue (i.e. use of polling for existence of a file instead of proper synchronization). When the retry count is exceeded in the loop following the comment an exception is raised and the infamous "ERROR: null" results.
In my scenario the use of LDAP as user/group database for Jenkins makes the time-to-create-the-log longer and much more variable, so the number of retries frequently is insufficient.
There is an undocumented (and according to the code deprecated) option to the CLI "build" subcommand "-r n" which can be used to change the retry limit from the default 10 (* 100ms).
In my specific case I've found the required limit "n" to be >= 200 with a median of 60 retries (a one second delay) actually needed.
Changed the priority to minor, as the underlying problem remains unsolved, but the workaround is good enough for my purposes.
How is LDAP auth related to the delay in writing to disk?
Just anecdotal evidence:
While analyzing the behaviour I found an error in our LDAP configuration that resulted in a much higher request latency than necessary. Fixed that and the error rate went down from about 5:25 to 1:25. The difference in execution time and error rate as documented in the logs seems to have been mostly due to that configuration error, too.
We are using a reverse proxy for Web-SSO and the "project based matrix authorization" with LDAP as user and group database and whenever a job begins to run I observe quite a few LDAP transactions on the network.
This has led me to the assumption that access control is performed during the start of a job and that it has an influence on the time-to-create-the-log. I've not verified that this is actually the case.
In an experimental setup with local users/groups only I was unable to reproduce the issue.
When submitting jobs to Jenkins via SSH (ssh build@jenkins build -v -f job) an intermittent "ERROR: Null" occurs when the job is started on Jenkins.
This error terminates the SSH connection to Jenkins, so no build output nor build status is available - the job is "lost", although it will be executed by Jenkins.
The attached zip file has a minimal Jenkins job, a shell procedure that will trigger the bug and some sample output, as well as the list of plugins active on the jenkins installation.
The Jenkins installation has a mix of ubuntu, sles, osx and windows(cygwin) slaves - 71 machines providing 156 executors all connected via SSH. When run without restriction to a subset of executors the script runs for almost 12 minutes (5 batches with 5 jobs, each with a max duration of 13 seconds!) and the error rate is very high. When run with a restriction that singles out 8 slaves, the same job requires a bit more than 2 minutes and the error rate is much lower.
Originally reported by wjg57, imported from: "ERROR: null" message when building jobs via SSH CLI