timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-45545] "ERROR: null" message when building jobs via SSH CLI #9246

Open timja opened 7 years ago

timja commented 7 years ago

When submitting jobs to Jenkins via SSH (ssh build@jenkins build -v -f job) an intermittent "ERROR: Null" occurs when the job is started on Jenkins.

This error terminates the SSH connection to Jenkins, so no build output nor build status is available - the job is "lost", although it will be executed by Jenkins.

The attached zip file has a minimal Jenkins job, a shell procedure that will trigger the bug and some sample output, as well as the list of plugins active on the jenkins installation.

The Jenkins installation has a mix of ubuntu, sles, osx and windows(cygwin) slaves - 71 machines providing 156 executors all connected via SSH. When run without restriction to a subset of executors the script runs for almost 12 minutes (5 batches with 5 jobs, each with a max duration of 13 seconds!) and the error rate is very high. When run with a restriction that singles out 8 slaves, the same job requires a bit more than 2 minutes and the error rate is much lower.


Originally reported by wjg57, imported from: "ERROR: null" message when building jobs via SSH CLI
  • status: Open
  • priority: Minor
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 7 years ago

wjg57:

To clarify: The "ERROR: null" does not happen when the job is submitted to Jenkins - the jobs turn up in the queue as expected and the connection is still there. Only when a job is started by Jenkins does the "Error: null" happen.

timja commented 7 years ago

wjg57:

I think I've found the problem and I now have a workaround:

In src/main/java/hudson/cli/BuildCommand.java around line 181 there is a comment stating:

// read output in a retry loop, by default try only once
// writeWholeLogTo may fail with FileNotFound
// exception on a slow/busy machine, if it takes
// longish to create the log file

and this still seems to be the issue (i.e. use of polling for existence of a file instead of proper synchronization). When the retry count is exceeded in the loop following the comment an exception is raised and the infamous "ERROR: null" results.

In my scenario the use of LDAP as user/group database for Jenkins makes the time-to-create-the-log longer and much more variable, so the number of retries frequently is insufficient.

There is an undocumented (and according to the code deprecated) option to the CLI "build" subcommand "-r n" which can be used to change the retry limit from the default 10 (* 100ms).

In my specific case I've found the required limit "n" to be >= 200 with a median of 60 retries (a one second delay) actually needed.

 

timja commented 7 years ago

wjg57:

Changed the priority to minor, as the underlying problem remains unsolved, but the workaround is good enough for my purposes.

timja commented 7 years ago

danielbeck:

How is LDAP auth related to the delay in writing to disk?

timja commented 7 years ago

wjg57:

Just anecdotal evidence:

While analyzing the behaviour I found an error in our LDAP configuration that resulted in a much higher request latency than necessary. Fixed that and the error rate went down from about 5:25 to 1:25. The difference in execution time and error rate as documented in the logs seems to have been mostly due to that configuration error, too.

We are using a reverse proxy for Web-SSO and the "project based matrix authorization" with LDAP as user and group database and whenever a job begins to run I observe quite a few LDAP transactions on the network.

This has led me to the assumption that access control is performed during the start of a job and that it has an influence on the time-to-create-the-log. I've not verified that this is actually the case.

In an experimental setup with local users/groups only I was unable to reproduce the issue.