jenkinsci / bitbucket-branch-source-plugin

Bitbucket Branch Source Plugin
https://plugins.jenkins.io/cloudbees-bitbucket-branch-source
MIT License
217 stars 353 forks source link

Notifying pull request build result: ERROR: Could not send notifications | java.net.SocketTimeoutException: Read timed out #217

Closed ellieayla closed 5 years ago

ellieayla commented 5 years ago

Your checklist for this issue

Description

I'm seeing a lot of errors updating build status via com.cloudbees.jenkins.plugins.bitbucket.client.BitbucketCloudApiClient.postBuildStatus. The majority of these failures are observed at the end of a build, though some are at the start of one. This was originally discussed in issue #213, but that's focused on retrying idempotent requests (GET branch list).

See also https://issues.jenkins-ci.org/browse/JENKINS-58810

[Bitbucket] Notifying pull request build result
ERROR: Could not send notifications
java.io.IOException: Communication error
    at com.cloudbees.jenkins.plugins.bitbucket.client.BitbucketCloudApiClient.doRequest(BitbucketCloudApiClient.java:861)
    at com.cloudbees.jenkins.plugins.bitbucket.client.BitbucketCloudApiClient.postRequest(BitbucketCloudApiClient.java:902)
    at com.cloudbees.jenkins.plugins.bitbucket.client.BitbucketCloudApiClient.postBuildStatus(BitbucketCloudApiClient.java:607)
    at com.cloudbees.jenkins.plugins.bitbucket.BitbucketChangesetCommentNotifier.buildStatus(BitbucketChangesetCommentNotifier.java:50)
    at com.cloudbees.jenkins.plugins.bitbucket.BitbucketBuildStatusNotifications.createStatus(BitbucketBuildStatusNotifications.java:143)
    at com.cloudbees.jenkins.plugins.bitbucket.BitbucketBuildStatusNotifications.sendNotifications(BitbucketBuildStatusNotifications.java:185)
    at com.cloudbees.jenkins.plugins.bitbucket.BitbucketBuildStatusNotifications.access$000(BitbucketBuildStatusNotifications.java:60)
    at com.cloudbees.jenkins.plugins.bitbucket.BitbucketBuildStatusNotifications$JobCompletedListener.onCompleted(BitbucketBuildStatusNotifications.java:257)
    at hudson.model.listeners.RunListener.fireCompleted(RunListener.java:209)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:584)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$1000(WorkflowRun.java:133)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:996)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1463)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:458)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:37)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at com.cloudbees.jenkins.plugins.bitbucket.client.BitbucketCloudApiClient.executeMethod(BitbucketCloudApiClient.java:759)
    at com.cloudbees.jenkins.plugins.bitbucket.client.BitbucketCloudApiClient.doRequest(BitbucketCloudApiClient.java:843)
    ... 22 more
Finished: SUCCESS

Any build with the failure at the end remains green and the Bitbucket UI will show "0 of 1". Any build with the failure at the start doesn't proceed at all, and no mention of the build will appear in the Bitbucket UI.

The socket timeout above is currently hardcoded to 60 * 1000 milliseconds, aka 1 minute. I see failures like the above logged at ~6 minutes after the last successful "[Bitbucket] Notifying pull request build result" logged.

https://github.com/jenkinsci/bitbucket-branch-source-plugin/blob/7c761939e6b950adbadb08ede737ac3b51470002/src/main/java/com/cloudbees/jenkins/plugins/bitbucket/client/BitbucketCloudApiClient.java#L151

This does not appear to be a symptom of rate limiting. I cranked up the logging level and reproduced a bunch of failures - that LOGGER.fine() about bitbucket API rate limiting isn't logged. And Bitbucket support has indicated that hitting api rate limits should produce HTTP/429.

https://github.com/jenkinsci/bitbucket-branch-source-plugin/blob/7c761939e6b950adbadb08ede737ac3b51470002/src/main/java/com/cloudbees/jenkins/plugins/bitbucket/client/BitbucketCloudApiClient.java#L747-L774

I have not managed to reproduce in a test case w/ a debugger attached (IntelliJ IDEA on Mac). That's an entirely different network stack compared to my real Jenkins server, though. The real one is running in an Ubuntu VM on Azure.

Absent a deterministic reproduction outside of production, I would be willing to install a custom-instrumented plugin. I don't know how to build one yet, or what instrumentation I would add anyway.

Unknowns:

  1. Does the plugin make an outbound network connection from the Jenkins master or from the worker where the build is running? They may have different network paths.
  2. Exactly how much wall time transpires from the start of sendNotifications() to the exception above? (By starting a bunch of jobs and watching a clock, believe 60 seconds.)
  3. How to reproduce deterministically?
ellieayla commented 5 years ago

Conjecture:

A TCP connection to api.bitbucket.org is created at the start of the build (either the first build status notification or fetching some piece of branch/pr metadata), successfully. The build continues on happily, with this TCP connection being held open and idle. Something interferes with this TCP connection - maybe an Azure NAT/Firewall/EgressLB somewhere (eg the Basic SKU LB has a 4 minute idle limit). No RST is sent, and the client has no indication of a problem. A subsequent request (the end-of-build status notification) goes down the existing connection, sitting in the Send Queue and never being acknowledged. Reading the HTTP Response fails with a timeout.

ellieayla commented 5 years ago

I created a new repo with this Jenkinsfile


pipeline {
    options {
        timeout(activity: true, time: 15, unit: 'MINUTES')
        timestamps ()
    }

    agent {
        label 'master'
    }

    stages {
        stage('Start time') {
            steps {
                sh "date -u --rfc-3339=seconds; netstat -top | grep bitbucket"
            }
        }

        stage('Sleep') {
            steps {
                sh "sleep 400"
            }
        }
    }

    post {
        always {
            sh "date -u --rfc-3339=seconds; netstat -top | grep bitbucket"
        }
    }
}

And I ran netstat in a loop every 5 seconds with timestamps on the master node; no pre-exisiting connections:

while [ 1 ]; do date -u --rfc-3339=seconds; sudo netstat -top | grep bitbucket; echo; sleep 5; done
2019-08-20 22:02:18+00:00
2019-08-20 22:02:23+00:00
2019-08-20 22:02:28+00:00

And ran a build on the master node. So the initial notification to api.bitbucket.org went down the socket with local port 36288.

22:02:35  + date -u --rfc-3339=seconds
22:02:35  2019-08-20 22:02:34+00:00
22:02:35  + grep bitbucket
22:02:35  + netstat -top
22:02:35  (Not all processes could be identified, non-owned process info
22:02:35   will not be shown, you would have to be root to see it all.)
22:02:35  tcp        0      0 172.16.0.4:36252        bitbucket.org:https     TIME_WAIT   -                timewait (59.30/0/0)
22:02:35  tcp        0      0 172.16.0.4:36212        bitbucket.org:https     TIME_WAIT   -                timewait (58.26/0/0)
22:02:35  tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)

netstat running under ssh every 5 seconds shows socket with local port 36288 remaining open...

-------------------------     Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
2019-08-20 22:02:33+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
2019-08-20 22:02:38+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
2019-08-20 22:02:43+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
2019-08-20 22:02:48+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
2019-08-20 22:02:53+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
...
2019-08-20 22:09:05+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
2019-08-20 22:09:10+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
2019-08-20 22:09:15+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
2019-08-20 22:09:20+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)
2019-08-20 22:09:25+00:00     tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)

Post-actions run

[Pipeline] { (Declarative: Post Actions)
[Pipeline] sh
22:09:27  + date -u --rfc-3339=seconds
22:09:27  2019-08-20 22:09:27+00:00
22:09:27  + grep bitbucket
22:09:27  + netstat -top
22:09:27  (Not all processes could be identified, non-owned process info
22:09:27   will not be shown, you would have to be root to see it all.)
22:09:27  tcp6       0      0 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        off (0.00/0/0)

The connection sits for 60 seconds with data in the Send-Q buffer, and the timer active.

2019-08-20 22:09:30+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (1.45/3/0)
2019-08-20 22:09:35+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (0.82/4/0)
2019-08-20 22:09:40+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (4.72/5/0)
2019-08-20 22:09:45+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (17.32/6/0)
2019-08-20 22:09:50+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (12.26/6/0)
2019-08-20 22:09:56+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (7.16/6/0)
2019-08-20 22:10:01+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (2.05/6/0)
2019-08-20 22:10:06+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (33.07/7/0)
2019-08-20 22:10:11+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (28.02/7/0)
2019-08-20 22:10:16+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (22.94/7/0)
2019-08-20 22:10:21+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (17.84/7/0)
2019-08-20 22:10:26+00:00     tcp6       0    736 172.16.0.4:36288        api.bitbucket.org:https ESTABLISHED 1912/java        on (12.75/7/0)

The build completes successfully, with the titular exception and no datestamp (please trust my watch)

[Bitbucket] Notifying commit build result
ERROR: Could not send notifications
java.io.IOException: Communication error

The connection has entered FIN_WAIT1 state, where it remains for ~85 seconds before being reaped.

2019-08-20 22:10:31+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (7.69/7/0)
2019-08-20 22:10:36+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (2.64/7/0)
2019-08-20 22:10:41+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (69.25/8/0)
2019-08-20 22:10:46+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (64.18/8/0)
2019-08-20 22:10:51+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (59.11/8/0)
2019-08-20 22:10:56+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (54.04/8/0)
2019-08-20 22:11:02+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (48.93/8/0)
2019-08-20 22:11:07+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (43.83/8/0)
2019-08-20 22:11:12+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (38.76/8/0)
2019-08-20 22:11:17+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (33.66/8/0)
2019-08-20 22:11:22+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (28.56/8/0)
2019-08-20 22:11:27+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (23.40/8/0)
2019-08-20 22:11:32+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (18.30/8/0)
2019-08-20 22:11:37+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (13.24/8/0)
2019-08-20 22:11:42+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (8.17/8/0)
2019-08-20 22:11:47+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (3.07/8/0)
2019-08-20 22:11:53+00:00     tcp6       0    768 172.16.0.4:36288        api.bitbucket.org:https FIN_WAIT1   -                on (0.00/8/0)
2019-08-20 22:11:58+00:00
2019-08-20 22:12:03+00:00
ellieayla commented 5 years ago

Configuration on Jenkins master during above test:

$ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 45
net.ipv4.tcp_keepalive_probes = 9
joshuabergeron commented 5 years ago

Been experiencing this too, thanks for all the information you put together! I like your idea of setting up a repo with a Jenkinsfile like that. I have been running tcpdump on our master but I have not had a lot of time to test builds/check results today.

I thought it might be worth noting that my environment here is also Jenkins on an Ubuntu VM @ Azure.

ellieayla commented 5 years ago

I believe the default behaviour of PoolingHttpClientConnectionManager is to inspect the server-provided Keep-Alive header and use it as the basis for auto configuring the ConnectionKeepAliveStrategy, and in the absence of such a header assumes the connection can be kept alive indefinitely.

I made a quick and dirty patch in commit cc9103cbc0adeafbc33c57940679a69707bf88b6 to explore whether a more aggressive ConnectionKeepAliveStrategy would be beneficial. I'm running a custom build of this fork.

ellieayla commented 5 years ago

@timpeeters @joshuabergeron @casz What are your thoughts on the above? Any flaws in logic or oversights?

jetersen commented 5 years ago

Wonder why we need to have client kept alive. At most you keep it alive for a few seconds but no more than that.

ellieayla commented 5 years ago

@casz I'm not sure we need to have the keep alive here at all. It seems to only exist as a side-effect of the client library choice. That's why I tried making the limit only 5 seconds in cc9103c.

The requests involved may come from any build, and any of them may be successful (first request, very quick subsequent request) or a victim (after a long delay). Very rarely the client makes multiple requests in rapid succession, but that likelihood increases substantially in a large Jenkins installation with concurrent builds.

But in my tiny Jenkins installation, where I never see more than 5 concurrent builds, it seems nearly useless.

jetersen commented 5 years ago

Open a PR 😕

jetersen commented 5 years ago

v2.4.6 should be in the update center shortly

ellieayla commented 5 years ago

Installed. Thanks guys. :)

albertocsm commented 5 years ago

installed. thanks!

joshuabergeron commented 5 years ago

No problems on my end since this update. Thanks for your work @alanjcastonguay !

tysonbrochu commented 5 years ago

This fixed the problem for us too. Many thanks!

jetersen commented 5 years ago

Remember to hit that watch button at the top and select the level you prefer 😄 I'd suggest watch for Releases only ❤️