Closed ellieayla closed 5 years ago
Conjecture:
A TCP connection to api.bitbucket.org is created successfully at the start of the build (either for the first build-status notification or to fetch some piece of branch/PR metadata). The build continues on happily, with this TCP connection held open and idle. Something interferes with the connection - maybe an Azure NAT/firewall/egress LB somewhere (e.g. the Basic SKU LB has a 4-minute idle limit). No RST is sent, so the client has no indication of a problem. A subsequent request (the end-of-build status notification) goes down the existing connection, sits in the Send-Q, and is never acknowledged. Reading the HTTP response then fails with a timeout.
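If that conjecture holds, the client-side symptom is an ordinary idle socket that nothing ever probes. As a hedged illustration (plain JDK sockets over loopback, standing in for the plugin's actual HTTP stack), SO_KEEPALIVE is off by default on Java sockets, so the kernel never tests an idle connection and a silently dropped NAT mapping goes unnoticed:

```java
import java.net.ServerSocket;
import java.net.Socket;

public class KeepaliveDefault {
    public static void main(String[] args) throws Exception {
        // A loopback socket pair stands in for the connection to api.bitbucket.org.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            // SO_KEEPALIVE defaults to disabled: the kernel sends no probes,
            // matching the "off" Timer column in the netstat output below.
            System.out.println("keepalive=" + client.getKeepAlive()); // keepalive=false
            client.setKeepAlive(true); // would enable kernel keepalive probes
            System.out.println("keepalive=" + client.getKeepAlive()); // keepalive=true
        }
    }
}
```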
I created a new repo with this Jenkinsfile:
pipeline {
    options {
        timeout(activity: true, time: 15, unit: 'MINUTES')
        timestamps()
    }
    agent {
        label 'master'
    }
    stages {
        stage('Start time') {
            steps {
                sh "date -u --rfc-3339=seconds; netstat -top | grep bitbucket"
            }
        }
        stage('Sleep') {
            steps {
                sh "sleep 400"
            }
        }
    }
    post {
        always {
            sh "date -u --rfc-3339=seconds; netstat -top | grep bitbucket"
        }
    }
}
And I ran netstat in a loop every 5 seconds with timestamps on the master node; no pre-existing connections:
while [ 1 ]; do date -u --rfc-3339=seconds; sudo netstat -top | grep bitbucket; echo; sleep 5; done
2019-08-20 22:02:18+00:00
2019-08-20 22:02:23+00:00
2019-08-20 22:02:28+00:00
And ran a build on the master node. So the initial notification to api.bitbucket.org went down the socket with local port 36288.
22:02:35 + date -u --rfc-3339=seconds
22:02:35 2019-08-20 22:02:34+00:00
22:02:35 + grep bitbucket
22:02:35 + netstat -top
22:02:35 (Not all processes could be identified, non-owned process info
22:02:35 will not be shown, you would have to be root to see it all.)
22:02:35 tcp 0 0 172.16.0.4:36252 bitbucket.org:https TIME_WAIT - timewait (59.30/0/0)
22:02:35 tcp 0 0 172.16.0.4:36212 bitbucket.org:https TIME_WAIT - timewait (58.26/0/0)
22:02:35 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
netstat running under ssh every 5 seconds shows socket with local port 36288 remaining open...
------------------------- Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
2019-08-20 22:02:33+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
2019-08-20 22:02:38+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
2019-08-20 22:02:43+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
2019-08-20 22:02:48+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
2019-08-20 22:02:53+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
...
2019-08-20 22:09:05+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
2019-08-20 22:09:10+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
2019-08-20 22:09:15+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
2019-08-20 22:09:20+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
2019-08-20 22:09:25+00:00 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
Post-actions run
[Pipeline] { (Declarative: Post Actions)
[Pipeline] sh
22:09:27 + date -u --rfc-3339=seconds
22:09:27 2019-08-20 22:09:27+00:00
22:09:27 + grep bitbucket
22:09:27 + netstat -top
22:09:27 (Not all processes could be identified, non-owned process info
22:09:27 will not be shown, you would have to be root to see it all.)
22:09:27 tcp6 0 0 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java off (0.00/0/0)
The connection sits for 60 seconds with data in the Send-Q buffer, and the timer active.
2019-08-20 22:09:30+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (1.45/3/0)
2019-08-20 22:09:35+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (0.82/4/0)
2019-08-20 22:09:40+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (4.72/5/0)
2019-08-20 22:09:45+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (17.32/6/0)
2019-08-20 22:09:50+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (12.26/6/0)
2019-08-20 22:09:56+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (7.16/6/0)
2019-08-20 22:10:01+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (2.05/6/0)
2019-08-20 22:10:06+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (33.07/7/0)
2019-08-20 22:10:11+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (28.02/7/0)
2019-08-20 22:10:16+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (22.94/7/0)
2019-08-20 22:10:21+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (17.84/7/0)
2019-08-20 22:10:26+00:00 tcp6 0 736 172.16.0.4:36288 api.bitbucket.org:https ESTABLISHED 1912/java on (12.75/7/0)
The build completes successfully, but with the titular exception. These log lines have no datestamp (please trust my watch):
[Bitbucket] Notifying commit build result
ERROR: Could not send notifications
java.io.IOException: Communication error
The connection has entered the FIN_WAIT1 state, where it remains for ~85 seconds before being reaped.
2019-08-20 22:10:31+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (7.69/7/0)
2019-08-20 22:10:36+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (2.64/7/0)
2019-08-20 22:10:41+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (69.25/8/0)
2019-08-20 22:10:46+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (64.18/8/0)
2019-08-20 22:10:51+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (59.11/8/0)
2019-08-20 22:10:56+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (54.04/8/0)
2019-08-20 22:11:02+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (48.93/8/0)
2019-08-20 22:11:07+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (43.83/8/0)
2019-08-20 22:11:12+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (38.76/8/0)
2019-08-20 22:11:17+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (33.66/8/0)
2019-08-20 22:11:22+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (28.56/8/0)
2019-08-20 22:11:27+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (23.40/8/0)
2019-08-20 22:11:32+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (18.30/8/0)
2019-08-20 22:11:37+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (13.24/8/0)
2019-08-20 22:11:42+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (8.17/8/0)
2019-08-20 22:11:47+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (3.07/8/0)
2019-08-20 22:11:53+00:00 tcp6 0 768 172.16.0.4:36288 api.bitbucket.org:https FIN_WAIT1 - on (0.00/8/0)
2019-08-20 22:11:58+00:00
2019-08-20 22:12:03+00:00
Configuration on Jenkins master during above test:
$ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 45
net.ipv4.tcp_keepalive_probes = 9
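For reference, those sysctls would bound the worst-case time for the kernel to declare an idle peer dead - but only for sockets that have SO_KEEPALIVE enabled, and the netstat Timer column above reads "off" while ESTABLISHED, suggesting no keepalive timer was running on this connection. The arithmetic (plain illustration, not plugin code):

```java
public class KeepaliveBudget {
    public static void main(String[] args) {
        int time = 60;    // net.ipv4.tcp_keepalive_time: idle seconds before first probe
        int intvl = 45;   // net.ipv4.tcp_keepalive_intvl: seconds between probes
        int probes = 9;   // net.ipv4.tcp_keepalive_probes: unanswered probes before reset
        int worstCase = time + intvl * probes;
        System.out.println(worstCase + " seconds"); // prints "465 seconds"
    }
}
```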
Been experiencing this too, thanks for all the information you put together! I like your idea of setting up a repo with a Jenkinsfile like that. I have been running tcpdump on our master but I have not had a lot of time to test builds/check results today.
I thought it might be worth noting that my environment here is also Jenkins on an Ubuntu VM @ Azure.
I believe the default behaviour of PoolingHttpClientConnectionManager is to inspect the server-provided Keep-Alive header and use it as the basis for the ConnectionKeepAliveStrategy, and, in the absence of such a header, to assume the connection can be kept alive indefinitely.
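A sketch of what capping that behaviour could look like with Apache HttpClient 4.x - hedged: this mirrors the idea, not the patch's exact code, and the 5-second cap is illustrative:

```java
import org.apache.http.HttpResponse;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

public class CappedKeepAlive {
    // Honour the server's Keep-Alive hint, but never keep an idle
    // connection pooled for longer than 5 seconds.
    static final long CAP_MILLIS = 5_000L;

    public static CloseableHttpClient build() {
        ConnectionKeepAliveStrategy capped = new ConnectionKeepAliveStrategy() {
            @Override
            public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
                long hint = DefaultConnectionKeepAliveStrategy.INSTANCE
                        .getKeepAliveDuration(response, context);
                // The default strategy returns -1 ("keep alive indefinitely")
                // when the server sends no Keep-Alive header.
                return (hint < 0 || hint > CAP_MILLIS) ? CAP_MILLIS : hint;
            }
        };
        return HttpClients.custom().setKeepAliveStrategy(capped).build();
    }
}
```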
I made a quick-and-dirty patch in commit cc9103cbc0adeafbc33c57940679a69707bf88b6 to explore whether a more aggressive ConnectionKeepAliveStrategy would be beneficial. I'm running a custom build of this fork.
@timpeeters @joshuabergeron @casz What are your thoughts on the above? Any flaws in logic or oversights?
I wonder why we need to keep the client alive at all. At most you'd keep it alive for a few seconds, but no more than that.
@casz I'm not sure we need to have the keep alive here at all. It seems to only exist as a side-effect of the client library choice. That's why I tried making the limit only 5 seconds in cc9103c.
The requests involved may come from any build, and any of them may be successful (first request, very quick subsequent request) or a victim (after a long delay). Very rarely the client makes multiple requests in rapid succession, but that likelihood increases substantially in a large Jenkins installation with concurrent builds.
But in my tiny Jenkins installation, where I never see more than 5 concurrent builds, it seems nearly useless.
Open a PR 😕
Installed. Thanks guys. :)
installed. thanks!
No problems on my end since this update. Thanks for your work @alanjcastonguay !
This fixed the problem for us too. Many thanks!
Remember to hit that watch button at the top and select the level you prefer 😄
I'd suggest watch for Releases only
❤️
Your checklist for this issue
[X] Jenkins version: 2.176.1
[X] Plugin version: org.jenkins-ci.plugins:cloudbees-bitbucket-branch-source:2.4.5
[X] Bitbucket cloud: yes
[ ] Bitbucket server and version
Description
I'm seeing a lot of errors updating build status via com.cloudbees.jenkins.plugins.bitbucket.client.BitbucketCloudApiClient.postBuildStatus. The majority of these failures are observed at the end of a build, though some are at the start of one. This was originally discussed in issue #213, but that's focused on retrying idempotent requests (GET branch list). See also https://issues.jenkins-ci.org/browse/JENKINS-58810
Any build with the failure at the end remains green and the Bitbucket UI will show "0 of 1". Any build with the failure at the start doesn't proceed at all, and no mention of the build will appear in the Bitbucket UI.
The socket timeout above is currently hardcoded to 60 * 1000 milliseconds, i.e. 1 minute. I see failures like the above logged ~6 minutes after the last successful "[Bitbucket] Notifying pull request build result" message.
https://github.com/jenkinsci/bitbucket-branch-source-plugin/blob/7c761939e6b950adbadb08ede737ac3b51470002/src/main/java/com/cloudbees/jenkins/plugins/bitbucket/client/BitbucketCloudApiClient.java#L151
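For context, with Apache HttpClient 4.x a hard socket timeout like that is typically applied through RequestConfig. A hedged sketch - the 60-second value matches the hardcoded constant above; the connect timeout and everything else is illustrative, not the plugin's actual configuration:

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ClientConfig {
    public static CloseableHttpClient build() {
        RequestConfig config = RequestConfig.custom()
                .setSocketTimeout(60 * 1000)   // max wait for response data (SO_TIMEOUT)
                .setConnectTimeout(10 * 1000)  // illustrative connect timeout
                .build();
        return HttpClients.custom().setDefaultRequestConfig(config).build();
    }
}
```

With a write sitting unacknowledged in the Send-Q, it's this socket timeout that eventually fires while reading the response, producing the IOException above.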
This does not appear to be a symptom of rate limiting. I cranked up the logging level and reproduced a bunch of failures - the LOGGER.fine() message about Bitbucket API rate limiting isn't logged. And Bitbucket support has indicated that hitting API rate limits should produce HTTP 429.
https://github.com/jenkinsci/bitbucket-branch-source-plugin/blob/7c761939e6b950adbadb08ede737ac3b51470002/src/main/java/com/cloudbees/jenkins/plugins/bitbucket/client/BitbucketCloudApiClient.java#L747-L774
I have not managed to reproduce this in a test case with a debugger attached (IntelliJ IDEA on Mac). That's an entirely different network stack compared to my real Jenkins server, though; the real one is running in an Ubuntu VM on Azure.
Absent a deterministic reproduction outside of production, I would be willing to install a custom-instrumented plugin. I don't know how to build one yet, or what instrumentation I would add anyway.
Unknowns:
How long elapses from entering sendNotifications() to the exception above? (By starting a bunch of jobs and watching a clock, I believe 60 seconds.)