I plan to propose a patch for this as well but going to let this bug report get reviewed before I do anything.
I have no objection to tuning the retries via system properties.
Still, I'm surprised that the retries are needed - the whole retry functionality was originally added to deal with flakiness in the Java 8 URLConnection. In my original testing, okhttp seemed not to have this issue, or handled it internally.
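For reference, wiring the okhttp-backed connector in looks roughly like the sketch below. This assumes the okhttp3 extras connector (OkHttpGitHubConnector) that ships with recent github-api releases plus an okhttp dependency on the classpath, and the token source is just an example - double-check the class names against the version in use.

import okhttp3.OkHttpClient;
import org.kohsuke.github.GitHub;
import org.kohsuke.github.GitHubBuilder;
import org.kohsuke.github.extras.okhttp3.OkHttpGitHubConnector;

public class OkHttpConnectorExample {
    public static void main(String[] args) throws Exception {
        // Reuse a single OkHttpClient; it pools connections and transparently
        // retries some transient connection failures on its own.
        OkHttpClient okHttp = new OkHttpClient();

        GitHub gitHub = new GitHubBuilder()
                .withOAuthToken(System.getenv("GITHUB_TOKEN")) // example token source
                .withConnector(new OkHttpGitHubConnector(okHttp))
                .build();

        System.out.println(gitHub.getRateLimit());
    }
}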
I’ve been running similar Jenkins infrastructure for the past 6 years without issue, but for the past 4 months it seems like GitHub has been changing something on their end to manage outages. Jenkins reliability for processing builds has gone down in my instance.
REST usage has been a challenge as well. We actually get dropped builds pretty regularly due to the fragility of relying on the REST APIs. Through debugging I've created a diagram to highlight problematic areas with red arrows. I've been rolling out internal patches to address it, and I have surfaced some in plugins. I am going to open more patches as I find time (off hours).
Here are some newer pain points: we're heavily hitting the secondary rate limit with the github-autostatus plugin (which I am removing next maintenance) and with the github branch source plugin.
We’ve recently started getting shared pipeline clone failures too (Git is now getting more limited), but I was able to solve that with the caching feature of the Groovy shared libraries config. All that's to say, the need for tuning against GitHub seems to be going up.
/rate_limit appears to be hit a lot (several times a second) by GitHub branch source plugin on our instance. I haven’t graphed the frequency.
@samrocketman Thanks for the detailed information. It definitely sounds like we need to update the core flow.
I'm particularly surprised that /rate_limit is hit a lot. ... Perhaps the improved rate limit checking that I implemented in this library wasn't wired up in the branch source plugin?
I could implement debug logging again and try to get a feel for the frequency. I'll let you know this coming week. I will try to get you some hard stats (number of jobs, frequency of incoming hooks, and a feel for how often this happens).
The rate limit API is always hit in two scenarios: where there is legitimate rate limit checking, and on the GitHub App credential type when it checks whether the credential is valid. The second type is the most frequent because Jenkins appears to check credential validity a lot.
Here's another example of flakiness which happens quite often and before you run out of API limit.
Hi @samrocketman @bitwiseman,
Our Jenkins instance has also been particularly impacted by this new GitHub rate limiting for the past month.
We already applied some workarounds like the one suggested in https://github.com/hub4j/github-api/issues/1728#issuecomment-1771970989 to cache shared libraries, but even with that, around 15-20% of our builds still fail.
Did you make any progress on this or did you find a workaround?
We didn't really experiment with the NoThrottle config on the github-branch-source-plugin, and are still using ThrottleOnOver, because I didn't fully understand the impact or whether that change was promising.
Anyway, we're ready to help with the investigation & fix if you have any clues!
Thanks a lot,
@samrocketman @KeepItSimpleStupid What is the rest of the log output for the exception? The nested exception may hold more information...
@KeepItSimpleStupid you don't want to use NoThrottle because it checks credential validity against the meta API endpoint. I found out it ate through my entire API limit when I configured it. So when I discovered that, paired with other comments, I switched back to ThrottleOnOver. Still not ideal, but its credential check behaves a bit better (it probably shouldn't attempt a check at all once configured, but that is a different matter).
Thanks for the explanation @samrocketman :)
@bitwiseman: here are some logs extracted from our Jenkins controller.
On the Jenkins agent side, it materializes as timeouts or errors for all actions involving our GitHub App credentials and the GitHub API: checkouts, publishing GitHub checks, tagging a commit, etc.
Some examples
It should be possible to have more verbose logs for both this lib and the Jenkins GitHub Branch Source plugin, let me know !
@KeepItSimpleStupid I posted a diagram here https://github.com/hub4j/github-api/issues/1728#issuecomment-1771938828
Invalid scan credentials… skipping is the third red vertical line (from left to right) connecting to GitHub. For an unknown reason, the app continuously checks its own validity using the rate limit API and gets throttled by GitHub. The method Connector.isCredentialValid is called often and is likely the cause of my performance issues (aside from being able to tweak this library's retry behavior).
I am considering releasing an internal fork of GHBS to hardcode return true for this.
@samrocketman
We could improve matters from the github-api side by changing getRateLimit() and getMeta() to cache results for some configurable number of seconds. If a consumer queries those endpoints more than once per second, the client is doing something wrong and the github-api library should mitigate that behavior. How does that sound?
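To make that concrete, the kind of caching I have in mind would look roughly like the sketch below - purely illustrative, the class and field names are invented and this is not existing github-api code:

import java.time.Duration;
import java.util.function.Supplier;

/**
 * Hypothetical time-based cache for endpoints like /rate_limit and /meta.
 * If callers ask more than once per TTL window, the cached value is returned
 * instead of issuing another request.
 */
class CachedEndpoint<T> {
    private final Supplier<T> fetcher; // performs the real API call
    private final long ttlNanos;       // configurable, e.g. via a system property
    private volatile T cached;
    private long lastFetchNanos;

    CachedEndpoint(Supplier<T> fetcher, Duration ttl) {
        this.fetcher = fetcher;
        this.ttlNanos = ttl.toNanos();
    }

    synchronized T get() {
        long now = System.nanoTime();
        if (cached == null || now - lastFetchNanos > ttlNanos) {
            cached = fetcher.get();
            lastFetchNanos = now;
        }
        return cached;
    }
}

With a default TTL of a second or two (exposed through a system property like the retry settings), the several-calls-per-second pattern described above would collapse to one request per window.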
Separately, if possible, whatever functionality GHBS is trying to provide with its isCredentialValid() should be moved to github-api, and GHBS should switch back to using GitHub.isCredentialValid(). But that is a later change once we halt the spamming.
@bitwiseman this would definitely improve things. I think being able to tweak the retries is important too, because once you get to a certain size (tens of thousands of jobs) I've found that I inevitably need to tweak how Jenkins interacts with GitHub from an API client perspective. I've been overhauling what I can internally, but these two changes:
Would greatly improve my situation.
@bitwiseman in particular it is this line which gets called excessively; should this be updated on GHBS? https://github.com/jenkinsci/github-branch-source-plugin/blob/7bec28752f4d250713a28dfe08fff4ebb7e4ee35/src/main/java/org/jenkinsci/plugins/github_branch_source/Connector.java#L557
I don't understand why isCredentialValid would need to be called at all; I can understand validating the credentials upon first configuration, but once it is already configured they should just be assumed valid. The user will find out they're invalid if they get revoked (i.e. they'll get permission denied errors in logs).
Here's where it gets called the most https://github.com/jenkinsci/github-branch-source-plugin/blob/90e17c48a6accf2a1a0f189131607261c84e304a/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMNavigator.java#L950-L952
These are similar logs that @KeepItSimpleStupid shared as well
How users see this is a dropped webhook. They open a PR and builds never start because a PR job is never created; the MBP couldn't find "valid" credentials and so skips creating the job for the ref.
In the case of users, they have to manually close and re-open the PR to try again with another webhook.
Even worse when this happens with Tag pipelines. A tag pipeline will not get created and the only resolution is a GitHub admin "replaying" the webhook of the tag push event. Users can't resend these events themselves so this creates excessive support burden on internal GH teams.
Overall the above adds to the perception that our services are poorly managed internally which is why I'm pushing out internal patches to hard-code these areas.
I've edited the above comments a few times; just mentioning I'm done editing it
@bitwiseman in particular it is this line which gets called excessively; should this be updated on GHBS? https://github.com/jenkinsci/github-branch-source-plugin/blob/7bec28752f4d250713a28dfe08fff4ebb7e4ee35/src/main/java/org/jenkinsci/plugins/github_branch_source/Connector.java#L557
I'd say yes. Basically, whatever clever things GHBS is doing should be done on the github-api side in isCredentialValid().
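Roughly what I have in mind, as a hypothetical sketch only (none of this exists in either project and the names are invented): remember a successful check for a while instead of issuing a request per event.

import java.time.Duration;

/** Hypothetical wrapper that remembers a successful credential check. */
class CredentialValidityCache {
    private final Duration revalidateAfter;
    private volatile Long lastSuccessNanos; // null until the first successful check

    CredentialValidityCache(Duration revalidateAfter) {
        this.revalidateAfter = revalidateAfter;
    }

    /** Stand-in for whatever cheap authenticated call the real check performs. */
    interface GitHubProbe {
        boolean check();
    }

    boolean isCredentialValid(GitHubProbe probe) {
        long now = System.nanoTime();
        Long last = lastSuccessNanos;
        if (last != null && now - last < revalidateAfter.toNanos()) {
            // Assume still valid; a revoked credential surfaces as 401s on real calls.
            return true;
        }
        boolean valid = probe.check(); // one real API request instead of one per event
        if (valid) {
            lastSuccessNanos = now;
        }
        return valid;
    }
}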
I don't understand why isCredentialValid would need to be called at all; I can understand validating the credentials upon first configuration but once it is already configured they should just be assumed valid. The user will find out they're invalid if they get revoked (i.e. they'll get permission denied errors in logs).
There's a lot about the SCM class structure design that is overly complex, which results in some odd/poor behavior. I could spend (and have previously spent) way too much time debugging and cleaning up GHBS code. Often I found the design of the SCM classes themselves prevented deeper cleanup.
Here's where it gets called the most https://github.com/jenkinsci/github-branch-source-plugin/blob/90e17c48a6accf2a1a0f189131607261c84e304a/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMNavigator.java#L950-L952
These are similar logs that @KeepItSimpleStupid shared as well
From user perspective
How users see this is a dropped webhook. They open a PR and builds never start because a PR job is never created; the MBP couldn't find "valid" credentials and so skips creating the job for the ref.
In the case of users, they have to manually close and re-open the PR to try again with another webhook.
Even worse when this happens with Tag pipelines. A tag pipeline will not get created and the only resolution is a GitHub admin "replaying" the webhook of the tag push event. Users can't resend these events themselves so this creates excessive support burden on internal GH teams.
Overall the above adds to the perception that our services are poorly managed internally which is why I'm pushing out internal patches to hard-code these areas.
I'm sorry to hear you're having to push out internal patches to stabilize your system. Hopefully the updates I'm doing in #1744 will address these, or otherwise allow you to mitigate them more easily.
No worries, @bitwiseman I didn't mean to come off as complaining. I think Jenkins code complexity is a good reason why something like this could happen.
I think my confusion stemmed from thinking it was intentional design, but your point is valid: it couldn't have been intended. It's pretty clear I'm also hitting up against the edges of performance, which means I will hit edge-case bugs most will not.
So reporting (and at times trying to fix) issues like this hopefully makes Jenkins and supporting libs better for everyone overall.
I can close this issue once I verify the release of the github-api plugin.
I just deployed the incremental build 1.318-454.v6624edb_76df9 of the github-api plugin in our infra: I will let you know the results in a few days.
Thanks again !
Since the plugin is released I'll close this; I'll report back my usage here and if there are any issues I'll re-open. Thanks @bitwiseman
@samrocketman https://github.com/hub4j/github-api/issues/1756 - The changes caused test failures in GHBS. You'll need to install the change manually.
@bitwiseman when you say manually, are you referring to downloading the HPI? Currently, I install plugins via maven (group/artifact/version) instead of JUC.
If the artifact is still available in https://repo.jenkins-ci.org I should be able to install it fine. WRT the test failures, does this mean it won't be published to JUC?
For the time being I'll reopen this issue until a release is GA for others.
The new version is still available in maven, I think it is just excluded from JUC.
This change is available in JUC (https://github.com/jenkinsci/bom/pull/2694) . Please wait to close until you've tried it locally and can report results.
@samrocketman @KeepItSimpleStupid Any updates? I'm very interested to hear what effect the changes had.
Hi @bitwiseman !
Unfortunately, I don't see any improvement in our infrastructure for the moment: still similar amounts of org.kohsuke.github.GitHubClient#logRetryConnectionError per day in our Jenkins controller logs :/
I'll try to find some time tomorrow to enable some debug logs to trace every request made by our Jenkins instance to the GitHub APIs and better understand what's going on, in a similar way to what @samrocketman did in https://github.com/hub4j/github-api/issues/1728#issuecomment-1771938828
Fewer than 15 developers use our Jenkins instance every day (plus some automation with Renovate/Dependabot), so we can't say our usage is extraordinary: all of this is really strange!
@KeepItSimpleStupid
😭 Darn. Please provide the Jenkins controller log like you did previously. There should be some improved logging even for that.
Speaking of which, your previous logs only showed a few cases where you were querying more than once per second, so it would make sense that you haven't seen much difference. Hm...
@KeepItSimpleStupid are you customizing the startup properties? There are a few you need to set to benefit from the PR fixes.
It's the retrying with random delays that will help. Set the retry-related system properties; the exact values I use are listed further down in this thread.
@samrocketman @KeepItSimpleStupid What's the word?
I'm sorry, guys, I had some emergencies to handle on my side these last few days: I'll give you more detailed feedback tonight!
I plan to roll this out within the next two weeks and report back.
So here are some fresh logs (curated), more detailed now thanks to what you added in the pull request ;)
Then, to answer @samrocketman: I didn't customize anything (yet); I was hoping that the caching added in the PR would be more than enough for our usage, which seems far less intensive than yours, but it seems I was wrong. I'll apply your suggested values tomorrow morning and see how it goes!
Thanks again to both of you, I'll make sure to give feedback more quickly about my next iterations :)
@KeepItSimpleStupid specifically if you want it to apply at boot then you can pass the following JVM options
-Dorg.kohsuke.github.GitHubClient.minRetryInterval=1000
-Dorg.kohsuke.github.GitHubClient.maxRetryInterval=3000
-Dorg.kohsuke.github.GitHubClient.retryCount=60
Or from the script console on the fly without restarting
System.setProperty("org.kohsuke.github.GitHubClient.minRetryInterval", "1000")
System.setProperty("org.kohsuke.github.GitHubClient.maxRetryInterval", "3000")
System.setProperty("org.kohsuke.github.GitHubClient.retryCount", "60")
The way Liam wrote that should work.
Deployed this morning by customizing the Jenkins helm chart with the value controller.javaOpts
controller:
  javaOpts: >-
    -Dorg.kohsuke.github.GitHubClient.minRetryInterval=1000
    -Dorg.kohsuke.github.GitHubClient.maxRetryInterval=3000
    -Dorg.kohsuke.github.GitHubClient.retryCount=60
I will give more feedback later, but the first occurrences seem to be OK: we stopped before reaching the 60 retries.
That checks out based on my experience. Here's info about secondary limits: https://docs.github.com/en/rest/overview/rate-limits-for-the-rest-api?apiVersion=2022-11-28#about-secondary-rate-limits
And upon reading, Jenkins easily uses up the 900 points per minute at times for REST. So retries spanning over a minute kind of make sense. Mine would retry up to 29 times; it looks like yours stops around 22 retries.
Oh! I was aware of these secondary limits, but didn't know about this "point system".
I was still really surprised that our ordinary Jenkins usage would trigger such severe rate limiting on GitHub's side, so I spent some time today collecting the GitHub audit log of API requests ... and it was worth it!
I discovered that a small in-house workload (it displays the version of the components deployed in our different environments and their GitHub release notes) was triggering 1000+ requests in the same second to the GitHub API, and doing that every 30 minutes.
After each volley of requests, I was able to observe our now-common org.kohsuke.github.GitHubClient#logRetryConnectionError entries in our Jenkins controller logs for 5 minutes, as shown in my previous comment.
I stopped the in-house workload at the end of the morning, and haven't observed a single occurrence in our Jenkins controller logs since :)
This in-house workload authenticates with a technical GitHub account, not with the same GitHub App used by our Jenkins instance, but both are hosted in the same Kubernetes cluster and so use the same source IP to request the GitHub API: I suppose the in-house workload was triggering some secondary limits based on the source IP, and our Jenkins instance was just collateral damage of this bad behavior.
I'll wait to observe the behavior during some weekdays with our real Jenkins usage and report here.
Thanks again @bitwiseman @samrocketman; even if Jenkins was not the culprit in my case, the enhancements to this library and the discussions with you really helped me finally get to the bottom of it!
Since I bake our infrastructure, I decided to bake the tuning properties into init.groovy.d Jenkins hook scripts. That's one of the advantages of being able to tune this on the fly, @bitwiseman; static properties couldn't be modified with init.groovy.d because it runs too late in JVM startup. It also reduces the number of JVM tuning options I have to pass.
Here's my startup script $JENKINS_HOME/init.groovy.d/boot_77_github-api_client_tuning.groovy
/**
Tune github-api client https://github.com/hub4j/github-api/issues/1728#issuecomment-1846337277
defaults: minRetryInterval=100, maxRetryInterval=100, retryCount=2
*/
System.setProperty("org.kohsuke.github.GitHubClient.minRetryInterval", "1000")
System.setProperty("org.kohsuke.github.GitHubClient.maxRetryInterval", "3000")
System.setProperty("org.kohsuke.github.GitHubClient.retryCount", "60")
/**
Tune scm-filter-jervis plugin client
https://github.com/jenkinsci/scm-filter-jervis-plugin/blob/80e28289ede66fa18553b4e0ca5f518a4bd782bc/src/main/groovy/net/gleske/scmfilter/impl/trait/JervisFilterTrait.groovy#L55-L79
defaults: minRetryInterval=1000, maxRetryInterval=3000, retryLimit=30
*/
//System.setProperty("net.gleske.scmfilter.impl.trait.JervisFilterTrait.minRetryInterval", "1000")
//System.setProperty("net.gleske.scmfilter.impl.trait.JervisFilterTrait.maxRetryInterval", "3000")
System.setProperty("net.gleske.scmfilter.impl.trait.JervisFilterTrait.retryLimit", "60")
This evening after some testing I've rolled the lib out to production.
One thing I notice is that the GitHub Checks plugin https://plugins.jenkins.io/github-checks/ still inconsistently submits the Jenkins context. For example, it will submit a pending context to the commit with log links to GitHub checks, but when the build completes, the GitHub checks status still shows pending even though the build is no longer running. I'm still investigating this to make sure I have everything set up correctly.
I've narrowed down the likely problematic area to https://github.com/jenkinsci/github-checks-plugin/blob/c3c161b5f741f921d818a76d10698a5e260d5255/src/main/java/io/jenkins/plugins/checks/github/GitHubChecksPublisher.java#L81-L82 with a source review. I've enabled debug logs for this class.
The classes and methods involved appear to be:
- org.kohsuke.github.GitHub
- GHRepository.updateCheckRun
- GHCheckRunBuilder
- GitHub.createRequest (where client appears to be GitHubClient)
- GitHubRequest.method
I'm having trouble reviewing exactly how GitHubClient is used, but somehow it does not appear to be using the retry logic. Hopefully, I'll learn more as debug logs come in.
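For reference, I enable those FINER logs with a Jenkins custom log recorder; the plain java.util.logging equivalent looks roughly like the sketch below, just to show the logger name and level involved (the class name here is made up):

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class EnableGitHubClientTrace {
    public static void main(String[] args) {
        // Raise the github-api client logger so the request/response lines and
        // X-RateLimit header notes (FINE/FINER) are emitted.
        Logger clientLogger = Logger.getLogger("org.kohsuke.github.GitHubClient");
        clientLogger.setLevel(Level.FINER);

        // Default handlers filter at INFO; attach one that lets FINER through.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINER);
        clientLogger.addHandler(handler);
    }
}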
I will say that this last bit (GitHub checks) is merely a UI bug for my users now. They just get a little confused when it doesn't vote back, but they're not blocked. Your retry logic fixes the critical issue of jobs not getting created at all in MBP, which is huge :grin:.
I'll keep this open to track fixes for GitHub checks plugin, though.
I notice lots of (7e0e663e) Missing or malformed X-RateLimit header: X-RateLimit-Limit: null in the logs.
What is the URL for the null rate limit errors?
Here's a few examples from today
2023-12-15 15:25:01.100+0000 [id=1354381] FINER org.kohsuke.github.GitHubClient#logResponse: (332cf569) GitHub API response: https://api.github.com/app
2023-12-15 15:25:01.100+0000 [id=1354381] FINER org.kohsuke.github.GitHubClient#noteRateLimit: (332cf569) Missing or malformed X-RateLimit header: X-RateLimit-Limit: null
2023-12-15 15:25:01.101+0000 [id=1354381] FINE org.kohsuke.github.GitHubClient#logRequest: (35bf107a) GitHub API request: GET https://api.github.com/app/installations
2023-12-15 15:25:01.158+0000 [id=1354381] FINER org.kohsuke.github.GitHubClient#logResponse: (35bf107a) GitHub API response: https://api.github.com/app/installations
2023-12-15 15:25:01.158+0000 [id=1354381] FINER org.kohsuke.github.GitHubClient#noteRateLimit: (35bf107a) Missing or malformed X-RateLimit header: X-RateLimit-Limit: null
2023-12-15 15:25:01.159+0000 [id=1354381] FINE org.kohsuke.github.GitHubClient#logRequest: (3c31430e) GitHub API request: POST https://api.github.com/app/installations/REDACTED_ID/access_tokens
2023-12-15 15:25:01.254+0000 [id=1354381] FINER org.kohsuke.github.GitHubClient#logResponse: (3c31430e) GitHub API response: https://api.github.com/app/installations/REDACTED_ID/access_tokens
2023-12-15 15:25:01.254+0000 [id=1354381] FINER org.kohsuke.github.GitHubClient#noteRateLimit: (3c31430e) Missing or malformed X-RateLimit header: X-RateLimit-Limit: null
However, I keep 1000 copies of debug logs (and use system logrotate and gzip to compress logs daily) so I have the logs from yesterday when I made that comment as well. This is so I can keep 30 days of debug logs for investigations and not run out of storage (compression is absolutely necessary).
Here's the logs from yesterday around the time I noted the issue.
2023-12-14 17:35:13.195+0000 [id=563751] FINER org.kohsuke.github.GitHubClient#logResponse: (5ee9ab96) GitHub API response: https://api.github.com/repos/REDACT_ORG/REDACT_REPO
2023-12-14 17:35:13.195+0000 [id=563751] FINER org.kohsuke.github.GitHubClient#noteRateLimit: (5ee9ab96) Missing or malformed X-RateLimit header: X-RateLimit-Limit: null
All of the errors from yesterday's comment were associated with the /repos/[org]/[name] endpoint. The header was inconsistently missing but it happened quite frequently. In UTC time I see it occurring from Dec 13 23:33 through right now.
From Dec 13 23:33 UTC-0 through to now, malformed X-RateLimit appears 7788 times. In the past 24 hours it appears 7123 times.
Source:
# using support plugin
cd /var/lib/logs/custom
find . -type f -mtime -1 -print0 | xargs -0 zgrep -F 'malformed X-RateLimit' | wc -l
7123
zgrep -F 'malformed X-RateLimit' * | wc -l
7788
It only occurs in org.kohsuke.github.GitHubClient
An update: our stability has significantly improved. There's still occasional flakiness with the GitHub Checks plugin not submitting its own commit status (e.g. still in progress after the job finishes). But overall, developer interruption is resolved.
I’m going to call this resolved. The inability for GH checks to add status is rare and probably not worth chasing. Thanks for providing tunable options.
The bug is in the following lines of code
https://github.com/hub4j/github-api/blob/1cb9e66f7a762ad35f22a19ec854e2f8c4c6d45e/src/main/java/org/kohsuke/github/GitHubClient.java#L47-L52
It will only retry twice with 200ms in between retries.
Why this is a bug
The GitHub Branch Source Jenkins plugin will occasionally drop builds from received webhooks. The GHBS plugin relies on the GitHub plugin, which relies on the github-api plugin, which provides this library as a client. Here's an exception from multibranch pipeline events.
How many retries should it be?
In practice, after rolling out this patch in another plugin (one that does not use this library but does interact with GitHub), I've seen the GitHub API retried 28 times over the course of 1 minute. The retry limit for that was 30, sleeping randomly between 1000 and 3000 ms. I've since increased the retry cap to 60 in my production system for that plugin.
https://github.com/jenkinsci/scm-filter-jervis-plugin/commit/c21d0c1d936d1ab71d93aba7684e00fa68d6e67e#diff-46da075f8e1e17ccf594a86bc96e8c8eaf295617b81d5ae5206634e37021bb49R119-R122
Ideal solution
You can keep the same retry defaults, but I would like this to be tunable by the user via system properties on the fly. That means do not bake the values into static properties except as defaults.
And for sleeping between retries I would like it to be random instead of a fixed value.
These should have sane defaults but allow clients to tune them on the fly. A random delay between retries is a cloud best practice for interacting with distributed systems. See also https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
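To illustrate the ask, here is a sketch only (not a patch against GitHubClient): the property names match the ones discussed above, the defaults match the values noted elsewhere in this thread (minRetryInterval=100, maxRetryInterval=100, retryCount=2), and everything else is invented for illustration. The key points are reading the properties at call time and sleeping a random amount between retries.

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

class RetryWithJitter {
    private static final String PREFIX = "org.kohsuke.github.GitHubClient.";

    interface CheckedSupplier<T> {
        T get() throws IOException;
    }

    static <T> T callWithRetries(CheckedSupplier<T> request) throws IOException, InterruptedException {
        // Read the tunables on every call so System.setProperty(...) from the
        // script console takes effect without a restart.
        int retryCount = Math.max(0, Integer.getInteger(PREFIX + "retryCount", 2));
        long minInterval = Long.getLong(PREFIX + "minRetryInterval", 100L);
        long maxInterval = Long.getLong(PREFIX + "maxRetryInterval", 100L);

        IOException lastFailure = null;
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            try {
                return request.get();
            } catch (IOException e) {
                lastFailure = e;
                // Random delay (jitter) between retries spreads out retry storms.
                long sleepMillis = maxInterval > minInterval
                        ? ThreadLocalRandom.current().nextLong(minInterval, maxInterval + 1)
                        : minInterval;
                Thread.sleep(sleepMillis);
            }
        }
        throw lastFailure;
    }
}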