sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Rethink monitoring of remaining code host rate limits #16455

Closed. tsenart closed this issue 3 years ago.

tsenart commented 3 years ago

With the introduction of user-added external services, we now have a potentially huge number of tokens to monitor. We can't add each token as a label to the src_github_rate_limit_remaining metric, because the cardinality would be too high.
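For illustration, here is a minimal sketch (not the actual definition in the repo) of what a per-token gauge would look like, and why it doesn't scale: every user-added token becomes its own time series.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical sketch, not the repo's actual metric definition: labelling the
// gauge by token creates one time series per user-added external service,
// i.e. unbounded cardinality.
var rateLimitRemaining = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "src_github_rate_limit_remaining",
	Help: "Remaining calls to GitHub before hitting the rate limit.",
}, []string{"token"})

// recordRemaining sets the gauge for a single token's remaining quota.
func recordRemaining(token string, remaining float64) {
	rateLimitRemaining.WithLabelValues(token).Set(remaining)
}
```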

As a result, the alert [Cloud] [CRITICAL] github-proxy: less than 500 remaining calls to GitHub before hitting the rate limit for 5m0s is incorrect: the metric flips up and down as it cycles through multiple tokens instead of tracking a single one.

We could instead capture the percentage of external services whose token quotas are below a certain threshold, or monitor only site-level external services.
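A rough sketch of the first idea, with names that are assumptions rather than existing code: export a single ratio instead of one series per token.

```go
package metrics

// Hypothetical sketch of the "percentage below threshold" idea; none of these
// names exist in the repo. Instead of one series per token, export a single
// ratio of external services whose remaining quota is under a threshold.
const lowQuotaThreshold = 500

// percentBelowThreshold takes remaining-quota readings keyed by external
// service ID (how they are collected is out of scope here) and returns the
// fraction whose quota is below the threshold.
func percentBelowThreshold(remainingByService map[int64]int) float64 {
	if len(remainingByService) == 0 {
		return 0
	}
	var low int
	for _, remaining := range remainingByService {
		if remaining < lowQuotaThreshold {
			low++
		}
	}
	return float64(low) / float64(len(remainingByService))
}
```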

github-actions[bot] commented 3 years ago

Heads up @tsenart - the "team/cloud" label was applied to this issue.

ryanslade commented 3 years ago

I think we should only monitor site-level external services, and specifically on .com we should only monitor our "global" GitHub connection.

The easiest way to do this is going to be to add a db connection to github-proxy, as we recently did with gitserver: https://github.com/sourcegraph/sourcegraph/pull/16121

I'm not super happy with this, but I don't see a better way of doing it right now without decent service discovery or promoting one of our other services to be the "source of truth" for this kind of thing.

We can have it query for the token(s) periodically, cache the result locally, and then grab the token out of incoming requests for comparison.
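A minimal sketch of that flow, assuming a hypothetical token cache inside github-proxy (the types and function names here are stand-ins, not existing code): refresh the set of site-level tokens from the database on a timer, then check each request's token against the cached set.

```go
package proxy

import (
	"context"
	"sync"
	"time"
)

// tokenCache holds the set of site-level tokens that should be monitored.
type tokenCache struct {
	mu     sync.RWMutex
	tokens map[string]struct{}
}

// run refreshes the cache every interval until ctx is cancelled. fetch stands
// in for the database query that returns site-level external service tokens.
func (c *tokenCache) run(ctx context.Context, interval time.Duration, fetch func(context.Context) ([]string, error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if tokens, err := fetch(ctx); err == nil {
			set := make(map[string]struct{}, len(tokens))
			for _, t := range tokens {
				set[t] = struct{}{}
			}
			c.mu.Lock()
			c.tokens = set
			c.mu.Unlock()
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// isSiteLevel reports whether a token seen on an incoming request belongs to
// a site-level external service and should therefore be monitored.
func (c *tokenCache) isSiteLevel(token string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, ok := c.tokens[token]
	return ok
}
```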

@tsenart How does this sound?

tsenart commented 3 years ago

The easiest way to do this is going to be to add a db connection to github-proxy as we recently did with gitserver, #16121

I think it'd be much easier to expose this metric from repo-updater instead. We have the data there, since we talk to the code hosts through sources. And we can filter there by external service type.
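As a sketch of the filtering suggested here, under the assumption that repo-updater can tell user-added services apart from site-level ones (the type and field names below are stand-ins, not the repo's actual types):

```go
package repoupdater

// Hypothetical sketch of the filtering idea; externalService and its fields
// are stand-ins. Only site-level GitHub external services would be considered
// for rate-limit metric collection.
type externalService struct {
	Kind            string // e.g. "GITHUB"
	NamespaceUserID int32  // non-zero for user-added services (assumption)
}

// shouldExportRateLimitMetrics reports whether repo-updater exports
// rate-limit metrics for this external service.
func shouldExportRateLimitMetrics(svc externalService) bool {
	return svc.Kind == "GITHUB" && svc.NamespaceUserID == 0
}
```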

ryanslade commented 3 years ago

I started looking into this today, so I'm leaving some notes here, as I'm off until next week.

We now have global rate limit monitors keyed by code host and token, but they don't correctly differentiate between the REST, GraphQL, and search limits, so we'll need to fix that first. Note that they all use the same token / API URL, which means they'll share the same monitor:

https://github.com/sourcegraph/sourcegraph/blob/main/cmd/repo-updater/repos/github.go#L126-L128

Then we need to update NewGitHubSource so that it determines whether we are using a "site-level" external service and enables metric collection only if we are.
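One way to picture the first fix, as a hedged sketch rather than the repo's actual monitor code: key the monitors by API URL, token, and resource so the REST, GraphQL, and search limits each get their own monitor.

```go
package github

import "sync"

// Hypothetical sketch: key rate-limit monitors by API URL, token, and
// resource (rest, graphql, search) so the three GitHub limits no longer
// share a single monitor. rateLimitMonitor is a stand-in type.
type monitorKey struct {
	apiURL   string
	token    string
	resource string // "rest" | "graphql" | "search"
}

type rateLimitMonitor struct {
	remaining int
}

var (
	monitorsMu sync.Mutex
	monitors   = map[monitorKey]*rateLimitMonitor{}
)

// monitorFor returns the monitor for a given (apiURL, token, resource),
// creating it on first use. NewGitHubSource would decide, per external
// service, whether readings from these monitors are exported as metrics.
func monitorFor(apiURL, token, resource string) *rateLimitMonitor {
	monitorsMu.Lock()
	defer monitorsMu.Unlock()
	key := monitorKey{apiURL: apiURL, token: token, resource: resource}
	m, ok := monitors[key]
	if !ok {
		m = &rateLimitMonitor{}
		monitors[key] = m
	}
	return m
}
```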

We can collect our rate limit remaining with the following labels:

Given that these are only site-level external services, the cardinality should stay small.
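The specific label list isn't captured above; purely as a hypothetical illustration of a low-cardinality shape, a gauge labelled by external service name and resource would keep the series count bounded by the number of site-level services.

```go
package github

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical label set for illustration only; the actual labels chosen in
// the issue are not shown above. Because only site-level external services
// are monitored, the number of (name, resource) combinations stays small.
var rateLimitRemainingGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "src_github_rate_limit_remaining",
	Help: "Remaining calls to the code host API before hitting the rate limit.",
}, []string{"name", "resource"})
```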