QA Test Plan
Summary
PRs that implement this feature:
PRs that implement the Prometheus dashboards:
Test Plan
This feature touches every aspect of gRPC. It has undergone extensive unit testing and has been running on sourcegraph.com with no issues for over a month.
This test plan consists of:
1) A basic local test for one method
2) Examining the Prometheus retry dashboards for every service to see if there are any oddities
Basic local test
I ran Sourcegraph @ 5.3 with
sg start --except gitserver-0 && sg start monitoring
(note that one of the gitserver instances is disabled).
I then ran the following diff search with tracing enabled: https://sourcegraph.test:3443/search?q=context:global+test+type:diff&patternType=keyword&sm=0&trace=1
That produced the following trace:
example_trace.json
The Grafana dashboard for gitserver also shows the expected spike in retry count:
Running the same search without the disabled gitserver (sg start && sg start monitoring) shows the expected search results.
The trace also shows that the request wasn't retried:
good_trace.json
sourcegraph.com Prometheus dashboards
frontend
We don't have an inordinate number of retries.
The spike in retries corresponds to an Unavailable response from the server, so the feature seems to be working correctly:
gitserver
https://sourcegraph.com/-/debug/grafana/d/gitserver/git-server?orgId=1&from=1706653431512&to=1707258231512&viewPanel=100802
The spikes here also correspond to rollouts, so the feature is working correctly.
searcher
The spikes here also correspond to rollouts, so the feature is working correctly.
symbols
The spike here also corresponds to a rollout, so the feature is working correctly.
repo-updater
There have been no retries attempted in the past 7 days.
QA Checklist
Have you made any infra related changes to environment variables, new services, or deployment methods that could affect customers?
If your change is non-trivial, please review the Cloud Launch process.
If you've made changes to documentation, please link them in the comments below.
Comments:
Which environments have the changes been tested on?
Have experimental features been marked and placed behind a feature flag?
If no, please specify why: This feature isn't experimental. It has also been running for over a month on sourcegraph.com and S2.
Has telemetry been added as part of the product requirements?
Completed entry to release post.
Is the feature flagged in such a way that, when the flag is turned off, it behaves as if the feature does not exist?
Yes, you can set
SRC_GRPC_RETRY_MAX_ATTEMPTS=0
on every service: https://github.com/sourcegraph/sourcegraph/blob/afadc0ab3adabe5a1a734c3bd402e8764db89ad8/internal/grpc/defaults/retry.go#L19
A CHANGELOG entry has been added for the feature/change?
Please provide any additional testing you've done that has not been covered above:
N/A
Tech Lead/DRI sign off: @kalanchan