ggilmore commented 8 months ago

QA Test Plan

Summary

PRs that implement this feature:

PRs that implement the Prometheus dashboards:

Test Plan

This feature touches every aspect of gRPC and has undergone extensive unit testing, and has been running on sourcegraph.com with no issues for over a month now.

This test plan will consist of:

1) A basic local test for one method 2) Examining the Prometheus retry dashboards for every service to see if there are any oddities

basic local test

I ran sourcegraph @ 5.3 with sg start --except gitserver-0 && sg start monitoring (notice that I have one of the gitserver instances disabled).

I then ran the following diff search with tracing enabled: https://sourcegraph.test:3443/search?q=context:global+test+type:diff&patternType=keyword&sm=0&trace=1

context:global test type:diff

That produced the following trace:

Screenshot 2024-02-05 at 16-29-57 Jaeger UI

example_trace.json

The Grafana dashboard for gitserver also shows the expected spike in retry count:

Running the same search without the disabled gitserver (sg start && sg start monitoring) shows the expected search results:

The trace also shows that the request wasn't retried:

Screenshot 2024-02-05 at 16-39-42 Jaeger UI

good_trace.json

sourcegraph.com Prometheus dashboards

frontend

screencapture-sourcegraph-debug-grafana-d-frontend-frontend-2024-02-05-16_51_49

We don't have an inordinate amount of retries.

The spike in retries corresponds to an Unavailable response from the server, so the feature seems to be working correctly:'

gitserver

https://sourcegraph.com/-/debug/grafana/d/gitserver/git-server?orgId=1&from=1706653431512&to=1707258231512&viewPanel=100802

screencapture-sourcegraph-debug-grafana-d-gitserver-git-server-2024-02-05-16_53_06

The spikes here also correspond to rollouts, so the feature is working correctly.

searcher

screencapture-sourcegraph-debug-grafana-d-searcher-searcher-2024-02-05-16_55_06

The spikes here also correspond to rollouts, so the feature is working correctly.

symbols

screencapture-sourcegraph-debug-grafana-d-symbols-symbols-2024-02-05-16_56_27

The spike here also corresponds to a rollouts, so the feature is working correctly.

repo-updater

screencapture-sourcegraph-debug-grafana-d-repo-updater-repo-updater-2024-02-05-16_53_54

There have been no retries attempted in the past 7 days.

QA Checklist

Have you made any infra related changes to environment variables, new services, or deployment methods that could affect customers?
- [ ] Yes - I've informed #team-cloud and #team-delivery of the changes.
- [ ] Yes - Changelog or documentation has been updated, this includes changes to defaults and site config flags.
- [x] No
If your change is non-trivial, please review the Cloud Launch process.

If you've made changes to documentation, please link them in the comments below.

Comments:
Which environments have the changes been tested on?
- [x] sourcegraph.com
Experimental features have been marked and behind a feature flag?
- [x] N/A
If no, please specify why: This feature isn't experimental. It has also been running for over a month on sourcegraph.com and S2.
Has telemetry been added as part of the product requirements?
- [x] N/A
Completed entry to release post.
- [x] N/A
Is a feature flagged in a way when turns the feature off, it behaves as-if the feature does not exist?
- [x] Yes

Yes, you can set SRC_GRPC_RETRY_MAX_ATTEMPTS=0 on every service: https://github.com/sourcegraph/sourcegraph/blob/afadc0ab3adabe5a1a734c3bd402e8764db89ad8/internal/grpc/defaults/retry.go#L19

A CHANGELOG entry has been added for the feature/change?
- [x] Yes: https://github.com/sourcegraph/sourcegraph/blob/8868eb5aa83f6f5e892bbe1d5f558959a4c60d64/CHANGELOG.md?plain=1#L91
Please provide any additional testing you've done that has not been covered above:

N/A

Tech Lead/DRI sign off: @kalanchan

eseliger commented 8 months ago

Great work, very clean journal!

eseliger commented 8 months ago

oops wrong button

eseliger commented 8 months ago

@kalanchan if this LGTM, can you please close the issue - thanks :)

eseliger commented 8 months ago

Release done, closing.

sourcegraph / sourcegraph-public-snapshot

gRPC Retry Test Plan #60191

QA Test Plan

Summary

Test Plan

basic local test

sourcegraph.com Prometheus dashboards

frontend

gitserver

searcher

symbols

repo-updater

QA Checklist