sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.1k stars 1.28k forks source link

gRPC Retry Test Plan #60191

Closed ggilmore closed 8 months ago

ggilmore commented 8 months ago

QA Test Plan

Summary

PRs that implement this feature:

PRs that implement the Prometheus dashboards:

Test Plan

This feature touches every aspect of gRPC and has undergone extensive unit testing, and has been running on sourcegraph.com with no issues for over a month now.

This test plan will consist of:

1) A basic local test for one method 2) Examining the Prometheus retry dashboards for every service to see if there are any oddities

basic local test

I ran sourcegraph @ 5.3 with sg start --except gitserver-0 && sg start monitoring (notice that I have one of the gitserver instances disabled).

I then ran the following diff search with tracing enabled: https://sourcegraph.test:3443/search?q=context:global+test+type:diff&patternType=keyword&sm=0&trace=1

context:global test type:diff

That produced the following trace:

Screenshot 2024-02-05 at 16-29-57 Jaeger UI

example_trace.json

The Grafana dashboard for gitserver also shows the expected spike in retry count:

Screenshot 2024-02-05 at 4 36 12 PM

Running the same search without the disabled gitserver (sg start && sg start monitoring) shows the expected search results:

Screenshot 2024-02-05 at 4 42 14 PM

The trace also shows that the request wasn't retried:

Screenshot 2024-02-05 at 16-39-42 Jaeger UI

good_trace.json

sourcegraph.com Prometheus dashboards

frontend

screencapture-sourcegraph-debug-grafana-d-frontend-frontend-2024-02-05-16_51_49

We don't have an inordinate amount of retries.

The spike in retries corresponds to an Unavailable response from the server, so the feature seems to be working correctly:'

Screenshot 2024-02-05 at 9 38 42 PM

gitserver

https://sourcegraph.com/-/debug/grafana/d/gitserver/git-server?orgId=1&from=1706653431512&to=1707258231512&viewPanel=100802

screencapture-sourcegraph-debug-grafana-d-gitserver-git-server-2024-02-05-16_53_06

The spikes here also correspond to rollouts, so the feature is working correctly.

searcher

screencapture-sourcegraph-debug-grafana-d-searcher-searcher-2024-02-05-16_55_06

The spikes here also correspond to rollouts, so the feature is working correctly.

symbols

screencapture-sourcegraph-debug-grafana-d-symbols-symbols-2024-02-05-16_56_27

The spike here also corresponds to a rollouts, so the feature is working correctly.

repo-updater

screencapture-sourcegraph-debug-grafana-d-repo-updater-repo-updater-2024-02-05-16_53_54

There have been no retries attempted in the past 7 days.

QA Checklist

  1. Have you made any infra related changes to environment variables, new services, or deployment methods that could affect customers?

    • [ ] Yes - I've informed #team-cloud and #team-delivery of the changes.
    • [ ] Yes - Changelog or documentation has been updated, this includes changes to defaults and site config flags.
    • [x] No

    If your change is non-trivial, please review the Cloud Launch process.

    If you've made changes to documentation, please link them in the comments below.

    Comments:

  2. Which environments have the changes been tested on?

    • [x] sourcegraph.com
  3. Experimental features have been marked and behind a feature flag?

    • [x] N/A

    If no, please specify why: This feature isn't experimental. It has also been running for over a month on sourcegraph.com and S2.

  4. Has telemetry been added as part of the product requirements?

    • [x] N/A
  5. Completed entry to release post.

    • [x] N/A
  6. Is a feature flagged in a way when turns the feature off, it behaves as-if the feature does not exist?

    • [x] Yes

Yes, you can set SRC_GRPC_RETRY_MAX_ATTEMPTS=0 on every service: https://github.com/sourcegraph/sourcegraph/blob/afadc0ab3adabe5a1a734c3bd402e8764db89ad8/internal/grpc/defaults/retry.go#L19

  1. A CHANGELOG entry has been added for the feature/change?

  2. Please provide any additional testing you've done that has not been covered above:

N/A


Tech Lead/DRI sign off: @kalanchan

eseliger commented 8 months ago

Great work, very clean journal!

eseliger commented 8 months ago

oops wrong button

eseliger commented 8 months ago

@kalanchan if this LGTM, can you please close the issue - thanks :)

eseliger commented 8 months ago

Release done, closing.