This looks related to #51323 and to a problem that needs to be addressed in net/http (possibly via a change to golang.org/x/net/http2). It's not viable for maintner to stop depending on net/http or on Google front end servers, so I'm not sure what actions are available in this package.
Either way, if we suspect an error in an HTTP/2 implementation (either Go's or Google's), would it make sense for maintner to work around it by retrying some bounded number of INTERNAL_ERROR results during syncing?
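For illustration, a bounded retry might look like the sketch below. It assumes the client goes through golang.org/x/net/http2 so the failure surfaces as that package's StreamError type; with the bundled HTTP/2 copy inside net/http the error type isn't exported, and one may have to fall back to matching the error string instead. The syncWithRetry wrapper and its retry limit are hypothetical, not maintner's actual API:

import (
	"errors"

	"golang.org/x/net/http2"
)

// syncWithRetry retries the given sync operation a bounded number
// of times, but only when it fails with an HTTP/2 stream error
// carrying the INTERNAL_ERROR code. (Hypothetical helper.)
func syncWithRetry(sync func() error) error {
	const maxRetries = 3
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = sync(); err == nil {
			return nil
		}
		var se http2.StreamError
		if !errors.As(err, &se) || se.Code != http2.ErrCodeInternal {
			return err // some other failure; don't mask it
		}
		// INTERNAL_ERROR from the peer; try the segment again.
	}
	return err
}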
(But this probably doesn't need to be a priority either way unless the failure rate increases. Mostly I've filed the issue in case it indicates some deeper systemic problem in conjunction with other maintner and/or HTTP/2 issues.)
Maintner already does retries at a higher level. I don't think we should add code to it to detect and handle an internal error coming from the net/http package, since that is out of scope. The real problem needs to be fixed in the HTTP layer, not in maintner, which is one of many programs using it.
We can possibly add the retry to TestCorpusCheck as a test-only workaround sooner.
greplogs -l -e 'getting corpus: syncing segment \d+: stream error: stream ID \d+; INTERNAL_ERROR' --since=2022-03-30
2022-05-12T21:45:28-4aa4d2e-6365efb/linux-amd64-longtest
2022-04-12T22:05:26-2897e13-d85694a/windows-amd64-longtest
2022-03-31T14:57:35-27fe37a-378221b/linux-amd64-longtest
2022-03-31T14:51:17-27fe37a-2ea9376/linux-amd64-longtest
(But still nothing to be done here until we make progress on #51323, I think.)
Based on the progress of the investigation in #51323, the conclusion is that the net/http package and the HTTP/2 protocol are working as intended; there isn't an internal error to be fixed elsewhere. The one real shortcoming is that the error message doesn't make it easy to see that the error comes from a remote server for a possibly unavoidable reason (such as backend HTTP servers restarting, with graceful HTTP/2 stream shutdown intentionally left out of scope).
Since maintner already does retries, the problem is only in the test, and adding a retry to the test will be a complete fix for this flaky-test issue (rather than a workaround, as we previously thought).
Change https://go.dev/cl/414174 mentions this issue: maintner: retry network operations that may fail in getNewSegments
I initially wanted to update this issue with the latest status, and then saw that it'd be quick to send a trivial CL to close it. All that was needed was a small change to the following flaky-test-generating logic in the two copies of the getGoData helper:
corpusCache, err = Get(context.Background())
if err != nil {
-	tb.Fatalf("getting corpus: %v", err)
+	// Occasionally hitting a non-zero number of network errors
+	// while downloading 2+ GB of data from the internet
+	// is NOT unexpected!
+	//
+	// Doing tb.Fatalf here means we're producing a non-actionable
+	// test failure, and while we haven't implemented go.dev/issue/19177 yet,
+	// any non-zero frequency of false-positive flaky test failures
+	// may create additional manual triage work...
+	//
+	// So just handle the error from godata.Get by
+	// trying again or skipping the test; either way it
+	// would likely be better than tb.Fatalf given the constraints above.
}
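Concretely, the replacement could look roughly like the sketch below, which slots into the getGoData helper shown above. The retry count and the choice to skip rather than fail after exhausting retries are my assumptions, not necessarily what the CL does:

corpusCache, err = Get(context.Background())
for attempt := 1; err != nil && attempt < 3; attempt++ {
	tb.Logf("getting corpus (attempt %d): %v", attempt, err)
	corpusCache, err = Get(context.Background())
}
if err != nil {
	// Still failing after retries; most likely a transient
	// network problem, so skip instead of reporting a
	// non-actionable failure.
	tb.Skipf("skipping: getting corpus: %v", err)
}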
While doing that, I tried to confirm my claim that our maintner-using programs handle retries (which I knew to be true without looking earlier, since they haven't needed manual intervention to keep running in a very long time). It turns out they all retry not just by reloading the corpus, but by fatally exiting the program and having k8s restart the pod. That is functional but feels computationally wasteful. I didn't want to add retry loops to the many godata.Get invocations in multiple x/build programs...
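(That fatal-exit-and-restart pattern is essentially the following; a sketch, not a verbatim excerpt from any of those programs:)

corpus, err := godata.Get(context.Background())
if err != nil {
	// Exiting fatally here leans on the Kubernetes restart
	// policy: the pod comes back up and godata.Get runs again
	// from scratch, which acts as the retry.
	log.Fatalf("godata.Get: %v", err)
}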
So that's the story of how I ended up with CL 414174. It felt easier to just detect possibly retryable network problems and retry them in netMutSource.getNewSegments, allowing the least amount of partial work to be thrown away unnecessarily due to an occasional network error. (I tried to split it into two smaller changes, but that didn't work out well: the changes in nesting intersected with both logical changes.)
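The shape of that change is roughly the sketch below. The fetchSegments helper, the retry bound, and the string-based error check are illustrative assumptions on my part; see CL 414174 for the actual code:

// Hypothetical shape of the retry loop; not the actual CL.
func (ns *netMutSource) getNewSegments(ctx context.Context) ([]fileSeg, error) {
	const maxTries = 3
	var lastErr error
	for try := 1; try <= maxTries; try++ {
		segs, err := ns.fetchSegments(ctx) // assumed helper doing the network work
		if err == nil {
			return segs, nil
		}
		if !possiblyRetryableNetworkError(err) {
			return nil, err
		}
		lastErr = err
		log.Printf("getNewSegments: possibly retryable error on try %d/%d: %v", try, maxTries, err)
	}
	return nil, lastErr
}

// possiblyRetryableNetworkError reports whether err looks like a
// transient network problem, such as an HTTP/2 stream error with
// code INTERNAL_ERROR from a restarting backend.
func possiblyRetryableNetworkError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "INTERNAL_ERROR")
}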
greplogs --dashboard -md -l -e 'getting corpus: syncing segment \d+: stream error: stream ID \d+; INTERNAL_ERROR'
2022-03-29T15:43:06-e96d8cf-ae9ce82/windows-amd64-longtest
2021-04-08T21:58:35-83a8520-d67e739/windows-amd64-longtest
2021-04-08T19:58:50-83a8520-bb76193/windows-amd64-longtest
2020-10-19T18:36:14-2476803-06839e3/linux-amd64-longtest
(CC @golang/release)