This looks related to #51323 and to a problem that needs to be addressed in net/http (possibly via a change to golang.org/x/net/http2). It's not viable for maintner to stop depending on net/http or on Google front end servers, so I'm not sure what actions are available in this package.
Either way, if we suspect an error in an HTTP/2 implementation (either Go's or Google's), would it make sense for maintner to work around it by retrying some bounded number of INTERNAL_ERROR results during syncing?
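For illustration, a bounded retry might look like the sketch below. It assumes the client goes through golang.org/x/net/http2 so the failure surfaces as that package's StreamError type; with the bundled HTTP/2 copy inside net/http the error type isn't exported, and one may have to fall back to matching the error string instead. The syncWithRetry wrapper and its retry limit are hypothetical, not maintner's actual API:

import (
	"errors"

	"golang.org/x/net/http2"
)

// syncWithRetry retries the given sync operation a bounded number
// of times, but only when it fails with an HTTP/2 stream error
// carrying the INTERNAL_ERROR code. (Hypothetical helper.)
func syncWithRetry(sync func() error) error {
	const maxRetries = 3
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = sync(); err == nil {
			return nil
		}
		var se http2.StreamError
		if !errors.As(err, &se) || se.Code != http2.ErrCodeInternal {
			return err // some other failure; don't mask it
		}
		// INTERNAL_ERROR from the peer; try the segment again.
	}
	return err
}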
(But this probably doesn't need to be a priority either way unless the failure rate increases. Mostly I've filed the issue in case it indicates some deeper systemic problem in conjunction with other maintner and/or HTTP/2 issues.)
Maintner already does retries at a higher level. I don't think we should add code to it to detect and handle an internal error coming from the net/http package, since that is out of scope. The real problem needs to be fixed in the HTTP layer, not in maintner, which is one of many programs using it.
We can possibly add the retry to TestCorpusCheck as a test-only workaround sooner.
greplogs -l -e 'getting corpus: syncing segment \d+: stream error: stream ID \d+; INTERNAL_ERROR' --since=2022-03-30
2022-05-12T21:45:28-4aa4d2e-6365efb/linux-amd64-longtest
2022-04-12T22:05:26-2897e13-d85694a/windows-amd64-longtest
2022-03-31T14:57:35-27fe37a-378221b/linux-amd64-longtest
2022-03-31T14:51:17-27fe37a-2ea9376/linux-amd64-longtest
(But still nothing to be done here until we make progress on #51323, I think.)
Based on the progress of the investigation in #51323, the conclusion is that the net/http package and the HTTP/2 protocol are working as intended; there isn't an internal error to be fixed elsewhere. The one real shortcoming is that the error message doesn't make it easy to see that the error comes from a remote server for a possibly unavoidable reason (such as backend HTTP servers restarting, with graceful HTTP/2 stream shutdown intentionally left out of scope).
Since maintner already does retries, the problem is only in the test, and adding a retry to the test will be a complete fix for this flaky-test issue (rather than a workaround, as we previously thought).
Change https://go.dev/cl/414174 mentions this issue: maintner: retry network operations that may fail in getNewSegments
I initially wanted to update this issue with the latest status, and then saw that it'd be quick to send a trivial CL to close it. All that was needed was a small change to the following flaky-test-generating logic in the two copies of the getGoData helper:
corpusCache, err = Get(context.Background())
if err != nil {
-	tb.Fatalf("getting corpus: %v", err)
+	// Occasionally hitting a non-zero number of network errors
+	// while downloading 2+ GB of data from the internet
+	// is NOT unexpected!
+	//
+	// Doing tb.Fatalf here means we're producing a non-actionable
+	// test failure, and while we haven't implemented go.dev/issue/19177 yet,
+	// any non-zero frequency of false-positive flaky test failures
+	// may create additional manual triage work...
+	//
+	// So just handle the error from godata.Get by
+	// trying again or skipping the test; either way it
+	// would likely be better than tb.Fatalf given the constraints above.
}
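Concretely, the replacement could look roughly like the sketch below, which slots into the getGoData helper shown above. The retry count and the choice to skip rather than fail after exhausting retries are my assumptions, not necessarily what the CL does:

corpusCache, err = Get(context.Background())
for attempt := 1; err != nil && attempt < 3; attempt++ {
	tb.Logf("getting corpus (attempt %d): %v", attempt, err)
	corpusCache, err = Get(context.Background())
}
if err != nil {
	// Still failing after retries; most likely a transient
	// network problem, so skip instead of reporting a
	// non-actionable failure.
	tb.Skipf("skipping: getting corpus: %v", err)
}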
While doing that, I tried to confirm my claim that our maintner-using programs handle retries (which I knew to be true without looking earlier, since they haven't needed manual intervention to keep running in a very long time). It turns out they all retry not just by reloading the corpus, but by fatally exiting the program and having k8s restart the pod. That is functional but feels computationally wasteful. I didn't want to add retry loops to the many godata.Get invocations in multiple x/build programs...
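(That fatal-exit-and-restart pattern is essentially the following; a sketch, not a verbatim excerpt from any of those programs:)

corpus, err := godata.Get(context.Background())
if err != nil {
	// Exiting fatally here leans on the Kubernetes restart
	// policy: the pod comes back up and godata.Get runs again
	// from scratch, which acts as the retry.
	log.Fatalf("godata.Get: %v", err)
}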
So that's the story of how I ended up with CL 414174. It felt easier to just detect possibly retryable network problems and retry them in netMutSource.getNewSegments, allowing the least amount of partial work to be thrown away unnecessarily due to an occasional network error. (I tried to split it into two smaller changes, but that didn't work out well: the changes in nesting intersected with both logical changes.)
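The shape of that change is roughly the sketch below. The fetchSegments helper, the retry bound, and the string-based error check are illustrative assumptions on my part; see CL 414174 for the actual code:

// Hypothetical shape of the retry loop; not the actual CL.
func (ns *netMutSource) getNewSegments(ctx context.Context) ([]fileSeg, error) {
	const maxTries = 3
	var lastErr error
	for try := 1; try <= maxTries; try++ {
		segs, err := ns.fetchSegments(ctx) // assumed helper doing the network work
		if err == nil {
			return segs, nil
		}
		if !possiblyRetryableNetworkError(err) {
			return nil, err
		}
		lastErr = err
		log.Printf("getNewSegments: possibly retryable error on try %d/%d: %v", try, maxTries, err)
	}
	return nil, lastErr
}

// possiblyRetryableNetworkError reports whether err looks like a
// transient network problem, such as an HTTP/2 stream error with
// code INTERNAL_ERROR from a restarting backend.
func possiblyRetryableNetworkError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "INTERNAL_ERROR")
}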
greplogs --dashboard -md -l -e 'getting corpus: syncing segment \d+: stream error: stream ID \d+; INTERNAL_ERROR'
2022-03-29T15:43:06-e96d8cf-ae9ce82/windows-amd64-longtest
2021-04-08T21:58:35-83a8520-d67e739/windows-amd64-longtest
2021-04-08T19:58:50-83a8520-bb76193/windows-amd64-longtest
2020-10-19T18:36:14-2476803-06839e3/linux-amd64-longtest
(CC @golang/release)