Closed by jiceatscion 1 month ago
Papered over in #4605; we still need to investigate.
The patch helps only somewhat. I added a 10 s delay after await-connectivity. That's still not quite enough, so I added a triple retry on pings. That is enough for the test to succeed most of the time, but the retry only applies to the end2end_integration test. There's another test that keeps failing: scion_integration. That one doesn't have a retry option.
In the end we just need to figure out why it takes so long for path segments to become available.
So, it appears that the segments are available after all (give or take a small fix in await-connectivity). What makes the tests fail is "Deadline exceeded" errors when trying to fetch the segments. Following the breadcrumbs, I ended up seeing one CS making an RPC to another, and both of them (if memory serves) disappearing for several seconds in the middle of processing the request. So, of course, the 10 s client timeout eventually expires and the whole chain of RPCs fails.
Increasing the timeout doesn't fix it, so it seems that the hangs can last indefinitely, that is, until whatever timeout is configured expires.
Reading the release notes carefully, this is the only thing that stands out:
https://tip.golang.org/doc/go1.23#timer-changes
And indeed, there is a matching report in the Go issue tracker: https://github.com/golang/go/issues/69312 and a related pull request in the offending library: https://github.com/quic-go/quic-go/pull/4659
In the meantime, we can downgrade Go, or set GODEBUG="asynctimerchan=1".
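One way to check which timer implementation a binary actually runs with is to look at the capacity of a timer channel. According to the Go 1.23 release notes, timer channels are now unbuffered (capacity 0), while the old implementation used a buffer of 1. A small sketch, where `timerChanMode` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"time"
)

// timerChanMode reports which timer-channel behavior is active by
// inspecting the capacity of a fresh timer's channel.
func timerChanMode() string {
	t := time.NewTimer(time.Hour)
	defer t.Stop()
	if cap(t.C) == 0 {
		// New Go 1.23 behavior: unbuffered timer channel.
		return "synchronous"
	}
	// Older Go, a go.mod "go" line below 1.23, or asynctimerchan=1.
	return "asynchronous"
}

func main() {
	fmt.Println("timer channels:", timerChanMode())
}
```

Running this with `GODEBUG=asynctimerchan=1` in the environment should report the old behavior. As far as I know, the setting can also be pinned with a `//go:debug asynctimerchan=1` line in the main package or a `godebug` line in go.mod, but I have not verified either in our build.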
Downgrading the Go version shows very high reliability: https://buildkite.com/scionproto/scion/builds/4751
By comparison, builds without the downgrade have a success rate below 20%.