scionproto / scion

SCION Internet Architecture
https://scion.org
Apache License 2.0

build: the end-2-end tests have become extremely flaky #4606

Closed jiceatscion closed 1 month ago

jiceatscion commented 2 months ago

They succeed less than 20% of the time.

oncilla commented 1 month ago

Papered over in #4605, but we still need to investigate.

jiceatscion commented 1 month ago

The patch helps only somewhat. I added a 10 s delay after await-connectivity. That alone isn't quite enough, so I also added a triple retry on pings. That is enough for the test to succeed most of the time, but the retry only applies to the end2end_integration test. Another test keeps failing: scion_integration. That one doesn't have a retry option.

In the end we just need to figure out why it takes so long for path segments to become available.

jiceatscion commented 1 month ago

So, it appears that the segments are available after all (give or take a small fix in await-connectivity). What makes the tests fail is "deadline exceeded" errors when trying to fetch the segments. Following the breadcrumbs, I ended up seeing one CS making an RPC to another, and both of them (if memory serves) disappearing for several seconds in the middle of processing the request. So, of course, the 10 s client timeout eventually expires and the whole chain of RPCs fails.

Increasing the timeout doesn't fix it, so it seems that the hangs can last indefinitely, or at least until the timeout expires.

oncilla commented 1 month ago

Reading the Go 1.23 release notes carefully, this is the only thing that stands out:

https://tip.golang.org/doc/go1.23#timer-changes

And indeed, there is a matching report in the Go issue tracker: https://github.com/golang/go/issues/69312 and a PR in the offending library: https://github.com/quic-go/quic-go/pull/4659

In the meantime, we can downgrade Go, or set GODEBUG=asynctimerchan=1.

Downgrading the Go version gives very high reliability: https://buildkite.com/scionproto/scion/builds/4751
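A quick way to check which timer behavior a binary actually runs with is to look at the capacity of a timer channel. The Go 1.23 release notes describe timer channels as unbuffered (capacity 0) when the main module's go.mod declares Go 1.23.0 or later, and as keeping the old buffered behavior under asynctimerchan=1. The sketch below assumes that description; `timerChanCap` is a made-up helper name.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// timerChanCap reports the capacity of a fresh timer channel. Per the
// Go 1.23 release notes this should be 0 with the new synchronous timers
// and 1 with the old behavior (older Go, or asynctimerchan=1).
func timerChanCap() int {
	t := time.NewTimer(time.Hour)
	defer t.Stop()
	return cap(t.C)
}

func main() {
	fmt.Printf("GODEBUG=%q timer channel capacity=%d\n",
		os.Getenv("GODEBUG"), timerChanCap())
}
```

Running it once as is and once with GODEBUG=asynctimerchan=1 in the environment should show the difference. Besides the environment variable, a `//go:debug asynctimerchan=1` directive in the main package is another way to pin the setting at build time.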