grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0

Flaky test package: test/xds #6914

Open zasweq opened 8 months ago

zasweq commented 8 months ago

Alongside #6913 and #6912, I have run the test/xds suite on master since I added tests to it for my xDS Server fix #6889. I have encountered numerous flakes on g3, particularly those outlined in custom lb tests for distribution #6601. However, I have seen almost every client and server side xDS test flake with a context timeout for an RPC expected to proceed. Each has different logs/events preceding its timeout, but every test seems susceptible to timeout. The flakes are generally rare, but due to the number of tests in the test suite you can reliably trigger one by running the full test suite enough times. My initial inkling is that there's some synchronization needed or something gets stuck in the management server/testing xDS Client flow. This also manifests in rare flakes for my xDS Server fix, where I expect something like an err that represents Accept and Close, and I get a context timeout instead.

arvindbr8 commented 8 months ago

another one for TestServerSideXDS_WithValidAndInvalidSecurityConfiguration: https://github.com/grpc/grpc-go/actions/runs/7480796959/job/20361025267?pr=6916

zasweq commented 7 months ago

https://github.com/grpc/grpc-go/actions/runs/7618526481/job/20749870802?pr=6933

arvindbr8 commented 7 months ago

https://github.com/grpc/grpc-go/actions/runs/7716359885/job/21032969470?pr=6949

arvindbr8 commented 7 months ago

https://github.com/grpc/grpc-go/actions/runs/7791646589/job/21248143511?pr=6965

zasweq commented 5 months ago

https://github.com/grpc/grpc-go/actions/runs/8546803299/job/23417838293?pr=7085

arjan-bal commented 2 months ago

https://github.com/grpc/grpc-go/actions/runs/9790522733/job/27032314481?pr=7390

arjan-bal commented 2 months ago

@zasweq I investigated this, and the problem seems to be that the xDS management server gets stuck while writing to this buffered channel: https://github.com/grpc/grpc-go/blob/d27ddb5eb5940c949f88bc2cb21eed9254f8be75/test/xds/xds_server_certificate_providers_test.go#L249

In the logs of failing runs for TestServerSideXDS_WithValidAndInvalidSecurityConfiguration, I noticed that the resource snapshot update request is sent to the xDS management server before the xDS client is able to connect to it. This somehow results in more than one Listener request being sent to the management server, and the extra requests get stuck waiting to write to the buffered channel.

This seems to be a problem with the test and not the implementation. Adding a 50 ms sleep after starting both servers did get rid of the flakiness in TestServerSideXDS_WithValidAndInvalidSecurityConfiguration.

zasweq commented 2 months ago

Ah nice thank you for figuring this out!

arvindbr8 commented 1 month ago

https://github.com/grpc/grpc-go/actions/runs/9996172593/job/27629932247?pr=7397

zasweq commented 1 month ago

You mentioned this solved the test, but not the flakes in the full package. This was my flaky test in this PR so thanks for fixing this: https://github.com/grpc/grpc-go/actions/runs/10050840269/job/27779434995?pr=7434 :).