Open arvindbr8 opened 1 month ago
@arvindbr8 looks like there have been no reports for 2 weeks. Should we downgrade to P2?
We should measure the flakiness w/10k runs and assign priority accordingly. Also consider new-ness -- if the test is new we should treat with much higher priority since we don't generally want to allow flakiness.
no failures in 10k forge runs
FAILED in 106 out of 100000 http://fusion2/49630097-747f-425d-b94e-3d6519c01e67
Taking a look now, but I see that this test has flaked in G/A only twice in the year and a third since it was merged, so I don't think it's super urgent. The fact it only showed up in September of this year makes me think some bootstrap parsing code changed, as that's how the test specifies to create a new server. Weird that that would induce flakiness though, because the waitgroup doesn't end until after that is processed. Maybe there's a race between setting bootstrap and the server actually starting.
106/100k ≈ 1/1000, which is pretty low, but taking a look to see if we can find the root cause would be worth it.
Flaked 2x on my PR in a single snapshot:
https://github.com/grpc/grpc-go/actions/runs/11634120790/job/32400709681?pr=7798 https://github.com/grpc/grpc-go/actions/runs/11634120790/job/32400708629?pr=7798
Will take a look at this this week once I get some time. Looks like it's flaking a lot more now.
The fact that the first occurrence of this was in September of this year, over a year after the test was merged, makes me think it's racing with some of the xDS Client changes for fallback, with respect to the testing plumbing of xDS Bootstrap.
I'll take a look at this tomorrow.
Failure: https://github.com/grpc/grpc-go/actions/runs/11689787872/job/32553307641
Stack traces:
panic: test timed out after 7m0s
running tests:
Test (7m0s)
Test/ServeAndCloseDoNotRace (7m0s)
goroutine 369 [running]:
testing.(*M).startAlarm.func1()
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2373 +0x265
created by time.goFunc
/opt/hostedtoolcache/go/1.23.2/x64/src/time/sleep.go:215 +0x45
goroutine 1 [chan receive, 7 minutes]:
testing.(*T).Run(0xc00015a4e0, {0x17b7e62, 0x4}, 0x181c398)
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1751 +0x851
testing.runTests.func1(0xc00015a4e0)
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2168 +0x86
testing.tRunner(0xc00015a4e0, 0xc00006bae0)
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1690 +0x227
testing.runTests(0xc00012c2d0, {0x235b860, 0x2, 0x2}, {0x7fe8fec5c108?, 0x40?, 0x237b1c0?})
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2166 +0x8bf
testing.(*M).Run(0xc000320000)
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2034 +0xf18
main.main()
_testmain.go:49 +0x165
goroutine 4 [chan receive, 6 minutes]:
testing.(*T).Run(0xc00015a680, {0x14d6b6f, 0x16}, 0xc00019e440)
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1751 +0x851
google.golang.org/grpc/internal/grpctest.RunSubTests(0xc00015a680, {0x195c4e0, 0x239c600})
/home/runner/work/grpc-go/grpc-go/internal/grpctest/grpctest.go:114 +0x352
google.golang.org/grpc/xds.Test(0xc00015a680)
/home/runner/work/grpc-go/grpc-go/xds/server_test.go:65 +0x35
testing.tRunner(0xc00015a680, 0x181c398)
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1690 +0x227
created by testing.(*T).Run in goroutine 1
/opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1743 +0x826
goroutine 119 [semacquire, 6 minutes]:
sync.runtime_Semacquire(0xc00034a2a8?)
/opt/hostedtoolcache/go/1.23.2/x64/src/runtime/sema.go:71 +0x25
sync.(*WaitGroup).Wait(0xc00034a2a0)
/opt/hostedtoolcache/go/1.23.2/x64/src/sync/waitgroup.go:118 +0xa5
google.golang.org/grpc/xds.s.TestServeAndCloseDoNotRace({{}}, 0xc00036a000)
/home/runner/work/grpc-go/grpc-go/xds/server_test.go:715 +0x54b
google.golang.org/grpc/internal/grpctest.RunSubTests.func1(0xc00036a000)
/home/runner/work/grpc-go/grpc-go/internal/grpctest/grpctest.go:122 +0x10e
goroutine 183 [chan receive, 6 minutes]:
google.golang.org/grpc/xds/internal/xdsclient.(*authority).watchResource(0xc0005d42a0, {0x19617a0, 0xc000827c50}, {0xc000059e80, 0x3a}, {0x195fa00, 0xc000862780})
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/authority.go:483 +0x2ab
google.golang.org/grpc/xds/internal/xdsclient.(*clientImpl).WatchResource(0xc000536680, {0x19617a0, 0xc000827c50}, {0xc000059e80, 0x3a}, {0x195fa00, 0xc000862780})
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/clientimpl_watchers.go:66 +0xae5
google.golang.org/grpc/xds/internal/xdsclient/xdsresource.WatchListener(...)
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/xdsresource/listener_resource_type.go:185
google.golang.org/grpc/xds/internal/server.NewListenerWrapper({{0x195f6b8, 0xc00036c100}, {0xc000059e80, 0x3a}, {0x7fe8b4dacfa0, 0xc00047dc70}, 0xc000862760})
/home/runner/work/grpc-go/grpc-go/xds/internal/server/listener_wrapper.go:102 +0xc3d
google.golang.org/grpc/xds.(*GRPCServer).Serve(0xc000802b00, {0x195f6b8, 0xc00036c100})
/home/runner/work/grpc-go/grpc-go/xds/server.go:201 +0x457
google.golang.org/grpc/xds.s.TestServeAndCloseDoNotRace.func1()
/home/runner/work/grpc-go/grpc-go/xds/server_test.go:707 +0x50
created by google.golang.org/grpc/xds.s.TestServeAndCloseDoNotRace in goroutine 119
/home/runner/work/grpc-go/grpc-go/xds/server_test.go:706 +0x47f
goroutine 181 [chan receive, 6 minutes]:
google.golang.org/grpc/xds/internal/xdsclient.(*authority).watchResource(0xc0005d42a0, {0x19617a0, 0xc000827b60}, {0xc000059e00, 0x3a}, {0x195fa00, 0xc0008626c0})
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/authority.go:483 +0x2ab
google.golang.org/grpc/xds/internal/xdsclient.(*clientImpl).WatchResource(0xc000536680, {0x19617a0, 0xc000827b60}, {0xc000059e00, 0x3a}, {0x195fa00, 0xc0008626c0})
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/clientimpl_watchers.go:66 +0xae5
google.golang.org/grpc/xds/internal/xdsclient/xdsresource.WatchListener(...)
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/xdsresource/listener_resource_type.go:185
google.golang.org/grpc/xds/internal/server.NewListenerWrapper({{0x195f6b8, 0xc00036c100}, {0xc000059e00, 0x3a}, {0x7fe8b4dacfa0, 0xc00047dc70}, 0xc0008626a0})
/home/runner/work/grpc-go/grpc-go/xds/internal/server/listener_wrapper.go:102 +0xc3d
google.golang.org/grpc/xds.(*GRPCServer).Serve(0xc000802780, {0x195f6b8, 0xc00036c100})
/home/runner/work/grpc-go/grpc-go/xds/server.go:201 +0x457
google.golang.org/grpc/xds.s.TestServeAndCloseDoNotRace.func1()
/home/runner/work/grpc-go/grpc-go/xds/server_test.go:707 +0x50
created by google.golang.org/grpc/xds.s.TestServeAndCloseDoNotRace in goroutine 119
/home/runner/work/grpc-go/grpc-go/xds/server_test.go:706 +0x47f
goroutine 169 [chan receive, 6 minutes]:
google.golang.org/grpc/xds/internal/xdsclient.(*authority).watchResource(0xc0005d42a0, {0x19617a0, 0xc000827740}, {0xc000059b80, 0x3a}, {0x195fa00, 0xc0008622a0})
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/authority.go:483 +0x2ab
google.golang.org/grpc/xds/internal/xdsclient.(*clientImpl).WatchResource(0xc000536680, {0x19617a0, 0xc000827740}, {0xc000059b80, 0x3a}, {0x195fa00, 0xc0008622a0})
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/clientimpl_watchers.go:66 +0xae5
google.golang.org/grpc/xds/internal/xdsclient/xdsresource.WatchListener(...)
/home/runner/work/grpc-go/grpc-go/xds/internal/xdsclient/xdsresource/listener_resource_type.go:185
google.golang.org/grpc/xds/internal/server.NewListenerWrapper({{0x195f6b8, 0xc00036c100}, {0xc000059b80, 0x3a}, {0x7fe8b4dacfa0, 0xc00047dc70}, 0xc000862280})
/home/runner/work/grpc-go/grpc-go/xds/internal/server/listener_wrapper.go:102 +0xc3d
google.golang.org/grpc/xds.(*GRPCServer).Serve(0xc000595200, {0x195f6b8, 0xc00036c100})
/home/runner/work/grpc-go/grpc-go/xds/server.go:201 +0x457
google.golang.org/grpc/xds.s.TestServeAndCloseDoNotRace.func1()
/home/runner/work/grpc-go/grpc-go/xds/server_test.go:707 +0x50
created by google.golang.org/grpc/xds.s.TestServeAndCloseDoNotRace in goroutine 119
/home/runner/work/grpc-go/grpc-go/xds/server_test.go:706 +0x47f
A bunch of goroutines are stuck on authority.go:483 which reads from a channel: https://github.com/grpc/grpc-go/blob/0ec8fd84fdfb54f1b7f9c2d3d22aa20cd7a8cf09/xds/internal/xdsclient/authority.go#L483-L484
The change was introduced in https://github.com/grpc/grpc-go/pull/7773.
@easwars maybe we need to close the channel if the callback can't be scheduled because the serializer is closed? https://github.com/grpc/grpc-go/blob/0ec8fd84fdfb54f1b7f9c2d3d22aa20cd7a8cf09/xds/internal/xdsclient/authority.go#L430-L436
I think you are right. Let me send a PR. I also see another place which looks very similar to this.
Oh yeah, makes sense. Good find, Arjan.
Closing the done channel when unable to schedule the serializer callback fixes the failures where it takes 7m (the configured test timeout) to fail.
I'm now seeing another flake though:
grpc/xds/server_test.go:704: Failed to create an xDS enabled gRPC server: xDS client creation failed: xds: failed to get xDS bootstrap config: bootstrap environment variables ("GRPC_XDS_BOOTSTRAP" or "GRPC_XDS_BOOTSTRAP_CONFIG") not defined, and no fallback config set
This is weird and I'm trying to get to the bottom of this now.
Ok, I see what is going on. In each iteration, the test creates a server with server, err := NewGRPCServer(BootstrapContentsForTesting(...)) and then calls Serve and Stop on it.
When the server is created with a server option for bootstrap config, the server calls xdsclient.NewForTesting() to create an xDS client instead of calling xds.New(), which is what it would do in production code.
xdsclient.NewForTesting() calls bootstrap.SetFallbackBootstrapConfig to set the fallback bootstrap config to the one passed by the test, and it returns a cancel func which calls bootstrap.UnsetFallbackBootstrapConfigForTesting.
The above two operations basically race against each other, and a server creation could see that the bootstrap config is unset because a previous iteration erased it.
I'm still exploring options for the fix.
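The set/unset race can be replayed with a small sketch. This is a toy model, not the grpc-go code: the package-level fallback variable and the helper names are hypothetical, and the bad interleaving is replayed sequentially so the outcome is deterministic. It illustrates the problem only and does not propose a fix.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fallback is a toy stand-in for a process-wide fallback bootstrap config.
var (
	mu       sync.Mutex
	fallback []byte
)

func setFallback(cfg []byte) {
	mu.Lock()
	fallback = cfg
	mu.Unlock()
}

func unsetFallback() {
	mu.Lock()
	fallback = nil
	mu.Unlock()
}

// newClientForTesting sets the global fallback config and returns a cancel
// func that unsets it, mirroring the set/unset pair described above.
func newClientForTesting(cfg []byte) (cancel func()) {
	setFallback(cfg)
	return unsetFallback
}

// readConfig mimics the later step of client creation that reads the global
// config (an assumption made to show where the interleaving bites).
func readConfig() ([]byte, error) {
	mu.Lock()
	defer mu.Unlock()
	if fallback == nil {
		return nil, errors.New("no fallback config set")
	}
	return fallback, nil
}

func main() {
	cfg := []byte(`{"hypothetical": "bootstrap"}`)

	cancel1 := newClientForTesting(cfg) // iteration 1 sets the config
	cancel2 := newClientForTesting(cfg) // iteration 2 sets it again
	cancel1()                           // iteration 1 stops late and erases it
	_, err := readConfig()              // iteration 2 now finds nothing
	fmt.Println("iteration 2 error:", err)
	cancel2()
}
```

Every access to the shared variable is guarded by a mutex, so the race detector would stay quiet here; the problem is the ordering of operations on shared global state, not an unsynchronized access.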
pkg: xds
link: https://github.com/grpc/grpc-go/actions/runs/10810537881/job/29988003566?pr=7619