Closed frouioui closed 4 days ago
Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.
release notes (needs details)
label if users need to know about this change.-
), and have a clear help text.Jobs
should be named in order to mark it as required
.required
, the maintainer team must be notified._vt
tables and RPCs need to be backward compatible.vtctl
command output order should be stable and awk
-able.Attention: Patch coverage is 41.72662%
with 81 lines
in your changes missing coverage. Please review.
Project coverage is 67.38%. Comparing base (
78f7db2
) to head (8797ec7
). Report is 8 commits behind head on main.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Adding Do Not Merge
as I want to re-run the workflows several times before merging this.
It does seem as though the last commit has started breaking one of the zk2 topo tests.
@mattlord and I have seen this test flakes in the past. I will merge this PR as the failure observed on the ZK2 test does not seem to be caused by this PR.
Description
This PR was motivated by what first looked like a flaky test,
TestRunFailsToStartTabletManager
was failing very often due to either a segfault or a race condition, both on thetopo.Server
. Turns out that this test is actually exposing a race condition that can only happen when the server is being closed and another goroutine is accessing thetopo.Server
's connections.Below are the two traces that were observed when running
TestRunFailsToStartTabletManager
:Segmentation fault
``` panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x11ca08a] goroutine 58 [running]: vitess.io/vitess/go/vt/topo.(*Lock).lock(0xc00098c420, {0x2dc5d08, 0xc0006b2030}, 0xc0005332c0, {0x2dbdbe0, 0xc0004a6030}, {0x0?, 0x10?, 0xc000086d40?}) /home/runner/work/vitess/vitess/go/vt/topo/locks.go:187 +0x4ca vitess.io/vitess/go/vt/topo.(*Server).internalLock(0xc0005332c0, {0x2dc5d40, 0xc00069c4b0}, {0x2dbdbe0, 0xc0004a6030}, {0x26f3105, 0xf}, {0x0, 0x0, 0x0}) /home/runner/work/vitess/vitess/go/vt/topo/locks.go:233 +0x26f vitess.io/vitess/go/vt/topo.(*Server).LockKeyspace(0xc0005332c0, {0x2dc5d40, 0xc00069c4b0}, {0x26db470, 0x2}, {0x26f3105, 0xf}, {0x0, 0x0, 0x0}) /home/runner/work/vitess/vitess/go/vt/topo/keyspace_lock.go:47 +0xd6 vitess.io/vitess/go/vt/topotools.RebuildKeyspace({0x2dc5d40, 0xc00069c4b0}, {0x2dd75c0, 0xc000749f00}, 0xc0005332c0, {0x26db470, 0x2}, {0xc000749f70, 0x1, 0x1}, ...) /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:34 +0xc5 vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).rebuildKeyspace(0xc0003fe1e0, {0x2dc5d40, 0xc00069c4b0}, 0xc000197eb0?, {0x26db470, 0x2}, 0x3b9aca00) /home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/tm_init.go:650 +0x22e created by vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).createKeyspaceShard in goroutine 39 /home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/tm_init.go:581 +0x590 ```Race condition
``` 2024-11-03T22:10:32.0591213Z ================== 2024-11-03T22:10:32.0591453Z WARNING: DATA RACE 2024-11-03T22:10:32.0591728Z Read at 0x00c000898000 by goroutine 62: 2024-11-03T22:10:32.0592161Z vitess.io/vitess/go/vt/topo.(*Server).GetCellInfo() 2024-11-03T22:10:32.0592797Z /home/runner/work/vitess/vitess/go/vt/topo/cell_info.go:62 +0x65 2024-11-03T22:10:32.0593348Z vitess.io/vitess/go/vt/topo.(*Server).ConnForCell() 2024-11-03T22:10:32.0594006Z /home/runner/work/vitess/vitess/go/vt/topo/server.go:275 +0x228 2024-11-03T22:10:32.0594586Z vitess.io/vitess/go/vt/topo.(*Server).UpdateSrvKeyspace() 2024-11-03T22:10:32.0595255Z /home/runner/work/vitess/vitess/go/vt/topo/srv_keyspace.go:630 +0x93 2024-11-03T22:10:32.0595900Z vitess.io/vitess/go/vt/topotools.RebuildKeyspaceLocked.func1() 2024-11-03T22:10:32.0596679Z /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:151 +0x132 2024-11-03T22:10:32.0597386Z vitess.io/vitess/go/vt/topotools.RebuildKeyspaceLocked.gowrap1() 2024-11-03T22:10:32.0598175Z /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:154 +0x61 2024-11-03T22:10:32.0598611Z 2024-11-03T22:10:32.0598772Z Previous write at 0x00c000898000 by goroutine 40: 2024-11-03T22:10:32.0599211Z vitess.io/vitess/go/vt/topo.(*Server).Close() 2024-11-03T22:10:32.0599995Z /home/runner/work/vitess/vitess/go/vt/topo/server.go:352 +0x137 2024-11-03T22:10:32.0600567Z vitess.io/vitess/go/cmd/vttablet/cli.run.deferwrap1() 2024-11-03T22:10:32.0601228Z /home/runner/work/vitess/vitess/go/cmd/vttablet/cli/cli.go:118 +0x33 2024-11-03T22:10:32.0601725Z runtime.deferreturn() 2024-11-03T22:10:32.0602228Z /opt/hostedtoolcache/go/1.23.2/x64/src/runtime/panic.go:605 +0x5d 2024-11-03T22:10:32.0602734Z github.com/spf13/cobra.(*Command).execute() 2024-11-03T22:10:32.0603378Z /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:985 +0x10f3 2024-11-03T22:10:32.0603940Z github.com/spf13/cobra.(*Command).ExecuteC() 2024-11-03T22:10:32.0604586Z /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117 +0x655 2024-11-03T22:10:32.0605134Z github.com/spf13/cobra.(*Command).Execute() 2024-11-03T22:10:32.0605778Z /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041 +0x5a9 2024-11-03T22:10:32.0606373Z github.com/spf13/cobra.(*Command).ExecuteContext() 2024-11-03T22:10:32.0607041Z /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1034 +0x5a4 2024-11-03T22:10:32.0607913Z vitess.io/vitess/go/cmd/vttablet/cli.TestRunFailsToStartTabletManager() 2024-11-03T22:10:32.0608706Z /home/runner/work/vitess/vitess/go/cmd/vttablet/cli/cli_test.go:56 +0x52d 2024-11-03T22:10:32.0609214Z testing.tRunner() 2024-11-03T22:10:32.0609716Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1690 +0x226 2024-11-03T22:10:32.0610206Z testing.(*T).Run.gowrap1() 2024-11-03T22:10:32.0610726Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1743 +0x44 2024-11-03T22:10:32.0611088Z 2024-11-03T22:10:32.0611209Z Goroutine 62 (running) created at: 2024-11-03T22:10:32.0611657Z vitess.io/vitess/go/vt/topotools.RebuildKeyspaceLocked() 2024-11-03T22:10:32.0612398Z /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:149 +0x884 2024-11-03T22:10:32.0613036Z vitess.io/vitess/go/vt/topotools.RebuildKeyspace() 2024-11-03T22:10:32.0613740Z /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:40 +0x286 2024-11-03T22:10:32.0614521Z vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).rebuildKeyspace() 2024-11-03T22:10:32.0615406Z /home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/tm_init.go:650 +0x3c4 2024-11-03T22:10:32.0616280Z vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).createKeyspaceShard.gowrap2() 2024-11-03T22:10:32.0617211Z /home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/tm_init.go:581 +0x8f 2024-11-03T22:10:32.0617662Z 2024-11-03T22:10:32.0617786Z Goroutine 40 (running) created at: 2024-11-03T22:10:32.0618107Z testing.(*T).Run() 2024-11-03T22:10:32.0618611Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1743 +0x825 2024-11-03T22:10:32.0619096Z testing.runTests.func1() 2024-11-03T22:10:32.0619620Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2168 +0x85 2024-11-03T22:10:32.0620079Z testing.tRunner() 2024-11-03T22:10:32.0620573Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1690 +0x226 2024-11-03T22:10:32.0621043Z testing.runTests() 2024-11-03T22:10:32.0621534Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2166 +0x8be 2024-11-03T22:10:32.0622169Z testing.(*M).Run() 2024-11-03T22:10:32.0622659Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2034 +0xf17 2024-11-03T22:10:32.0623102Z main.main() 2024-11-03T22:10:32.0623394Z _testmain.go:45 +0x164 2024-11-03T22:10:32.0623663Z ================== ```I have been through the code and find at least 3 other code paths that can expose a similar race condition too:
Proposed Solutions
I have explored several options to fix this issue:
topo.Server
, locking it whenever we need to Close it or use the connections. Dropped off this idea as it would impact performance negatively, specially for our gRPC servers that use topo.Server a lot and need fast concurrent operations. gRPC is not even affected by such panic/segfault, as when closing the server we first wait for all gRPC connections to close.topo.Server
we also close the topo connection and set it to nil, but in our implementations of the topo connections we don't check if the topo connection is nil or not before using it. For this reason and for a potential performance impact I gave up this idea.topo.Server
fields tonil
. This was given up for the same reason (on the topo connection implementation) explained in idea number 2.true
or increase a counter, theClose
method would only proceed when the boolean is set tofalse
or the counter is equal to 0, this would imply we add some sort of timeout to theClose
method to "force-close" after a certain time even if the not all usages of the connection are complete. This method would be a bit more invasive and require a decent amount of changes. Given that the impact of this issue is very small and only happens if the binary was going to stop anyway, I opted for option 5 instead.Close
method on thetopo.Server
, checking the context before accessing the topo connections should avoid a lot of our issues. I think this is a good first approach, if we observe more panics/segfault in the future we could implement option 4.This PR implements option 5, making sure we are checking the context every time we need to access the topo connection. It is not a 100% safe solution as some race condition may still happen, but: 1. they're already pretty unlikely, 2. this will safeguard a lot of potential race condition already.