vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.67k stars 2.1k forks source link

Limit rare race condition in `topo.Server` #17165

Closed frouioui closed 4 days ago

frouioui commented 1 week ago

Description

This PR was motivated by what first looked like a flaky test, TestRunFailsToStartTabletManager was failing very often due to either a segfault or a race condition, both on the topo.Server. Turns out that this test is actually exposing a race condition that can only happen when the server is being closed and another goroutine is accessing the topo.Server's connections.

Below are the two traces that were observed when running TestRunFailsToStartTabletManager:

Segmentation fault ``` panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x11ca08a] goroutine 58 [running]: vitess.io/vitess/go/vt/topo.(*Lock).lock(0xc00098c420, {0x2dc5d08, 0xc0006b2030}, 0xc0005332c0, {0x2dbdbe0, 0xc0004a6030}, {0x0?, 0x10?, 0xc000086d40?}) /home/runner/work/vitess/vitess/go/vt/topo/locks.go:187 +0x4ca vitess.io/vitess/go/vt/topo.(*Server).internalLock(0xc0005332c0, {0x2dc5d40, 0xc00069c4b0}, {0x2dbdbe0, 0xc0004a6030}, {0x26f3105, 0xf}, {0x0, 0x0, 0x0}) /home/runner/work/vitess/vitess/go/vt/topo/locks.go:233 +0x26f vitess.io/vitess/go/vt/topo.(*Server).LockKeyspace(0xc0005332c0, {0x2dc5d40, 0xc00069c4b0}, {0x26db470, 0x2}, {0x26f3105, 0xf}, {0x0, 0x0, 0x0}) /home/runner/work/vitess/vitess/go/vt/topo/keyspace_lock.go:47 +0xd6 vitess.io/vitess/go/vt/topotools.RebuildKeyspace({0x2dc5d40, 0xc00069c4b0}, {0x2dd75c0, 0xc000749f00}, 0xc0005332c0, {0x26db470, 0x2}, {0xc000749f70, 0x1, 0x1}, ...) /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:34 +0xc5 vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).rebuildKeyspace(0xc0003fe1e0, {0x2dc5d40, 0xc00069c4b0}, 0xc000197eb0?, {0x26db470, 0x2}, 0x3b9aca00) /home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/tm_init.go:650 +0x22e created by vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).createKeyspaceShard in goroutine 39 /home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/tm_init.go:581 +0x590 ```
Race condition ``` 2024-11-03T22:10:32.0591213Z ================== 2024-11-03T22:10:32.0591453Z WARNING: DATA RACE 2024-11-03T22:10:32.0591728Z Read at 0x00c000898000 by goroutine 62: 2024-11-03T22:10:32.0592161Z vitess.io/vitess/go/vt/topo.(*Server).GetCellInfo() 2024-11-03T22:10:32.0592797Z /home/runner/work/vitess/vitess/go/vt/topo/cell_info.go:62 +0x65 2024-11-03T22:10:32.0593348Z vitess.io/vitess/go/vt/topo.(*Server).ConnForCell() 2024-11-03T22:10:32.0594006Z /home/runner/work/vitess/vitess/go/vt/topo/server.go:275 +0x228 2024-11-03T22:10:32.0594586Z vitess.io/vitess/go/vt/topo.(*Server).UpdateSrvKeyspace() 2024-11-03T22:10:32.0595255Z /home/runner/work/vitess/vitess/go/vt/topo/srv_keyspace.go:630 +0x93 2024-11-03T22:10:32.0595900Z vitess.io/vitess/go/vt/topotools.RebuildKeyspaceLocked.func1() 2024-11-03T22:10:32.0596679Z /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:151 +0x132 2024-11-03T22:10:32.0597386Z vitess.io/vitess/go/vt/topotools.RebuildKeyspaceLocked.gowrap1() 2024-11-03T22:10:32.0598175Z /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:154 +0x61 2024-11-03T22:10:32.0598611Z 2024-11-03T22:10:32.0598772Z Previous write at 0x00c000898000 by goroutine 40: 2024-11-03T22:10:32.0599211Z vitess.io/vitess/go/vt/topo.(*Server).Close() 2024-11-03T22:10:32.0599995Z /home/runner/work/vitess/vitess/go/vt/topo/server.go:352 +0x137 2024-11-03T22:10:32.0600567Z vitess.io/vitess/go/cmd/vttablet/cli.run.deferwrap1() 2024-11-03T22:10:32.0601228Z /home/runner/work/vitess/vitess/go/cmd/vttablet/cli/cli.go:118 +0x33 2024-11-03T22:10:32.0601725Z runtime.deferreturn() 2024-11-03T22:10:32.0602228Z /opt/hostedtoolcache/go/1.23.2/x64/src/runtime/panic.go:605 +0x5d 2024-11-03T22:10:32.0602734Z github.com/spf13/cobra.(*Command).execute() 2024-11-03T22:10:32.0603378Z /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:985 +0x10f3 2024-11-03T22:10:32.0603940Z github.com/spf13/cobra.(*Command).ExecuteC() 2024-11-03T22:10:32.0604586Z /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117 +0x655 2024-11-03T22:10:32.0605134Z github.com/spf13/cobra.(*Command).Execute() 2024-11-03T22:10:32.0605778Z /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041 +0x5a9 2024-11-03T22:10:32.0606373Z github.com/spf13/cobra.(*Command).ExecuteContext() 2024-11-03T22:10:32.0607041Z /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1034 +0x5a4 2024-11-03T22:10:32.0607913Z vitess.io/vitess/go/cmd/vttablet/cli.TestRunFailsToStartTabletManager() 2024-11-03T22:10:32.0608706Z /home/runner/work/vitess/vitess/go/cmd/vttablet/cli/cli_test.go:56 +0x52d 2024-11-03T22:10:32.0609214Z testing.tRunner() 2024-11-03T22:10:32.0609716Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1690 +0x226 2024-11-03T22:10:32.0610206Z testing.(*T).Run.gowrap1() 2024-11-03T22:10:32.0610726Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1743 +0x44 2024-11-03T22:10:32.0611088Z 2024-11-03T22:10:32.0611209Z Goroutine 62 (running) created at: 2024-11-03T22:10:32.0611657Z vitess.io/vitess/go/vt/topotools.RebuildKeyspaceLocked() 2024-11-03T22:10:32.0612398Z /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:149 +0x884 2024-11-03T22:10:32.0613036Z vitess.io/vitess/go/vt/topotools.RebuildKeyspace() 2024-11-03T22:10:32.0613740Z /home/runner/work/vitess/vitess/go/vt/topotools/rebuild_keyspace.go:40 +0x286 2024-11-03T22:10:32.0614521Z vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).rebuildKeyspace() 2024-11-03T22:10:32.0615406Z /home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/tm_init.go:650 +0x3c4 2024-11-03T22:10:32.0616280Z vitess.io/vitess/go/vt/vttablet/tabletmanager.(*TabletManager).createKeyspaceShard.gowrap2() 2024-11-03T22:10:32.0617211Z /home/runner/work/vitess/vitess/go/vt/vttablet/tabletmanager/tm_init.go:581 +0x8f 2024-11-03T22:10:32.0617662Z 2024-11-03T22:10:32.0617786Z Goroutine 40 (running) created at: 2024-11-03T22:10:32.0618107Z testing.(*T).Run() 2024-11-03T22:10:32.0618611Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1743 +0x825 2024-11-03T22:10:32.0619096Z testing.runTests.func1() 2024-11-03T22:10:32.0619620Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2168 +0x85 2024-11-03T22:10:32.0620079Z testing.tRunner() 2024-11-03T22:10:32.0620573Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:1690 +0x226 2024-11-03T22:10:32.0621043Z testing.runTests() 2024-11-03T22:10:32.0621534Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2166 +0x8be 2024-11-03T22:10:32.0622169Z testing.(*M).Run() 2024-11-03T22:10:32.0622659Z /opt/hostedtoolcache/go/1.23.2/x64/src/testing/testing.go:2034 +0xf17 2024-11-03T22:10:32.0623102Z main.main() 2024-11-03T22:10:32.0623394Z _testmain.go:45 +0x164 2024-11-03T22:10:32.0623663Z ================== ```

I have been through the code and find at least 3 other code paths that can expose a similar race condition too:

vtgate SQL handler -> StreamExecute -> StreamExecutePrimitive -> TryStreamExecute -> VStream -> ExecuteVStream -> startVStream -> stream -> startOneStream -> (go) streamFromTablet -> PickForStreaming -> GetMatchingTablets -> GetCellInfo
vtctld cli -> initSchema -> schemamanager.Run -> executor.Execute -> exec.executeSQL -> exec.executeAlterMigrationThrottle -> exec.ts.UpdateSrvKeyspaceThrottlerConfig -> (go) ts.GetSrvKeyspace -> ts.GetSrvKeyspace -> ts.ConnForCell -> ts.GetCellInfo
vtorc cli -> StartVTOrcDiscovery -> (go) ContinuousDiscovery -> (go) refreshAllTablets -> refreshTabletsUsing -> (go) refreshTabletsInKeyspaceShard -> ts.GetTabletsByShard -> ts.GetTabletsByShardCell -> ts.GetCellInfoNames

Proposed Solutions

I have explored several options to fix this issue:

  1. Have a global mutex in topo.Server, locking it whenever we need to Close it or use the connections. Dropped off this idea as it would impact performance negatively, specially for our gRPC servers that use topo.Server a lot and need fast concurrent operations. gRPC is not even affected by such panic/segfault, as when closing the server we first wait for all gRPC connections to close.
  2. Create a new method that returns the connection. The method would acquire a lock and release it when returning, we would make a copy of the connection, ensuring the same field wouldn't be accessed twice at the same time. However, this expose another issue: when closing topo.Server we also close the topo connection and set it to nil, but in our implementations of the topo connections we don't check if the topo connection is nil or not before using it. For this reason and for a potential performance impact I gave up this idea.
  3. Simply not set the topo.Server fields to nil. This was given up for the same reason (on the topo connection implementation) explained in idea number 2.
  4. Have an atomic counter or boolean. Everytime we need to use the connections we would set a boolean to true or increase a counter, the Close method would only proceed when the boolean is set to false or the counter is equal to 0, this would imply we add some sort of timeout to the Close method to "force-close" after a certain time even if the not all usages of the connection are complete. This method would be a bit more invasive and require a decent amount of changes. Given that the impact of this issue is very small and only happens if the binary was going to stop anyway, I opted for option 5 instead.
  5. Make sure we check the context we are given before accessing the topo connections. This is something that we are currently not always doing. We cancel the context before calling the Close method on the topo.Server, checking the context before accessing the topo connections should avoid a lot of our issues. I think this is a good first approach, if we observe more panics/segfault in the future we could implement option 4.

This PR implements option 5, making sure we are checking the context every time we need to access the topo connection. It is not a 100% safe solution as some race condition may still happen, but: 1. they're already pretty unlikely, 2. this will safeguard a lot of potential race condition already.

vitess-bot[bot] commented 1 week ago

Review Checklist

Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.

General

Tests

Documentation

New flags

If a workflow is added or modified:

Backward compatibility

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 41.72662% with 81 lines in your changes missing coverage. Please review.

Project coverage is 67.38%. Comparing base (78f7db2) to head (8797ec7). Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/topo/vschema.go 38.88% 22 Missing :warning:
go/vt/topo/keyspace.go 42.85% 12 Missing :warning:
go/vt/topo/external_vitess_cluster.go 26.66% 11 Missing :warning:
go/vt/topo/cell_info.go 37.50% 10 Missing :warning:
go/vt/topo/cells_aliases.go 46.66% 8 Missing :warning:
go/vt/topo/metadata.go 33.33% 8 Missing :warning:
go/vt/topo/shard.go 46.66% 8 Missing :warning:
go/vt/topo/locks.go 66.66% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #17165 +/- ## ========================================== + Coverage 67.32% 67.38% +0.06% ========================================== Files 1569 1569 Lines 252552 252681 +129 ========================================== + Hits 170032 170280 +248 + Misses 82520 82401 -119 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

frouioui commented 1 week ago

Adding Do Not Merge as I want to re-run the workflows several times before merging this.

frouioui commented 4 days ago

It does seem as though the last commit has started breaking one of the zk2 topo tests.

@mattlord and I have seen this test flakes in the past. I will merge this PR as the failure observed on the ZK2 test does not seem to be caused by this PR.