vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.72k stars 2.1k forks source link

Feature Request: ensure topo calls use `--topo_read_concurrency` #17275

Open timvaillancourt opened 5 days ago

timvaillancourt commented 5 days ago

Feature Description

The --topo_read_concurrency flag is intended (based on it's help message and usage in the code) to limit how many reads a Vitess component does concurrently to the topo

Unfortunately in https://github.com/vitessio/vitess/pull/14693 and https://github.com/vitessio/vitess/pull/17071 (and at least 2 x more PRs for VTOrc to come) Vitess code is not always respecting this concurrency limit and is simply launching N x goroutines with sync.WaitGroups

Another issue is sometimes methods do respect the --topo_read_concurrency limit, but only local to the the method execution, and that code is called concurrently, significantly exceeding the limit

Instead of having the "callers" of go/vt/topo methods (internal or external) deal with concurrency limits all over the code, this feature request discusses making the concurrency limit built-in to all read-related struct methods of go/vt/topo's Server struct

I believe if this is done correctly the concurrency limit should be respected by all methods, including if they're stacked or called repeatedly/concurrently in other methods/funcs

Your thoughts appreciated 🙇

Use Case(s)

Regular vitess usage

timvaillancourt commented 5 days ago

A question for all: the default --topo_read_concurrency is 32

Do we want both local and global topo calls to use that same limit, or should each have their own semaphore == to --topo_read_concurrency (each has 32)? I'm leaning towards the later but we should make it clear

deepthi commented 5 days ago

I'd lean towards each having their own semaphore. And yes, we should document it.

timvaillancourt commented 4 days ago

@deepthi sounds good, this is implemented here: https://github.com/vitessio/vitess/pull/17276

The "cell" topo reads share a single semaphore that defaults to 32. A drawback to this approach is if 1 x cell is really slow it may starve the semaphore for "cell" reads. One way to avoid this is to give each cell connection a new semaphore that is a subset of the total. I'm curious what your thoughts were?