Open gvsg-rs opened 7 months ago
If we run against a setup that has a high number of tables, we can exceed the limit of threads and Readyset will panic:
Apr 11 17:20:35 ip-10-0-5-246 readyset[2686075]: Thread count overflowed the configured max count. Thread index = 4097, max threads = 4096.
Apr 11 17:20:35 ip-10-0-5-246 readyset[2686075]: thread 'Domain 1493.0.0' panicked at readyset-dataflow/src/domain/mod.rs:732:10:
Apr 11 17:20:35 ip-10-0-5-246 readyset[2686075]: called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(6925), ...)
A workaround is to limit the number of tables, either by:
Limiting the number of databases in the --upstream-db-url
Limiting the number of tables via --replication-tables or --replication-tables-ignore
Previously, we used tokio::task::block_in_place to run domains, which blocks the thread it's currently running on until the task completes. This prevents the executor from using that thread to make progress on any other tasks, which is not efficient. CL-1268 removed the block_in_place strategy by spawning a native OS thread for each domain instead, which is how the system works today. This is highly resource-inefficient and has also led to us hitting the upper limit on the number of threads with active tracing spans (4096).
A quote from that CL:
Generally speaking, blocking I/O bound work is very well-suited to tokio's built-in blocking threadpool. Further, it is now possible to configure the size of the blocking threadpool using the max_blocking_threads method on the runtime builder. We should re-investigate the performance of tokio's spawn_blocking method in the context of domains.
For work that is CPU-bound, we should consider using the rayon crate, which is typically the go-to tool for spawning blocking CPU-bound tasks.
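
To make the direction in that quote concrete, here is a minimal sketch of the general idea: run many units of work on a fixed number of OS threads instead of one thread per unit. It deliberately uses only the standard library (a channel plus worker threads) rather than tokio's blocking pool or rayon, and it is not Readyset code. The names `Job` and `run_bounded` are hypothetical, and the jobs are short closures, whereas real domains are long-running, so this only illustrates that the thread count stays capped regardless of how many jobs are submitted.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a unit of domain work; not a Readyset type.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Run all `jobs` on at most `max_threads` OS threads and return the
/// number of worker threads that were spawned.
fn run_bounded(jobs: Vec<Job>, max_threads: usize) -> usize {
    assert!(max_threads > 0, "need at least one worker thread");
    let workers = max_threads.min(jobs.len());

    // A single queue shared by all workers.
    let (tx, rx) = mpsc::channel::<Job>();
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so the lock is not held while the job runs.
                let next = rx.lock().unwrap().recv();
                match next {
                    Ok(job) => job(),
                    // The sender was dropped and the queue is empty.
                    Err(_) => break,
                }
            })
        })
        .collect();

    for job in jobs {
        tx.send(job).unwrap();
    }
    // Closing the channel lets the workers exit once the queue drains.
    drop(tx);

    for handle in handles {
        handle.join().unwrap();
    }
    workers
}

fn main() {
    let completed = Arc::new(AtomicUsize::new(0));

    // More jobs than the 4096-thread limit mentioned in this issue.
    let jobs: Vec<Job> = (0..5000)
        .map(|_| {
            let completed = Arc::clone(&completed);
            Box::new(move || {
                completed.fetch_add(1, Ordering::SeqCst);
            }) as Job
        })
        .collect();

    let spawned = run_bounded(jobs, 8);
    println!(
        "{} jobs completed on {} threads",
        completed.load(Ordering::SeqCst),
        spawned
    );
}
```

Because a long-running domain would occupy a worker for its whole lifetime, a fixed pool like this would only help if domains can yield between units of work, which is part of what would need investigating.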