oxidecomputer / omicron

Omicron: Oxide control plane

ClickHouse on Dogfood appears unable to spawn threads for connections #6923

Closed by bnaecker 1 month ago

bnaecker commented 1 month ago

I logged into the ClickHouse zone on the dogfood rack to experiment with something, and the official ClickHouse client CLI appears unable to make connections to the server:

root@oxz_clickhouse_aa646c82:~# clickhouse client --host fd00:1122:3344:107::4 --database oximeter
ClickHouse client version 23.8.7.1.
Connecting to database oximeter at fd00:1122:3344:107::4:9000 as user default.
Code: 210. DB::NetException: Net Exception: Connection reset by peer (fd00:1122:3344:107::4:9000). (NETWORK_ERROR)

The service appears to be running, but tailing the log greets one with a spew of this:

Got exception while starting thread for connection. Error code: 0, message: 'No thread available'
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'
2024.10.23 21:39:23.981922 [ 373 ] {} <Trace> HTTPHandler-factory: HTTP Request for HTTPHandler-factory. Method: GET, Address: [fd00:1122:3344:10a::3]:45228, User-Agent: (none), Content Type: , Transfer Encoding: identity, X-Forwarded-For: (none)
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'
Got exception while starting thread for connection. Error code: 0, message: 'No thread available'

ClickHouse spawns a new thread for each client connection, and it's apparently unable to do that here. I worry this is due to the connection pooling we've been adding recently, either on its own or in combination with the OxQL queries we're making from outside. I don't have any other details at this point.

bnaecker commented 1 month ago

Just noting that this is causing oximeter to drop some samples too, because the database appears unavailable:

21:45:34.775Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB: Telemetry database unavailable: error sending request for url (http://[fd00:1122:3344:107::4]:8123/?output_format_json_quote_64bit_integers=0&wait_end_of_query=1)
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:343
bnaecker commented 1 month ago

The first question I have is: "who is making connections to ClickHouse?"

Looking at connstat, we see an enormous number of connections:

root@oxz_clickhouse_aa646c82:~# connstat | wc -l
   21466

Those appear to be almost exclusively from one peer:

root@oxz_clickhouse_aa646c82:~# connstat -P -o raddr | sort | uniq -c
   1 0.0.0.0
   2 ::
21461 fd00:1122:3344:103::3
 194 fd00:1122:3344:10a::3

Ok, who is that dominant peer? From the switch zone we can use omdb to see which sled owns that /64:

root@oxz_switch0:~# omdb db sleds
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (110.0.0)
note: listing all commissioned sleds (use -F to filter, e.g. -F in-service)
 SERIAL       IP                             ROLE      POLICY             STATE   ID
 BRM42220057  [fd00:1122:3344:104::1]:12345  -         in service         active  0c7011f7-a4bf-4daf-90cc-1c2410103300
 BRM44220011  [fd00:1122:3344:103::1]:12345  -         in service         active  2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
 BRM42220017  [fd00:1122:3344:109::1]:12345  -         in service         active  5f6720b8-8a31-45f8-8c94-8e699218f28b
 BRM42220051  [fd00:1122:3344:10b::1]:12345  scrimlet  in service         active  71def415-55ad-46b4-ba88-3ca55d7fb287
 BRM44220010  [fd00:1122:3344:101::1]:12345  -         in service         active  7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c
 BRM42220014  [fd00:1122:3344:108::1]:12345  scrimlet  in service         active  87c2c4fc-b0c7-4fef-a305-78f0ed265bbc
 BRM42220031  [fd00:1122:3344:102::1]:12345  -         in service         active  a2adea92-b56e-44fc-8a0d-7d63b5fd3b93
 BRM42220006  [fd00:1122:3344:106::1]:12345  -         in service         active  b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
 BRM27230045  [fd00:1122:3344:123::1]:12345  -         not provisionable  active  bd96ef7c-4941-4729-b6f7-5f47feecbc4b
 BRM42220009  [fd00:1122:3344:107::1]:12345  -         not provisionable  active  db183874-65b5-4263-a1c1-ddb2737ae0e9
 BRM42220016  [fd00:1122:3344:10a::1]:12345  -         in service         active  dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd
 BRM44220005  [fd00:1122:3344:105::1]:12345  -         in service         active  f15774c1-b8e5-434f-a493-ec43f96cba06

So that appears to be from BRM44220011. That's running a Nexus zone, which is one of the primary peers I'd expect to be talking to the database:

root@oxz_switch0:~# omdb sled-agent zones list --sled-agent-url http://[fd00:1122:3344:103::1]:12345
zones:
    "oxz_crucible_01f93020-7e7d-4185-93fb-6ca234056c82"
    "oxz_crucible_072fdae8-2adf-4fd2-94ce-e9b0663b91e7"
    "oxz_crucible_0b41c560-3b20-42f4-82ad-92f5bb575d6b"
    "oxz_crucible_0ccf27c0-e32d-4b52-a2c5-6db0c64a26f9"
    "oxz_crucible_585cd8c5-c41e-4be4-beb8-bfbef9b53856"
    "oxz_crucible_7153983f-8fd7-4fb9-92ac-0f07a07798b4"
    "oxz_crucible_7d44ba36-4a69-490a-bc40-f6f90a4208d4"
    "oxz_crucible_a6ba8273-0320-4dab-b801-281f041b0c50"
    "oxz_crucible_b9b7b4c2-284a-4ec1-80ea-75b7a43b71c4"
    "oxz_crucible_e238116d-e5cc-43d4-9c8a-6f138ae8a15d"
    "oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7"
    "oxz_ntp_7a85d50e-b524-41c1-a052-118027eb77db"
    "oxz_propolis-server_6e637a89-2bb1-4327-89ac-6f55757fd3ec"
    "oxz_propolis-server_770f0d5f-5f87-4d8d-a4d7-7b16602f2408"

And checking from inside that zone, we see that Nexus does in fact have that address (there may be a better way to answer this question directly from omdb):

root@oxz_nexus_65a11c18:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
opte0/public      dhcp     ok           172.30.2.5/32
lo0/v6            static   ok           ::1/128
oxControlService11/ll addrconf ok       fe80::8:20ff:fea0:7894%oxControlService11/10
oxControlService11/omicron6 static ok   fd00:1122:3344:103::3/64

I'm looking at the Nexus zone now to see who's making those.

bnaecker commented 1 month ago

So tracing the connection initiation requests, we see this:

root@oxz_nexus_65a11c18:~# dtrace -n 'syscall::connect:entry { @s[ustack()] = count() }' -n 'tick-5s { printa(@s); exit(0); }' | demangle
dtrace: description 'syscall::connect:entry ' matched 1 probe
dtrace: description 'tick-5s ' matched 1 probe
CPU     ID                    FUNCTION:NAME
102  83706                         :tick-5s

              libc.so.1`__so_connect+0xa
              libsocket.so.1`__xnet_connect+0x20
              nexus`socket2::socket::Socket::connect::ha84b246a25a25bc2+0x11
              nexus`tokio::net::tcp::socket::TcpSocket::connect::_$u7b$$u7b$closure$u7d$$u7d$::hee2064ce65f35984 (.llvm.493235490398768526)+0x8c
              nexus`<tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll::h779dec244837c2d8+0x57
              nexus`_$LT$hyper_util..client..legacy..connect..http..HttpConnector$LT$R$GT$$u20$as$u20$tower_service..Service$LT$http..uri..Uri$GT$$GT$::call::_$u7b$$u7b$closure$u7d$$u7d$::hf2dfdfbd1d841723 (.llvm.493235490398768526)+0x15ca
              nexus`<hyper_tls::client::HttpsConnector<T> as tower_service::Service<http::uri::Uri>>::call::{{closure}}::he380e93710428451+0xa7
              nexus`reqwest::connect::Connector::connect_with_maybe_proxy::_$u7b$$u7b$closure$u7d$$u7d$::ha0008491dfc374f5 (.llvm.493235490398768526)+0x506
              nexus`<tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll::hba69ffa88c019da0+0x51
              nexus`reqwest::connect::with_timeout::{{closure}}::he9b72e11e4c21b2b+0x1eb
              nexus`<hyper_util::service::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll::h776641a5f2736c96+0xc6
              nexus`<futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll::h455634d23bea4049+0x46
              nexus`<futures_util::future::try_future::try_flatten::TryFlatten<Fut,<Fut as futures_core::future::TryFuture>::Ok> as core::future::future::Future>::poll::h73f61a9fee7470bc+0x8a
              nexus`<hyper_util::common::lazy::Lazy<F,R> as core::future::future::Future>::poll::he69c75153b48ad46+0x57d
              nexus`<futures_util::future::select::Select<A,B> as core::future::future::Future>::poll::h4055650920fb23af+0x4d
              nexus`hyper_util::client::legacy::client::Client<C,B>::send_request::{{closure}}::h0ef69981ad9dd2bd+0x9d2
              nexus`<hyper_util::client::legacy::client::ResponseFuture as core::future::future::Future>::poll::h04f6f02c685e038a+0x16
              nexus`<reqwest::async_impl::client::PendingRequest as core::future::future::Future>::poll::h46db6dfd6289a8cb+0x117
              nexus`<reqwest::async_impl::client::Pending as core::future::future::Future>::poll::hc96d0bb5b15cbe72+0x39
              nexus`nexus_inventory::collector::Collector::collect_one_mgs::{{closure}}::h643aa9efb0b432a9+0x6ee
                1

              libc.so.1`__so_connect+0xa
              libsocket.so.1`__xnet_connect+0x20
              nexus`socket2::socket::Socket::connect::ha84b246a25a25bc2+0x11
              nexus`tokio::net::tcp::socket::TcpSocket::connect::_$u7b$$u7b$closure$u7d$$u7d$::hee2064ce65f35984 (.llvm.493235490398768526)+0x8c
              nexus`_$LT$hyper_util..client..legacy..connect..http..HttpConnector$LT$R$GT$$u20$as$u20$tower_service..Service$LT$http..uri..Uri$GT$$GT$::call::_$u7b$$u7b$closure$u7d$$u7d$::hf2dfdfbd1d841723 (.llvm.493235490398768526)+0x13c3
              nexus`<hyper_tls::client::HttpsConnector<T> as tower_service::Service<http::uri::Uri>>::call::{{closure}}::he380e93710428451+0xa7
              nexus`reqwest::connect::Connector::connect_with_maybe_proxy::{{closure}}::ha0008491dfc374f5+0x59f
              nexus`reqwest::connect::with_timeout::{{closure}}::he9b72e11e4c21b2b+0x89
              nexus`<hyper_util::service::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll::h776641a5f2736c96+0xc6
              nexus`<futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll::h455634d23bea4049+0x46
              nexus`<futures_util::future::try_future::try_flatten::TryFlatten<Fut,<Fut as futures_core::future::TryFuture>::Ok> as core::future::future::Future>::poll::h73f61a9fee7470bc+0x8a
              nexus`<hyper_util::common::lazy::Lazy<F,R> as core::future::future::Future>::poll::he69c75153b48ad46+0x57d
              nexus`<futures_util::future::select::Select<A,B> as core::future::future::Future>::poll::h4055650920fb23af+0x4d
              nexus`hyper_util::client::legacy::client::Client<C,B>::send_request::{{closure}}::h0ef69981ad9dd2bd+0x9d2
              nexus`<hyper_util::client::legacy::client::ResponseFuture as core::future::future::Future>::poll::h04f6f02c685e038a+0x16
              nexus`<reqwest::async_impl::client::PendingRequest as core::future::future::Future>::poll::h46db6dfd6289a8cb+0x117
              nexus`<reqwest::async_impl::client::Pending as core::future::future::Future>::poll::hc96d0bb5b15cbe72+0x39
              nexus`oximeter_db::client::Client::execute_with_body::{{closure}}::ha20c0c426735d9b9+0x5ec
              nexus`oximeter_db::client::Client::get_schema_locked::{{closure}}::h99bc77900a38faa6+0x309
              nexus`oximeter_db::client::oxql::<impl oximeter_db::client::Client>::run_oxql_query::{{closure}}::h8bdcd15ee41178e1+0x67b
                2

              libc.so.1`__so_connect+0xa
              libsocket.so.1`__xnet_connect+0x20
              nexus`mio::net::tcp::stream::TcpStream::connect::h23c92d69661bdb3c+0xaf
              nexus`<oximeter_db::native::connection::Connector as qorb::backend::Connector>::connect::{{closure}}::h9f6f0997ccc25014+0x1177
              nexus`<tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hfcb72d789ac943e2+0x2b7
              nexus`qorb::slot::SetWorker$LT$Conn$GT$::create_slot::_$u7b$$u7b$closure$u7d$$u7d$::h69deaa67b1405c05 (.llvm.346072908401228557)+0x185
              nexus`tokio::runtime::task::core::Core<T,S>::poll::h51e92019f087b50f+0x35
              nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h3f59c59ae8a3f4fa+0x5b
              nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h414163fa73b11f5b+0x26f
              nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run::h6212db222605c535+0xa0b
              nexus`tokio::runtime::context::set_scheduler::hc7592537afd48faa+0x52
              nexus`tokio::runtime::context::runtime::enter_runtime::hb6491214e4ad4d7b+0x177
              nexus`tokio::runtime::scheduler::multi_thread::worker::run::h2e32b967ca57a033+0x78
              nexus`tokio::runtime::task::core::Core<T,S>::poll::h776a0cba30c2ddf9+0x70
              nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h9ed75bd2bd7fab63+0xab
              nexus`tokio::runtime::blocking::pool::Inner::run::ha60b637057eda82e+0xe4
              nexus`std::sys_common::backtrace::__rust_begin_short_backtrace::h9652bd4e8e39d174+0x3f
              nexus`core::ops::function::FnOnce::call_once{{vtable.shim}}::hacd16954d89ded5d+0x7e
              nexus`std::sys::pal::unix::thread::Thread::new::thread_start::ha88bab0a6c5aedb8+0x1b
              libc.so.1`_thrp_setup+0x77
                3

              libc.so.1`__so_connect+0xa
              libsocket.so.1`__xnet_connect+0x20
              nexus`socket2::socket::Socket::connect::ha84b246a25a25bc2+0x11
              nexus`tokio::net::tcp::socket::TcpSocket::connect::_$u7b$$u7b$closure$u7d$$u7d$::hee2064ce65f35984 (.llvm.493235490398768526)+0x8c
              nexus`<tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll::h779dec244837c2d8+0x57
              nexus`_$LT$hyper_util..client..legacy..connect..http..HttpConnector$LT$R$GT$$u20$as$u20$tower_service..Service$LT$http..uri..Uri$GT$$GT$::call::_$u7b$$u7b$closure$u7d$$u7d$::hf2dfdfbd1d841723 (.llvm.493235490398768526)+0x15ca
              nexus`<hyper_tls::client::HttpsConnector<T> as tower_service::Service<http::uri::Uri>>::call::{{closure}}::he380e93710428451+0xa7
              nexus`reqwest::connect::Connector::connect_with_maybe_proxy::_$u7b$$u7b$closure$u7d$$u7d$::ha0008491dfc374f5 (.llvm.493235490398768526)+0x506
              nexus`<tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll::hba69ffa88c019da0+0x51
              nexus`reqwest::connect::with_timeout::{{closure}}::he9b72e11e4c21b2b+0x1eb
              nexus`<hyper_util::service::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll::h776641a5f2736c96+0xc6
              nexus`<futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll::h455634d23bea4049+0x46
              nexus`<futures_util::future::try_future::try_flatten::TryFlatten<Fut,<Fut as futures_core::future::TryFuture>::Ok> as core::future::future::Future>::poll::h73f61a9fee7470bc+0x8a
              nexus`<hyper_util::common::lazy::Lazy<F,R> as core::future::future::Future>::poll::he69c75153b48ad46+0x57d
              nexus`<futures_util::future::select::Select<A,B> as core::future::future::Future>::poll::h4055650920fb23af+0x4d
              nexus`hyper_util::client::legacy::client::Client<C,B>::send_request::{{closure}}::h0ef69981ad9dd2bd+0x9d2
              nexus`<hyper_util::client::legacy::client::ResponseFuture as core::future::future::Future>::poll::h04f6f02c685e038a+0x16
              nexus`<reqwest::async_impl::client::PendingRequest as core::future::future::Future>::poll::h46db6dfd6289a8cb+0x117
              nexus`<reqwest::async_impl::client::Pending as core::future::future::Future>::poll::hc96d0bb5b15cbe72+0x39
              nexus`<omicron_nexus::app::background::tasks::v2p_mappings::V2PManager as omicron_nexus::app::background::BackgroundTask>::activate::{{closure}}::he4157ee71ece5d3d+0x10eb
               12

              libc.so.1`__so_connect+0xa
              libsocket.so.1`__xnet_connect+0x20
              nexus`mio::net::tcp::stream::TcpStream::connect::h23c92d69661bdb3c+0xaf
              nexus`<oximeter_db::native::connection::Connector as qorb::backend::Connector>::connect::{{closure}}::h9f6f0997ccc25014+0x1177
              nexus`<tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hfcb72d789ac943e2+0x2b7
              nexus`qorb::slot::SetWorker$LT$Conn$GT$::create_slot::_$u7b$$u7b$closure$u7d$$u7d$::h69deaa67b1405c05 (.llvm.346072908401228557)+0x185
              nexus`tokio::runtime::task::core::Core<T,S>::poll::h51e92019f087b50f+0x35
              nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h3f59c59ae8a3f4fa+0x5b
              nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h414163fa73b11f5b+0x22b
              nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run::h6212db222605c535+0xa0b
              nexus`tokio::runtime::context::set_scheduler::hc7592537afd48faa+0x52
              nexus`tokio::runtime::context::runtime::enter_runtime::hb6491214e4ad4d7b+0x177
              nexus`tokio::runtime::scheduler::multi_thread::worker::run::h2e32b967ca57a033+0x78
              nexus`tokio::runtime::task::core::Core<T,S>::poll::h776a0cba30c2ddf9+0x70
              nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h9ed75bd2bd7fab63+0xab
              nexus`tokio::runtime::blocking::pool::Inner::run::ha60b637057eda82e+0xe4
              nexus`std::sys_common::backtrace::__rust_begin_short_backtrace::h9652bd4e8e39d174+0x3f
              nexus`core::ops::function::FnOnce::call_once{{vtable.shim}}::hacd16954d89ded5d+0x7e
              nexus`std::sys::pal::unix::thread::Thread::new::thread_start::ha88bab0a6c5aedb8+0x1b
              libc.so.1`_thrp_setup+0x77
             2447

That looks like the native connection pool is currently making ~500 connection requests per second, which is....bad.
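
For reference, that rate is just the dominant stack's count above divided by the sampling window:

    2447 connect(2) calls / 5 s ≈ 490 connection attempts per second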

bnaecker commented 1 month ago

As expected, we're sending ~500 Hello packets to the server per second here:

root@oxz_nexus_65a11c18:~# dtrace -n 'clickhouse_io*:::packet-sent { @packets[copyinstr(arg0)] = count() }' -n 'tick-5s { printa(@packets); exit(0); }'
dtrace: description 'clickhouse_io*:::packet-sent ' matched 1 probe
dtrace: description 'tick-5s ' matched 1 probe
CPU     ID                    FUNCTION:NAME
103  83706                         :tick-5s
  Hello                                                          1992

That's over 5s, but if we look over a 30s window, which should cover the interval qorb uses to test connection validity, we also see some Ping messages:

root@oxz_nexus_65a11c18:~# dtrace -n 'clickhouse_io*:::packet-sent { @packets[copyinstr(arg0)] = count() }' -n 'tick-30s { printa(@packets); exit(0); }'
dtrace: description 'clickhouse_io*:::packet-sent ' matched 1 probe
dtrace: description 'tick-30s ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  2  83708                        :tick-30s
  Ping                                                           1023
  Hello                                                         16614
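
For context on why Ping shows up at all: qorb periodically revalidates idle slots, and for the native connection that check presumably amounts to a Ping/Pong exchange. Here's a rough sketch of that shape, using hypothetical single-byte packet codes and a plain TcpStream rather than the actual oximeter_db::native connection type:

use std::time::Duration;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

// Hypothetical single-byte packet codes, standing in for the real
// native-protocol encoder/decoder.
const CLIENT_PING: u8 = 4;
const SERVER_PONG: u8 = 4;

// Shape of a pool health check: send a Ping, expect a Pong within a short
// deadline; anything else marks the slot invalid so it gets recycled.
async fn is_valid(conn: &mut TcpStream) -> std::io::Result<()> {
    conn.write_all(&[CLIENT_PING]).await?;
    let mut byte = [0u8; 1];
    tokio::time::timeout(Duration::from_secs(5), conn.read_exact(&mut byte))
        .await
        .map_err(|_| std::io::Error::new(std::io::ErrorKind::TimedOut, "no Pong received"))??;
    if byte[0] == SERVER_PONG {
        Ok(())
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "unexpected packet in response to Ping",
        ))
    }
}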

What packets are we getting back from the server?

root@oxz_nexus_65a11c18:~# dtrace -n 'clickhouse_io*:::packet-received { @packets[copyinstr(arg0)] = count() }' -n 'tick-5s { printa(@packets); exit(0); }'
dtrace: description 'clickhouse_io*:::packet-received ' matched 1 probe
dtrace: description 'tick-5s ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 36  83706                         :tick-5s
  Pong                                                              1

That seems surprising, since it suggests essentially none of those client Hello packets is getting a server Hello back, which is consistent with the connection resets and "No thread available" errors we saw on the server side.

bnaecker commented 1 month ago

Another note, not sure if it's related. The clickhouse-admin services are in maintenance mode. From the ClickHouse zone:

root@oxz_clickhouse_aa646c82:~# svcs -x
svc:/oxide/clickhouse-admin-server:default (Omicron ClickHouse Admin Server for Replica Servers)
 State: maintenance since Wed Oct 23 19:58:55 2024
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /var/svc/log/oxide-clickhouse-admin-server:default.log
Impact: This service is not running.

svc:/oxide/clickhouse-admin-keeper:default (Omicron ClickHouse Admin Keeper Server)
 State: maintenance since Wed Oct 23 19:58:55 2024
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /var/svc/log/oxide-clickhouse-admin-keeper:default.log
Impact: This service is not running.

Those SMF logs are empty, so they've possibly been rotated out or archived. I'll come back to this if it's relevant, no reason to think so yet.

bnaecker commented 1 month ago

@smklein had this good question: "How many pools are we creating?"

Both he and I initially answered "one", based on this code:

https://github.com/oxidecomputer/omicron/blob/4536a6eda73deb8a728182b4b35566989e1e9723/oximeter/db/src/client/mod.rs#L207-L210

However, I also remember that Nexus has methods for creating pools on-demand if it needs to. That's this LazyTimeseriesClient object:

https://github.com/oxidecomputer/omicron/blob/4536a6eda73deb8a728182b4b35566989e1e9723/nexus/src/app/oximeter.rs#L34-L44

It seems plausible that it's making a new client and thus a new pool each time it tries to handle a request that needs access to the ClickHouse database.
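
To illustrate the failure mode we're worried about, here's a minimal sketch (hypothetical names, not the actual LazyTimeseriesClient or oximeter_db code) contrasting a client, and therefore a pool, built per request with one cached for the lifetime of the process:

use std::net::SocketAddr;
use std::sync::Arc;
use tokio::sync::OnceCell;

// Hypothetical stand-in for the timeseries client; assume constructing it also
// creates a pool whose slot workers immediately start dialing ClickHouse.
struct TimeseriesClient;

impl TimeseriesClient {
    fn new(_address: SocketAddr) -> Self {
        TimeseriesClient
    }
}

// Failure mode we're worried about: every request handler builds a fresh
// client, so every request also builds (and then drops) a fresh pool.
async fn handle_request_per_call(address: SocketAddr) {
    let _client = TimeseriesClient::new(address);
    // ... run the query ...
}

// Preferred shape: one client (and one pool) shared for the process lifetime,
// created lazily on first use.
struct LazyClient {
    address: SocketAddr,
    client: OnceCell<Arc<TimeseriesClient>>,
}

impl LazyClient {
    async fn get(&self) -> Arc<TimeseriesClient> {
        let address = self.address;
        self.client
            .get_or_init(|| async move { Arc::new(TimeseriesClient::new(address)) })
            .await
            .clone()
    }
}

If Nexus is effectively on the first path, every API request that touches timeseries data would spin up a fresh pool whose slot workers immediately start dialing ClickHouse.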

bnaecker commented 1 month ago

So it looks like the oximeter_db::Client::new() method is in fact being called, but definitely less frequently than I'd expect:

root@oxz_nexus_65a11c18:~# dtrace -n 'BEGIN { t = timestamp; }' -n 'pid15584::_ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry { printf("%d\n", (timestamp - t) / 1000 / 1000); t = timestamp; }'
dtrace: description 'BEGIN ' matched 1 probe
dtrace: description 'pid15584::_ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 94  83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 439

 26  83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 2132

111  83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 2350

 78  83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 2650

114  83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 9763

 35  83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 11382

^C

That's from 2 seconds up to about 11 seconds between each of those invocations, which is pretty long. I'm not sure where the rest of the connection requests are coming from, but here's roughly what the pool itself should produce:

Taking that together, we would expect roughly 3 connection retries in the first second (since log(1000) ≈ 3) per slot that qorb is trying to maintain, with a maximum of 16 slots per pool in the default policy. That works out to about 64 calls to connect() in the first second. That seems a bit too low to me, especially since there would be only about 6 in the next second (exponential backoff).

But we do also see the connect(2) calls returning EINPROGRESS, which is what we'd expect from the pool's non-blocking connection attempts:

root@oxz_nexus_65a11c18:~# dtrace -n 'syscall::connect:return /arg0 == -1/ {print(errno); ustack(); exit(0);}'
dtrace: description 'syscall::connect:return ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 26  24652                   connect:return int 0x96

              libc.so.1`__so_connect+0xa
              libsocket.so.1`__xnet_connect+0x20
              nexus`_ZN3mio3net3tcp6stream9TcpStream7connect17h23c92d69661bdb3cE+0xaf
              nexus`_ZN87_$LT$oximeter_db..native..connection..Connector$u20$as$u20$qorb..backend..Connector$GT$7connect28_$u7b$$u7b$closure$u7d$$u7d$17h9f6f0997ccc25014E+0x1177
              nexus`_ZN91_$LT$tracing..instrument..Instrumented$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$4poll17hfcb72d789ac943e2E+0x2b7
              nexus`_ZN4qorb4slot21SetWorker$LT$Conn$GT$11create_slot28_$u7b$$u7b$closure$u7d$$u7d$17h69deaa67b1405c05E.llvm.346072908401228557+0x185
              nexus`_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h51e92019f087b50fE+0x35
              nexus`_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h3f59c59ae8a3f4faE+0x5b
              nexus`_ZN5tokio7runtime9scheduler12multi_thread6worker7Context8run_task17h414163fa73b11f5bE+0x22b
              nexus`_ZN5tokio7runtime9scheduler12multi_thread6worker7Context3run17h6212db222605c535E+0xa0b
              nexus`_ZN5tokio7runtime7context13set_scheduler17hc7592537afd48faaE+0x52
              nexus`_ZN5tokio7runtime7context7runtime13enter_runtime17hb6491214e4ad4d7bE+0x177
              nexus`_ZN5tokio7runtime9scheduler12multi_thread6worker3run17h2e32b967ca57a033E+0x78
              nexus`_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h776a0cba30c2ddf9E+0x70
              nexus`_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h9ed75bd2bd7fab63E+0xab
              nexus`_ZN5tokio7runtime8blocking4pool5Inner3run17ha60b637057eda82eE+0xe4
              nexus`_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h9652bd4e8e39d174E+0x3f
              nexus`_ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17hacd16954d89ded5dE+0x7e
              nexus`_ZN3std3sys3pal4unix6thread6Thread3new12thread_start17ha88bab0a6c5aedb8E+0x1b
              libc.so.1`_thrp_setup+0x77

since 0x96 == 150, which is EINPROGRESS.

bnaecker commented 1 month ago

Ah, I did the math wrong above. That should be log_2(1s), not log_10. That means we have about 6 connect requests per second, per slot, per pool, which is about 2 × 16 × 6 ≈ 192 requests per second. That better explains what we see, with hundreds of requests per second overall.
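
Spelled out with units:

    2 pools × 16 slots per pool × 6 connect attempts per slot per second ≈ 192 connect attempts per second

which is the right order of magnitude for the hundreds of connect(2) calls per second we measured with dtrace above.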

bnaecker commented 1 month ago

And as a last bit of evidence, over a 30s window we don't see any calls into the oximeter_db::Client::new() method originating from any other code path in this Nexus process:

root@oxz_nexus_65a11c18:~# dtrace -n 'pid15584::_ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry {  @stacks[ustack()] = count() }' -n 'tick-30s { printa(@stacks); exit(0); }' | demangle
dtrace: description 'pid15584::_ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry ' matched 1 probe
dtrace: description 'tick-30s ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 18  83708                        :tick-30s

              nexus`oximeter_db::client::Client::new::h9f04b5a780ada004
              nexus`omicron_nexus::app::oximeter::LazyTimeseriesClient::get::{{closure}}::hc21d40fc0e390ea1+0x334
              nexus`_$LT$FuncType$u20$as$u20$dropshot..handler..HttpHandlerFunc$LT$Context$C$$LP$T0$C$$RP$$C$ResponseType$GT$$GT$::handle_request::_$u7b$$u7b$closure$u7d$$u7d$::h7bee05e6470e5f5b (.llvm.11044820811040527684)+0x712
              nexus`<dropshot::handler::HttpRouteHandler<Context,HandlerType,FuncParams,ResponseType> as dropshot::handler::RouteHandler<Context>>::handle_request::{{closure}}::hd85326fe1682ce5a+0x29d
              nexus`dropshot::server::http_request_handle::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h6332c65786135a0c (.llvm.16910368522175159780)+0x111
              nexus`tokio::runtime::task::core::Core<T,S>::poll::h49642c9c18711445+0x3b
              nexus`tokio::runtime::task::harness::Harness<T,S>::poll::hda2fe15bdd1e0b03+0x6c
              nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h414163fa73b11f5b+0x22b
              nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run::h6212db222605c535+0xa0b
              nexus`tokio::runtime::context::set_scheduler::hc7592537afd48faa+0x52
              nexus`tokio::runtime::context::runtime::enter_runtime::hb6491214e4ad4d7b+0x177
              nexus`tokio::runtime::scheduler::multi_thread::worker::run::h2e32b967ca57a033+0x78
              nexus`tokio::runtime::task::core::Core<T,S>::poll::h776a0cba30c2ddf9+0x70
              nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h9ed75bd2bd7fab63+0xab
              nexus`tokio::runtime::blocking::pool::Inner::run::ha60b637057eda82e+0xe4
              nexus`std::sys_common::backtrace::__rust_begin_short_backtrace::h9652bd4e8e39d174+0x3f
              nexus`core::ops::function::FnOnce::call_once{{vtable.shim}}::hacd16954d89ded5d+0x7e
              nexus`std::sys::pal::unix::thread::Thread::new::thread_start::ha88bab0a6c5aedb8+0x1b
              libc.so.1`_thrp_setup+0x77
              libc.so.1`_lwp_start
                3
jgallagher commented 1 month ago

Another note, not sure if it's related. The clickhouse-admin services are in maintenance mode. From the ClickHouse zone:

root@oxz_clickhouse_aa646c82:~# svcs -x
svc:/oxide/clickhouse-admin-server:default (Omicron ClickHouse Admin Server for Replica Servers)
 State: maintenance since Wed Oct 23 19:58:55 2024
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /var/svc/log/oxide-clickhouse-admin-server:default.log
Impact: This service is not running.

svc:/oxide/clickhouse-admin-keeper:default (Omicron ClickHouse Admin Keeper Server)
 State: maintenance since Wed Oct 23 19:58:55 2024
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /var/svc/log/oxide-clickhouse-admin-keeper:default.log
Impact: This service is not running.

Those SMF logs are empty, so they've possibly been rotated out or archived. I'll come back to this if it's relevant, no reason to think so yet.

I think this is unrelated? @andrewjstone or @karencfv can correct me if this is wrong, but I think these are related to multinode clickhouse and not supposed to be running. I found the rotated logfiles for these, and in both cases they're failing to start because the executable they're looking for isn't present in the zone (which is maybe correct if these are trying to start multinode clickhouse inside a single node clickhouse zone?):

[ Oct 23 19:58:55 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/clickhouse-admin-keeper/bin/clickhouse-admin-keeper run --config /var/svc/manifest/site/clickhouse-admin-keeper/config.toml --http-address unknown --listen-address unknown --binary-path unknown  &"). ]
ctrun: exec failed: No such file or directory
[ Oct 23 19:58:55 Method "start" exited with status 0. ]
[ Oct 22 00:27:54 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/clickhouse-admin-server/bin/clickhouse-admin-server run --config /var/svc/manifest/site/clickhouse-admin-server/config.toml -http-address unknown --listen-address unknown --binary-path unknown  &"). ]
ctrun: exec failed: No such file or directory
[ Oct 22 00:27:54 Method "start" exited with status 0. ]