Just noting that this is causing oximeter to drop some samples too, because the database appears unavailable:
21:45:34.775Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB: Telemetry database unavailable: error sending request for url (http://[fd00:1122:3344:107::4]:8123/?output_format_json_quote_64bit_integers=0&wait_end_of_query=1)
collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
collector_ip = fd00:1122:3344:10a::3
file = oximeter/collector/src/agent.rs:343
The first question I have is: "who is making connections to ClickHouse?" Looking at connstat, we see an enormous number of connections:
root@oxz_clickhouse_aa646c82:~# connstat | wc -l
21466
Those appear to be almost exclusively from one peer:
root@oxz_clickhouse_aa646c82:~# connstat -P -o raddr | sort | uniq -c
1 0.0.0.0
2 ::
21461 fd00:1122:3344:103::3
194 fd00:1122:3344:10a::3
Ok, who is that dominant peer? From the switch zone we can use omdb to see which sled owns that /64:
root@oxz_switch0:~# omdb db sleds
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (110.0.0)
note: listing all commissioned sleds (use -F to filter, e.g. -F in-service)
SERIAL IP ROLE POLICY STATE ID
BRM42220057 [fd00:1122:3344:104::1]:12345 - in service active 0c7011f7-a4bf-4daf-90cc-1c2410103300
BRM44220011 [fd00:1122:3344:103::1]:12345 - in service active 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
BRM42220017 [fd00:1122:3344:109::1]:12345 - in service active 5f6720b8-8a31-45f8-8c94-8e699218f28b
BRM42220051 [fd00:1122:3344:10b::1]:12345 scrimlet in service active 71def415-55ad-46b4-ba88-3ca55d7fb287
BRM44220010 [fd00:1122:3344:101::1]:12345 - in service active 7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c
BRM42220014 [fd00:1122:3344:108::1]:12345 scrimlet in service active 87c2c4fc-b0c7-4fef-a305-78f0ed265bbc
BRM42220031 [fd00:1122:3344:102::1]:12345 - in service active a2adea92-b56e-44fc-8a0d-7d63b5fd3b93
BRM42220006 [fd00:1122:3344:106::1]:12345 - in service active b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
BRM27230045 [fd00:1122:3344:123::1]:12345 - not provisionable active bd96ef7c-4941-4729-b6f7-5f47feecbc4b
BRM42220009 [fd00:1122:3344:107::1]:12345 - not provisionable active db183874-65b5-4263-a1c1-ddb2737ae0e9
BRM42220016 [fd00:1122:3344:10a::1]:12345 - in service active dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd
BRM44220005 [fd00:1122:3344:105::1]:12345 - in service active f15774c1-b8e5-434f-a493-ec43f96cba06
So that appears to be from BRM44220011. That's running a Nexus zone, which is one of the primary peers I'd expect to be talking to the database:
root@oxz_switch0:~# omdb sled-agent zones list --sled-agent-url http://[fd00:1122:3344:103::1]:12345
zones:
"oxz_crucible_01f93020-7e7d-4185-93fb-6ca234056c82"
"oxz_crucible_072fdae8-2adf-4fd2-94ce-e9b0663b91e7"
"oxz_crucible_0b41c560-3b20-42f4-82ad-92f5bb575d6b"
"oxz_crucible_0ccf27c0-e32d-4b52-a2c5-6db0c64a26f9"
"oxz_crucible_585cd8c5-c41e-4be4-beb8-bfbef9b53856"
"oxz_crucible_7153983f-8fd7-4fb9-92ac-0f07a07798b4"
"oxz_crucible_7d44ba36-4a69-490a-bc40-f6f90a4208d4"
"oxz_crucible_a6ba8273-0320-4dab-b801-281f041b0c50"
"oxz_crucible_b9b7b4c2-284a-4ec1-80ea-75b7a43b71c4"
"oxz_crucible_e238116d-e5cc-43d4-9c8a-6f138ae8a15d"
"oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7"
"oxz_ntp_7a85d50e-b524-41c1-a052-118027eb77db"
"oxz_propolis-server_6e637a89-2bb1-4327-89ac-6f55757fd3ec"
"oxz_propolis-server_770f0d5f-5f87-4d8d-a4d7-7b16602f2408"
And we do see that Nexus actually has that address, from inside the zone (there may be a better way to answer this question from omdb directly):
root@oxz_nexus_65a11c18:~# ipadm
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
opte0/public dhcp ok 172.30.2.5/32
lo0/v6 static ok ::1/128
oxControlService11/ll addrconf ok fe80::8:20ff:fea0:7894%oxControlService11/10
oxControlService11/omicron6 static ok fd00:1122:3344:103::3/64
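As a quick cross-check of the address math: each sled owns a /64, so matching the connstat peer to a sled (and to this zone's address) comes down to comparing the first 64 bits. A tiny sketch using only std, with a made-up helper name:

use std::net::Ipv6Addr;

// Hypothetical helper: two addresses share a /64 iff their first 8 octets match.
fn same_slash64(a: Ipv6Addr, b: Ipv6Addr) -> bool {
    a.octets()[..8] == b.octets()[..8]
}

fn main() {
    // The dominant peer from connstat (also the Nexus zone's address per ipadm).
    let peer: Ipv6Addr = "fd00:1122:3344:103::3".parse().unwrap();
    // BRM44220011's sled address from the omdb output above.
    let sled: Ipv6Addr = "fd00:1122:3344:103::1".parse().unwrap();
    assert!(same_slash64(peer, sled));
}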
I'm looking at the Nexus zone now to see who's making those connections.
So tracing the connection initiation requests, we see this:
root@oxz_nexus_65a11c18:~# dtrace -n 'syscall::connect:entry { @s[ustack()] = count() }' -n 'tick-5s { printa(@s); exit(0); }' | demangle
dtrace: description 'syscall::connect:entry ' matched 1 probe
dtrace: description 'tick-5s ' matched 1 probe
CPU ID FUNCTION:NAME
102 83706 :tick-5s
libc.so.1`__so_connect+0xa
libsocket.so.1`__xnet_connect+0x20
nexus`socket2::socket::Socket::connect::ha84b246a25a25bc2+0x11
nexus`tokio::net::tcp::socket::TcpSocket::connect::_$u7b$$u7b$closure$u7d$$u7d$::hee2064ce65f35984 (.llvm.493235490398768526)+0x8c
nexus`<tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll::h779dec244837c2d8+0x57
nexus`_$LT$hyper_util..client..legacy..connect..http..HttpConnector$LT$R$GT$$u20$as$u20$tower_service..Service$LT$http..uri..Uri$GT$$GT$::call::_$u7b$$u7b$closure$u7d$$u7d$::hf2dfdfbd1d841723 (.llvm.493235490398768526)+0x15ca
nexus`<hyper_tls::client::HttpsConnector<T> as tower_service::Service<http::uri::Uri>>::call::{{closure}}::he380e93710428451+0xa7
nexus`reqwest::connect::Connector::connect_with_maybe_proxy::_$u7b$$u7b$closure$u7d$$u7d$::ha0008491dfc374f5 (.llvm.493235490398768526)+0x506
nexus`<tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll::hba69ffa88c019da0+0x51
nexus`reqwest::connect::with_timeout::{{closure}}::he9b72e11e4c21b2b+0x1eb
nexus`<hyper_util::service::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll::h776641a5f2736c96+0xc6
nexus`<futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll::h455634d23bea4049+0x46
nexus`<futures_util::future::try_future::try_flatten::TryFlatten<Fut,<Fut as futures_core::future::TryFuture>::Ok> as core::future::future::Future>::poll::h73f61a9fee7470bc+0x8a
nexus`<hyper_util::common::lazy::Lazy<F,R> as core::future::future::Future>::poll::he69c75153b48ad46+0x57d
nexus`<futures_util::future::select::Select<A,B> as core::future::future::Future>::poll::h4055650920fb23af+0x4d
nexus`hyper_util::client::legacy::client::Client<C,B>::send_request::{{closure}}::h0ef69981ad9dd2bd+0x9d2
nexus`<hyper_util::client::legacy::client::ResponseFuture as core::future::future::Future>::poll::h04f6f02c685e038a+0x16
nexus`<reqwest::async_impl::client::PendingRequest as core::future::future::Future>::poll::h46db6dfd6289a8cb+0x117
nexus`<reqwest::async_impl::client::Pending as core::future::future::Future>::poll::hc96d0bb5b15cbe72+0x39
nexus`nexus_inventory::collector::Collector::collect_one_mgs::{{closure}}::h643aa9efb0b432a9+0x6ee
1
libc.so.1`__so_connect+0xa
libsocket.so.1`__xnet_connect+0x20
nexus`socket2::socket::Socket::connect::ha84b246a25a25bc2+0x11
nexus`tokio::net::tcp::socket::TcpSocket::connect::_$u7b$$u7b$closure$u7d$$u7d$::hee2064ce65f35984 (.llvm.493235490398768526)+0x8c
nexus`_$LT$hyper_util..client..legacy..connect..http..HttpConnector$LT$R$GT$$u20$as$u20$tower_service..Service$LT$http..uri..Uri$GT$$GT$::call::_$u7b$$u7b$closure$u7d$$u7d$::hf2dfdfbd1d841723 (.llvm.493235490398768526)+0x13c3
nexus`<hyper_tls::client::HttpsConnector<T> as tower_service::Service<http::uri::Uri>>::call::{{closure}}::he380e93710428451+0xa7
nexus`reqwest::connect::Connector::connect_with_maybe_proxy::{{closure}}::ha0008491dfc374f5+0x59f
nexus`reqwest::connect::with_timeout::{{closure}}::he9b72e11e4c21b2b+0x89
nexus`<hyper_util::service::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll::h776641a5f2736c96+0xc6
nexus`<futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll::h455634d23bea4049+0x46
nexus`<futures_util::future::try_future::try_flatten::TryFlatten<Fut,<Fut as futures_core::future::TryFuture>::Ok> as core::future::future::Future>::poll::h73f61a9fee7470bc+0x8a
nexus`<hyper_util::common::lazy::Lazy<F,R> as core::future::future::Future>::poll::he69c75153b48ad46+0x57d
nexus`<futures_util::future::select::Select<A,B> as core::future::future::Future>::poll::h4055650920fb23af+0x4d
nexus`hyper_util::client::legacy::client::Client<C,B>::send_request::{{closure}}::h0ef69981ad9dd2bd+0x9d2
nexus`<hyper_util::client::legacy::client::ResponseFuture as core::future::future::Future>::poll::h04f6f02c685e038a+0x16
nexus`<reqwest::async_impl::client::PendingRequest as core::future::future::Future>::poll::h46db6dfd6289a8cb+0x117
nexus`<reqwest::async_impl::client::Pending as core::future::future::Future>::poll::hc96d0bb5b15cbe72+0x39
nexus`oximeter_db::client::Client::execute_with_body::{{closure}}::ha20c0c426735d9b9+0x5ec
nexus`oximeter_db::client::Client::get_schema_locked::{{closure}}::h99bc77900a38faa6+0x309
nexus`oximeter_db::client::oxql::<impl oximeter_db::client::Client>::run_oxql_query::{{closure}}::h8bdcd15ee41178e1+0x67b
2
libc.so.1`__so_connect+0xa
libsocket.so.1`__xnet_connect+0x20
nexus`mio::net::tcp::stream::TcpStream::connect::h23c92d69661bdb3c+0xaf
nexus`<oximeter_db::native::connection::Connector as qorb::backend::Connector>::connect::{{closure}}::h9f6f0997ccc25014+0x1177
nexus`<tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hfcb72d789ac943e2+0x2b7
nexus`qorb::slot::SetWorker$LT$Conn$GT$::create_slot::_$u7b$$u7b$closure$u7d$$u7d$::h69deaa67b1405c05 (.llvm.346072908401228557)+0x185
nexus`tokio::runtime::task::core::Core<T,S>::poll::h51e92019f087b50f+0x35
nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h3f59c59ae8a3f4fa+0x5b
nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h414163fa73b11f5b+0x26f
nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run::h6212db222605c535+0xa0b
nexus`tokio::runtime::context::set_scheduler::hc7592537afd48faa+0x52
nexus`tokio::runtime::context::runtime::enter_runtime::hb6491214e4ad4d7b+0x177
nexus`tokio::runtime::scheduler::multi_thread::worker::run::h2e32b967ca57a033+0x78
nexus`tokio::runtime::task::core::Core<T,S>::poll::h776a0cba30c2ddf9+0x70
nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h9ed75bd2bd7fab63+0xab
nexus`tokio::runtime::blocking::pool::Inner::run::ha60b637057eda82e+0xe4
nexus`std::sys_common::backtrace::__rust_begin_short_backtrace::h9652bd4e8e39d174+0x3f
nexus`core::ops::function::FnOnce::call_once{{vtable.shim}}::hacd16954d89ded5d+0x7e
nexus`std::sys::pal::unix::thread::Thread::new::thread_start::ha88bab0a6c5aedb8+0x1b
libc.so.1`_thrp_setup+0x77
3
libc.so.1`__so_connect+0xa
libsocket.so.1`__xnet_connect+0x20
nexus`socket2::socket::Socket::connect::ha84b246a25a25bc2+0x11
nexus`tokio::net::tcp::socket::TcpSocket::connect::_$u7b$$u7b$closure$u7d$$u7d$::hee2064ce65f35984 (.llvm.493235490398768526)+0x8c
nexus`<tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll::h779dec244837c2d8+0x57
nexus`_$LT$hyper_util..client..legacy..connect..http..HttpConnector$LT$R$GT$$u20$as$u20$tower_service..Service$LT$http..uri..Uri$GT$$GT$::call::_$u7b$$u7b$closure$u7d$$u7d$::hf2dfdfbd1d841723 (.llvm.493235490398768526)+0x15ca
nexus`<hyper_tls::client::HttpsConnector<T> as tower_service::Service<http::uri::Uri>>::call::{{closure}}::he380e93710428451+0xa7
nexus`reqwest::connect::Connector::connect_with_maybe_proxy::_$u7b$$u7b$closure$u7d$$u7d$::ha0008491dfc374f5 (.llvm.493235490398768526)+0x506
nexus`<tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll::hba69ffa88c019da0+0x51
nexus`reqwest::connect::with_timeout::{{closure}}::he9b72e11e4c21b2b+0x1eb
nexus`<hyper_util::service::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll::h776641a5f2736c96+0xc6
nexus`<futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll::h455634d23bea4049+0x46
nexus`<futures_util::future::try_future::try_flatten::TryFlatten<Fut,<Fut as futures_core::future::TryFuture>::Ok> as core::future::future::Future>::poll::h73f61a9fee7470bc+0x8a
nexus`<hyper_util::common::lazy::Lazy<F,R> as core::future::future::Future>::poll::he69c75153b48ad46+0x57d
nexus`<futures_util::future::select::Select<A,B> as core::future::future::Future>::poll::h4055650920fb23af+0x4d
nexus`hyper_util::client::legacy::client::Client<C,B>::send_request::{{closure}}::h0ef69981ad9dd2bd+0x9d2
nexus`<hyper_util::client::legacy::client::ResponseFuture as core::future::future::Future>::poll::h04f6f02c685e038a+0x16
nexus`<reqwest::async_impl::client::PendingRequest as core::future::future::Future>::poll::h46db6dfd6289a8cb+0x117
nexus`<reqwest::async_impl::client::Pending as core::future::future::Future>::poll::hc96d0bb5b15cbe72+0x39
nexus`<omicron_nexus::app::background::tasks::v2p_mappings::V2PManager as omicron_nexus::app::background::BackgroundTask>::activate::{{closure}}::he4157ee71ece5d3d+0x10eb
12
libc.so.1`__so_connect+0xa
libsocket.so.1`__xnet_connect+0x20
nexus`mio::net::tcp::stream::TcpStream::connect::h23c92d69661bdb3c+0xaf
nexus`<oximeter_db::native::connection::Connector as qorb::backend::Connector>::connect::{{closure}}::h9f6f0997ccc25014+0x1177
nexus`<tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::hfcb72d789ac943e2+0x2b7
nexus`qorb::slot::SetWorker$LT$Conn$GT$::create_slot::_$u7b$$u7b$closure$u7d$$u7d$::h69deaa67b1405c05 (.llvm.346072908401228557)+0x185
nexus`tokio::runtime::task::core::Core<T,S>::poll::h51e92019f087b50f+0x35
nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h3f59c59ae8a3f4fa+0x5b
nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h414163fa73b11f5b+0x22b
nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run::h6212db222605c535+0xa0b
nexus`tokio::runtime::context::set_scheduler::hc7592537afd48faa+0x52
nexus`tokio::runtime::context::runtime::enter_runtime::hb6491214e4ad4d7b+0x177
nexus`tokio::runtime::scheduler::multi_thread::worker::run::h2e32b967ca57a033+0x78
nexus`tokio::runtime::task::core::Core<T,S>::poll::h776a0cba30c2ddf9+0x70
nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h9ed75bd2bd7fab63+0xab
nexus`tokio::runtime::blocking::pool::Inner::run::ha60b637057eda82e+0xe4
nexus`std::sys_common::backtrace::__rust_begin_short_backtrace::h9652bd4e8e39d174+0x3f
nexus`core::ops::function::FnOnce::call_once{{vtable.shim}}::hacd16954d89ded5d+0x7e
nexus`std::sys::pal::unix::thread::Thread::new::thread_start::ha88bab0a6c5aedb8+0x1b
libc.so.1`_thrp_setup+0x77
2447
That looks like the native connection pool is currently making ~500 connection requests per second, which is... bad.
As expected, we're sending ~500 Hello packets to the server per second here:
root@oxz_nexus_65a11c18:~# dtrace -n 'clickhouse_io*:::packet-sent { @packets[copyinstr(arg0)] = count() }' -n 'tick-5s { printa(@packets); exit(0); }'
dtrace: description 'clickhouse_io*:::packet-sent ' matched 1 probe
dtrace: description 'tick-5s ' matched 1 probe
CPU ID FUNCTION:NAME
103 83706 :tick-5s
Hello 1992
That's over 5s, but if we watch for 30s, which should cover the interval qorb uses to test connection validity, we also see some Ping messages:
root@oxz_nexus_65a11c18:~# dtrace -n 'clickhouse_io*:::packet-sent { @packets[copyinstr(arg0)] = count() }' -n 'tick-30s { printa(@packets); exit(0); }'
dtrace: description 'clickhouse_io*:::packet-sent ' matched 1 probe
dtrace: description 'tick-30s ' matched 1 probe
CPU ID FUNCTION:NAME
2 83708 :tick-30s
Ping 1023
Hello 16614
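(For orientation, here's the rough shape of the connector involved. The trait below is an illustrative stand-in with approximated names and signatures, not qorb's actual API: the pool calls connect to fill empty slots, and revalidates live connections on an interval, which for the native ClickHouse connector is the Ping/Pong exchange above.)

use std::io;
use std::net::SocketAddr;

// Stand-in for a native ClickHouse session held in a pool slot.
struct Connection;

// Illustrative stand-in for the pool's connector interface (approximated,
// not qorb's actual trait): `connect` fills empty slots, and `is_valid`
// runs on a periodic timer, mapping to the Ping packets observed above.
trait PoolConnector {
    fn connect(&self, address: SocketAddr) -> io::Result<Connection>;
    fn is_valid(&self, conn: &mut Connection) -> io::Result<()>;
}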
What packets are we getting back from the server?
root@oxz_nexus_65a11c18:~# dtrace -n 'clickhouse_io*:::packet-received { @packets[copyinstr(arg0)] = count() }' -n 'tick-5s { printa(@packets); exit(0); }'
dtrace: description 'clickhouse_io*:::packet-received ' matched 1 probe
dtrace: description 'tick-5s ' matched 1 probe
CPU ID FUNCTION:NAME
36 83706 :tick-5s
Pong 1
That seems surprising, since it suggests none of those client Hello packets are getting back server Hellos!
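For context, here's a simplified sketch of the native-protocol packet types at play (illustrative only, not the full oximeter_db definitions). A successful handshake pairs each client Hello with a server Hello; Ping/Pong checks liveness afterwards:

#[derive(Debug)]
enum ClientPacket {
    Hello, // first packet sent on every new TCP connection
    Ping,  // periodic liveness probe on an established connection
}

#[derive(Debug)]
enum ServerPacket {
    Hello, // handshake reply -- the packet we are *not* seeing here
    Pong,  // reply to a client Ping; we saw exactly one in 5s
}

fn main() {
    // ~2000 client Hellos in 5s with no server Hellos back means the
    // handshake never completes, so every pool slot retries from scratch.
    println!("{:?} expects {:?}", ClientPacket::Hello, ServerPacket::Hello);
    println!("{:?} expects {:?}", ClientPacket::Ping, ServerPacket::Pong);
}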
Another note, not sure if it's related: the clickhouse-admin services are in maintenance mode. From the ClickHouse zone:
root@oxz_clickhouse_aa646c82:~# svcs -x
svc:/oxide/clickhouse-admin-server:default (Omicron ClickHouse Admin Server for Replica Servers)
State: maintenance since Wed Oct 23 19:58:55 2024
Reason: Restarting too quickly.
See: http://illumos.org/msg/SMF-8000-L5
See: /var/svc/log/oxide-clickhouse-admin-server:default.log
Impact: This service is not running.
svc:/oxide/clickhouse-admin-keeper:default (Omicron ClickHouse Admin Keeper Server)
State: maintenance since Wed Oct 23 19:58:55 2024
Reason: Restarting too quickly.
See: http://illumos.org/msg/SMF-8000-L5
See: /var/svc/log/oxide-clickhouse-admin-keeper:default.log
Impact: This service is not running.
Those SMF logs are empty, so they've possibly been rotated out or archived. I'll come back to this if it's relevant; no reason to think so yet.
@smklein had this good question: "How many pools are we creating?"
Both he and I initially answered "one", based on this code:
However, I also remember that Nexus has methods for creating pools on-demand if it needs to. That's this LazyTimeseriesClient object:
It seems plausible that it's making a new client, and thus a new pool, each time it tries to handle a request that needs access to the ClickHouse database.
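A minimal sketch of that failure mode (all names and signatures here are hypothetical stand-ins, not the actual Nexus code): if the lazy getter builds a fresh client on each call instead of caching one, each call also builds a fresh pool, whose slot workers immediately start dialing ClickHouse.

use std::net::SocketAddr;

struct Pool; // stand-in for qorb's set of connection slots

struct Client {
    _pool: Pool, // constructing a Client spins up pool workers that start dialing
}

impl Client {
    fn new(address: SocketAddr) -> Self {
        let _ = address;
        Client { _pool: Pool }
    }
}

// Hypothetical shape of the lazy handle. The bug pattern: `get` returns a
// brand-new Client (and pool) per call; a cached version would construct
// the Client once (e.g. in a tokio::sync::OnceCell) and reuse it.
struct LazyTimeseriesClient {
    address: SocketAddr,
}

impl LazyTimeseriesClient {
    fn get(&self) -> Client {
        Client::new(self.address)
    }
}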
So it looks like the oximeter_db::Client::new() method is in fact being called, but definitely less frequently than I'd expect:
root@oxz_nexus_65a11c18:~# dtrace -n 'BEGIN { t = timestamp; }' -n 'pid15584::_ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry { printf("%d\n", (timestamp - t) / 1000 / 1000); t = timestamp; }'
dtrace: description 'BEGIN ' matched 1 probe
dtrace: description 'pid15584::_ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry ' matched 1 probe
CPU ID FUNCTION:NAME
94 83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 439
26 83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 2132
111 83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 2350
78 83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 2650
114 83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 9763
35 83709 _ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry 11382
^C
That's from 2 seconds up to about 11 seconds between each of those invocations, which is pretty long. I'm not sure where the rest of the connection requests are coming from, but we can estimate what the rate should be from qorb's policy. That looks like exponential backoff with a multiplier of 2, going from a minimum of 20ms to a maximum of 30s of backoff, with some random delay added to avoid a thundering herd. Taking that together, in the first second we would expect roughly 3 connection retries (since log(1000) ≈ 3) per slot that qorb is trying to maintain, of which there are at most 16 per pool in the default policy. That works out to about 64 calls to connect() in the first second. That seems a bit too low to me, especially since there would be only about 6 in the next second (exponential backoff).
But we also do see the connect(2) requests returning EINPROGRESS, which is also consistent:
root@oxz_nexus_65a11c18:~# dtrace -n 'syscall::connect:return /arg0 == -1/ {print(errno); ustack(); exit(0);}'
dtrace: description 'syscall::connect:return ' matched 1 probe
CPU ID FUNCTION:NAME
26 24652 connect:return int 0x96
libc.so.1`__so_connect+0xa
libsocket.so.1`__xnet_connect+0x20
nexus`_ZN3mio3net3tcp6stream9TcpStream7connect17h23c92d69661bdb3cE+0xaf
nexus`_ZN87_$LT$oximeter_db..native..connection..Connector$u20$as$u20$qorb..backend..Connector$GT$7connect28_$u7b$$u7b$closure$u7d$$u7d$17h9f6f0997ccc25014E+0x1177
nexus`_ZN91_$LT$tracing..instrument..Instrumented$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$4poll17hfcb72d789ac943e2E+0x2b7
nexus`_ZN4qorb4slot21SetWorker$LT$Conn$GT$11create_slot28_$u7b$$u7b$closure$u7d$$u7d$17h69deaa67b1405c05E.llvm.346072908401228557+0x185
nexus`_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h51e92019f087b50fE+0x35
nexus`_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h3f59c59ae8a3f4faE+0x5b
nexus`_ZN5tokio7runtime9scheduler12multi_thread6worker7Context8run_task17h414163fa73b11f5bE+0x22b
nexus`_ZN5tokio7runtime9scheduler12multi_thread6worker7Context3run17h6212db222605c535E+0xa0b
nexus`_ZN5tokio7runtime7context13set_scheduler17hc7592537afd48faaE+0x52
nexus`_ZN5tokio7runtime7context7runtime13enter_runtime17hb6491214e4ad4d7bE+0x177
nexus`_ZN5tokio7runtime9scheduler12multi_thread6worker3run17h2e32b967ca57a033E+0x78
nexus`_ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h776a0cba30c2ddf9E+0x70
nexus`_ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h9ed75bd2bd7fab63E+0xab
nexus`_ZN5tokio7runtime8blocking4pool5Inner3run17ha60b637057eda82eE+0xe4
nexus`_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h9652bd4e8e39d174E+0x3f
nexus`_ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17hacd16954d89ded5dE+0x7e
nexus`_ZN3std3sys3pal4unix6thread6Thread3new12thread_start17ha88bab0a6c5aedb8E+0x1b
libc.so.1`_thrp_setup+0x77
since 0x96 == 150, which is EINPROGRESS.
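That's expected for non-blocking sockets: mio puts the socket in non-blocking mode before calling connect(2), so the call returns immediately with EINPROGRESS and completion is reported later through the event loop. A small sketch of that semantic using the socket2 and libc crates (note that a loopback connect may also complete immediately and return Ok):

use socket2::{Domain, Socket, Type};
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // A local listener to dial, just so the connect attempt is well-formed.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let socket = Socket::new(Domain::IPV4, Type::STREAM, None)?;
    socket.set_nonblocking(true)?;
    match socket.connect(&addr.into()) {
        // The normal non-blocking result: the connection is underway, not failed.
        Err(e) if e.raw_os_error() == Some(libc::EINPROGRESS) => {
            println!("connect(2) returned EINPROGRESS, as expected")
        }
        other => println!("connect(2) returned {:?}", other),
    }
    Ok(())
}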
Ah, I did the math wrong above. That should be log_2, not log_10: with a 20ms minimum backoff, log_2(1s / 20ms) ≈ 6. That means we have about 6 connect requests per second, per slot, per pool, which is about 2 × 16 × 6 = 192 requests per second. That better explains what we see, with hundreds of requests per second overall.
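As a sanity check on that arithmetic, here's a tiny simulation of the backoff policy as described above (20ms minimum, 2x multiplier, 30s cap), ignoring the random jitter:

fn main() {
    let mut delay_ms = 20u64; // minimum backoff
    let mut elapsed_ms = 0u64;
    let mut attempts = 0u32;
    while elapsed_ms < 1_000 {
        attempts += 1; // attempt now, then back off
        elapsed_ms += delay_ms;
        delay_ms = (delay_ms * 2).min(30_000); // multiplier of 2, capped at 30s
    }
    // Prints 6: attempts at t = 0, 20, 60, 140, 300, and 620 ms, matching
    // the estimate of ~6 connect() calls per slot in the first second.
    println!("connect attempts in the first second: {attempts}");
}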
And as a last bit of evidence, we don't see any calls into the oximeter_db::Client::new() method that originate from any other point in the code (in this Nexus process), over a 30s window:
root@oxz_nexus_65a11c18:~# dtrace -n 'pid15584::_ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry { @stacks[ustack()] = count() }' -n 'tick-30s { printa(@stacks); exit(0); }' | demangle
dtrace: description 'pid15584::_ZN11oximeter_db6client6Client3new17h9f04b5a780ada004E:entry ' matched 1 probe
dtrace: description 'tick-30s ' matched 1 probe
CPU ID FUNCTION:NAME
18 83708 :tick-30s
nexus`oximeter_db::client::Client::new::h9f04b5a780ada004
nexus`omicron_nexus::app::oximeter::LazyTimeseriesClient::get::{{closure}}::hc21d40fc0e390ea1+0x334
nexus`_$LT$FuncType$u20$as$u20$dropshot..handler..HttpHandlerFunc$LT$Context$C$$LP$T0$C$$RP$$C$ResponseType$GT$$GT$::handle_request::_$u7b$$u7b$closure$u7d$$u7d$::h7bee05e6470e5f5b (.llvm.11044820811040527684)+0x712
nexus`<dropshot::handler::HttpRouteHandler<Context,HandlerType,FuncParams,ResponseType> as dropshot::handler::RouteHandler<Context>>::handle_request::{{closure}}::hd85326fe1682ce5a+0x29d
nexus`dropshot::server::http_request_handle::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h6332c65786135a0c (.llvm.16910368522175159780)+0x111
nexus`tokio::runtime::task::core::Core<T,S>::poll::h49642c9c18711445+0x3b
nexus`tokio::runtime::task::harness::Harness<T,S>::poll::hda2fe15bdd1e0b03+0x6c
nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h414163fa73b11f5b+0x22b
nexus`tokio::runtime::scheduler::multi_thread::worker::Context::run::h6212db222605c535+0xa0b
nexus`tokio::runtime::context::set_scheduler::hc7592537afd48faa+0x52
nexus`tokio::runtime::context::runtime::enter_runtime::hb6491214e4ad4d7b+0x177
nexus`tokio::runtime::scheduler::multi_thread::worker::run::h2e32b967ca57a033+0x78
nexus`tokio::runtime::task::core::Core<T,S>::poll::h776a0cba30c2ddf9+0x70
nexus`tokio::runtime::task::harness::Harness<T,S>::poll::h9ed75bd2bd7fab63+0xab
nexus`tokio::runtime::blocking::pool::Inner::run::ha60b637057eda82e+0xe4
nexus`std::sys_common::backtrace::__rust_begin_short_backtrace::h9652bd4e8e39d174+0x3f
nexus`core::ops::function::FnOnce::call_once{{vtable.shim}}::hacd16954d89ded5d+0x7e
nexus`std::sys::pal::unix::thread::Thread::new::thread_start::ha88bab0a6c5aedb8+0x1b
libc.so.1`_thrp_setup+0x77
libc.so.1`_lwp_start
3
I think the clickhouse-admin maintenance state noted above is unrelated? @andrewjstone or @karencfv can correct me if this is wrong, but I think those services are related to multi-node ClickHouse and are not supposed to be running here. I found the rotated logfiles for these, and in both cases they're failing to start because the executable they're looking for isn't present in the zone (which is maybe correct, if these are trying to start multi-node ClickHouse inside a single-node ClickHouse zone?):
[ Oct 23 19:58:55 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/clickhouse-admin-keeper/bin/clickhouse-admin-keeper run --config /var/svc/manifest/site/clickhouse-admin-keeper/config.toml --http-address unknown --listen-address unknown --binary-path unknown &"). ]
ctrun: exec failed: No such file or directory
[ Oct 23 19:58:55 Method "start" exited with status 0. ]
[ Oct 22 00:27:54 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/clickhouse-admin-server/bin/clickhouse-admin-server run --config /var/svc/manifest/site/clickhouse-admin-server/config.toml --http-address unknown --listen-address unknown --binary-path unknown &"). ]
ctrun: exec failed: No such file or directory
[ Oct 22 00:27:54 Method "start" exited with status 0. ]
I logged into the ClickHouse zone on the dogfood rack to experiment with something, and the official ClickHouse client CLI appears unable to make connections to the server:
The service appears to be running, but tailing the log greets one with a spew of this:
Client connections are spawned in new threads, and ClickHouse is apparently unable to do that. I worry this is due to the connection pooling we've been adding recently, either alone or in combination with the OxQL queries we're making from outside. I don't have any other details at this point.