oxidecomputer / maghemite

A routing stack written in Rust.
Mozilla Public License 2.0

Let the oximeter producer server use DNS all the time #305

Closed bnaecker closed 3 months ago

bnaecker commented 3 months ago

I'd like to test this manually before merging, will report back.

bnaecker commented 3 months ago

Testing this got blocked by https://github.com/oxidecomputer/omicron/issues/6149. I'm going to test against a known-good commit today.

bnaecker commented 3 months ago

OK, I was finally able to test this on a good commit in Omicron. As expected, the services attempt to use internal DNS to resolve the Nexus address, and then spin trying to register with that address. Here's the mg-ddm service in the global zone, for reference. First, the snippet showing the use of internal DNS:

bnaecker@shale : ~/omicron $ tail -F $(svcs -L mg-ddm) | looker
Jul 25 17:38:42.030 INFO new DNS resolver, addresses: [[fd00:1122:3344:1::1]:53, [fd00:1122:3344:2::1]:53, [fd00:1122:3344:3::1]:53, [fd00:1122:3344:4::1]:53, [fd00:1122:3344:5::1]:53], component: internal-dns-resolver, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/internal-dns/src/resolver.rs:60
Jul 25 17:38:42.031 DEBG starting producer registration task
Jul 25 17:38:42.031 INFO starting oximeter metric producer server, interval: 1s, address: [fd00:1122:3344:101::1]:8001, producer_id: e96de436-f714-4ca7-9fa0-fed1cbc54c25, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:283
Jul 25 17:38:42.031 DEBG registering / renewing oximeter producer lease with Nexus, component: producer-registration-task
Jul 25 17:38:42.034 WARN failed to lookup Nexus IP, will retry, error: "proto error: io error: No route to host (os error 148)", delay: 140.400929ms, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:391
Jul 25 17:38:42.177 WARN failed to lookup Nexus IP, will retry, error: "proto error: io error: No route to host (os error 148)", delay: 667.828771ms, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:391
Jul 25 17:38:42.847 WARN failed to lookup Nexus IP, will retry, error: "proto error: io error: No route to host (os error 148)", delay: 777.402709ms, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:391
Jul 25 17:38:43.626 WARN failed to lookup Nexus IP, will retry, error: "proto error: io error: No route to host (os error 148)", delay: 1.703690402s, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:391
Jul 25 17:38:45.331 WARN failed to lookup Nexus IP, will retry, error: "proto error: io error: No route to host (os error 148)", delay: 2.615421571s, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:391
Jul 25 17:38:47.948 WARN failed to lookup Nexus IP, will retry, error: "proto error: io error: No route to host (os error 148)", delay: 11.641825778s, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:391
17:38:55.195Z WARN slog-rs: [net1] admin event in solicit state: Announce(Underlay({Ipv6Net { addr: fd00:1122:3344:1::, width: 64 }}))
17:38:55.195Z WARN slog-rs: [net0] admin event in solicit state: Announce(Underlay({Ipv6Net { addr: fd00:1122:3344:1::, width: 64 }}))
17:38:55.691Z WARN slog-rs: [net1] admin event in solicit state: Announce(Underlay({Ipv6Net { addr: fd00:1122:3344:2::, width: 64 }}))
17:38:55.691Z WARN slog-rs: [net0] admin event in solicit state: Announce(Underlay({Ipv6Net { addr: fd00:1122:3344:2::, width: 64 }}))
17:39:00.333Z WARN slog-rs: [net1] admin event in solicit state: Announce(Underlay({Ipv6Net { addr: fd00:1122:3344:3::, width: 64 }}))
17:39:00.333Z WARN slog-rs: [net0] admin event in solicit state: Announce(Underlay({Ipv6Net { addr: fd00:1122:3344:3::, width: 64 }}))
Jul 25 17:39:09.597 DEBG lookup_socket_v6 srv, response: SrvLookup(Lookup { query: Query { name: Name("_nexus._tcp.control-plane.oxide.internal"), query_type: SRV, query_class: IN }, records: [Record { name_labels: Name("_nexus._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 12221, target: Name("7041babb-6b81-41bb-8374-825a6d03b940.host.control-plane.oxide.internal.") })) }, Record { name_labels: Name("_nexus._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 12221, target: Name("d2f48005-9353-46af-bffb-7dc5f7963e19.host.control-plane.oxide.internal.") })) }, Record { name_labels: Name("_nexus._tcp.control-plane.oxide.internal."), rr_type: SRV, dns_class: IN, ttl: 0, rdata: Some(SRV(SRV { priority: 0, weight: 0, port: 12221, target: Name("dceb85ea-f13f-47f2-be27-a791c4d53d37.host.control-plane.oxide.internal.") })) }, Record { name_labels: Name("dceb85ea-f13f-47f2-be27-a791c4d53d37.host.control-plane.oxide.internal."), rr_type: AAAA, dns_class: IN, ttl: 0, rdata: Some(AAAA(fd00:1122:3344:101::c)) }], valid_until: Instant { tv_sec: 65235, tv_nsec: 118743497 } }), dns_name: _nexus._tcp.control-plane.oxide.internal, component: internal-dns-resolver
Jul 25 17:39:09.598 DEBG using nexus address for registration, addr: [fd00:1122:3344:101::a]:12221, component: producer-registration-task

It then spins trying to register, which succeeds once Nexus is accepting connections on the underlay:

Jul 25 17:44:42.682 DEBG client request, body: Some(Body), uri: http://[fd00:1122:3344:101::a]:12221/metrics/producers, method: POST, component: producer-registration-task
Jul 25 17:44:57.683 DEBG client response, result: Err(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(fd00:1122:3344:101::a)), port: Some(12221), path: "/metrics/producers", query: None, fragment: None }, source: TimedOut }), component: producer-registration-task
Jul 25 17:44:57.684 WARN failed to register as a producer with Nexus, will retry, error: "Communication Error: error sending request for url (http://[fd00:1122:3344:101::a]:12221/metrics/producers): operation timed out", delay: 182.325883652s, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:424
Jul 25 17:48:00.066 DEBG client request, body: Some(Body), uri: http://[fd00:1122:3344:101::a]:12221/metrics/producers, method: POST, component: producer-registration-task
Jul 25 17:48:10.268 DEBG client response, result: Err(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(fd00:1122:3344:101::a)), port: Some(12221), path: "/metrics/producers", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 146, kind: ConnectionRefused, message: "Connection refused" })) }), component: producer-registration-task
Jul 25 17:48:10.277 WARN failed to register as a producer with Nexus, will retry, error: "Communication Error: error sending request for url (http://[fd00:1122:3344:101::a]:12221/metrics/producers): error trying to connect: tcp connect error: Connection refused (os error 146)", delay: 188.106849958s, component: producer-registration-task, file: /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/c5ed4de/oximeter/producer/src/lib.rs:424
Jul 25 17:51:18.414 DEBG client request, body: Some(Body), uri: http://[fd00:1122:3344:101::a]:12221/metrics/producers, method: POST, component: producer-registration-task
Jul 25 17:51:18.484 DEBG client response, result: Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(fd00:1122:3344:101::a)), port: Some(12221), path: "/metrics/producers", query: None, fragment: None }, status: 201, headers: {"content-type": "application/json", "x-request-id": "d3b42f59-c39c-4c26-b782-d2bfaa3ecffe", "content-length": "41", "date": "Thu, 25 Jul 2024 17:51:18 GMT"} }), component: producer-registration-task
Jul 25 17:51:18.488 DEBG registered with nexus successfully, component: producer-registration-task
Jul 25 17:51:18.488 DEBG pausing until time to renew lease, wait_period: 150s, lease_duration: 600s, component: producer-registration-task
Jul 25 17:51:19.509 INFO accepted connection, remote_addr: [fd00:1122:3344:101::d]:55344, local_addr: [fd00:1122:3344:101::1]:8001, component: dropshot, file: /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/7b594d0/dropshot/src/server.rs:775
Jul 25 17:51:19.533 INFO request completed, latency_us: 21072, response_code: 200, uri: /e96de436-f714-4ca7-9fa0-fed1cbc54c25, method: GET, req_id: f0a32b7e-e3b0-43ae-a230-d16e64d438c2, remote_addr: [fd00:1122:3344:101::d]:55344, local_addr: [fd00:1122:3344:101::1]:8001, component: dropshot, file: /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/7b594d0/dropshot/src/server.rs:914
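
The "will retry ... delay" lines above show a retry loop whose delay grows between attempts. Below is a minimal sketch of that pattern using only the standard library. The `Backoff` type and `retry` function are hypothetical names for illustration, not the producer's actual code, and the sketch uses a plain capped exponential delay, whereas the delays in the log appear to be randomized.

```rust
use std::time::Duration;

/// Hypothetical capped exponential backoff (no jitter, unlike the logged delays).
struct Backoff {
    next: Duration,
    max: Duration,
    factor: u32,
}

impl Backoff {
    fn new(initial: Duration, max: Duration, factor: u32) -> Self {
        Backoff { next: initial, max, factor }
    }

    /// Return the delay to use now, then grow the stored delay up to `max`.
    fn next_delay(&mut self) -> Duration {
        let current = self.next;
        self.next = self.next.saturating_mul(self.factor).min(self.max);
        current
    }
}

/// Call `op` until it succeeds, waiting via `sleep` between failed attempts.
/// `sleep` is injected so the loop can be exercised without real waiting.
fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    backoff: &mut Backoff,
    mut sleep: impl FnMut(Duration),
) -> T {
    loop {
        match op() {
            Ok(value) => return value,
            Err(_) => sleep(backoff.next_delay()),
        }
    }
}

fn main() {
    let mut backoff =
        Backoff::new(Duration::from_millis(100), Duration::from_secs(2), 2);
    let mut attempts = 0;
    let mut delays = Vec::new();

    // Simulated registration: fail twice, then succeed.
    let result = retry(
        || {
            attempts += 1;
            if attempts < 3 { Err("nexus unreachable") } else { Ok(attempts) }
        },
        &mut backoff,
        |delay| delays.push(delay), // record instead of sleeping
    );

    println!("succeeded on attempt {result} after delays {delays:?}");
}
```

In a real service the `sleep` argument would be an actual sleep (for example `std::thread::sleep`), and the cap would presumably be chosen so retries do not outlast the lease renewal period.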

rcgoodfellow commented 3 months ago

I want to make sure I'm reading this correctly. By setting oximeter_producer::Config::registration_address to None, we're asking oximeter_producer::Server to resolve Nexus on its own?

bnaecker commented 3 months ago

That's correct. We specifically tell it to create a resolver from the IPv6 underlay address, using the rack /48 contained in that address. Each time it needs to renew its lease, it will create a resolver, look up Nexus, and register again.
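
As a rough illustration of the "rack /48" step, the sketch below masks an underlay address to its first 48 bits and rebuilds the resolver addresses seen in the log above (`[fd00:1122:3344:1::1]:53` through `[fd00:1122:3344:5::1]:53`). The helper names, the `NexusSource` enum, and the server count of five are assumptions inferred from that log line. They are not taken from the Omicron or oximeter source.

```rust
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6};

/// Assumed from the log above: five internal DNS servers on port 53.
const DNS_SERVER_COUNT: u16 = 5;
const DNS_PORT: u16 = 53;

/// Keep the first three 16-bit segments of the address, i.e. the rack /48.
fn rack_prefix(underlay: Ipv6Addr) -> [u16; 3] {
    let s = underlay.segments();
    [s[0], s[1], s[2]]
}

/// Hypothetical helper: internal DNS server addresses for the rack that
/// contains `underlay`, following the `<rack /48>:N::1` pattern in the log.
fn dns_servers(underlay: Ipv6Addr) -> Vec<SocketAddr> {
    let [a, b, c] = rack_prefix(underlay);
    (1..=DNS_SERVER_COUNT)
        .map(|subnet| {
            let ip = Ipv6Addr::new(a, b, c, subnet, 0, 0, 0, 1);
            SocketAddr::V6(SocketAddrV6::new(ip, DNS_PORT, 0, 0))
        })
        .collect()
}

/// Hypothetical stand-in for the choice discussed in this thread: a fixed
/// registration address, or resolving Nexus through internal DNS.
#[derive(Debug)]
enum NexusSource {
    Fixed(SocketAddr),
    InternalDns(Vec<SocketAddr>),
}

/// `None` means "resolve Nexus via internal DNS", as described above.
fn nexus_source(
    registration_address: Option<SocketAddr>,
    underlay: Ipv6Addr,
) -> NexusSource {
    match registration_address {
        Some(addr) => NexusSource::Fixed(addr),
        None => NexusSource::InternalDns(dns_servers(underlay)),
    }
}

fn main() {
    // Underlay address of the mg-ddm producer server in the log above.
    let underlay: Ipv6Addr = "fd00:1122:3344:101::1".parse().unwrap();
    println!("{:?}", nexus_source(None, underlay));
}
```

Because the resolver is rebuilt on every lease renewal, a producer following this pattern would pick up a changed Nexus address at its next renewal without needing a restart.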