oxidecomputer / omicron

Omicron: Oxide control plane

[rss] could check more for obviously-incoherent configuration #6714

Open iximeow opened 2 days ago

i was modifying smf/sled-agent/non-gimlet/config-rss.toml to run Omicron on a local workstation, and did a fairly unwise thing:

# The IP ranges configured as part of the "internal services" IP Pool
#
# This is a range of *external* IP addresses that will be assigned to the
# External DNS server(s) and Nexus instances.  These are the IPs that users
# *outside* the rack will use to reach these services.  These will also be used
# by services like NTP as their externally-facing address for NAT.
#
# For more on this and what to put here, see docs/how-to-run.adoc.
[[internal_services_ip_pool_ranges]]
first = "192.168.0.128"
last = "192.168.0.140"

...
[rack_network_config]
...
# You can configure multiple uplinks by repeating the following stanza
[[rack_network_config.ports]]
# Routes associated with this port.
routes = [{nexthop = "192.168.0.1", destination = "0.0.0.0/0"}]
# Addresses associated with this port.
addresses = [{address = "192.168.0.130/24"}]

note that internal_services_ip_pool_ranges contains the port address, rack_network_config.ports[0].addresses[0]. so, RSS gets on its merry way of setting up NTP, DNS, etc. but then partway through it gets stuck (not exactly sure what it was stuck on, which is its own problem)

trying to debug this, dig oxide.computer in the NTP zone would hang, as would dig 0.pool.ntp.org (the default and configured ntp_servers name). however, chronyc -n sources listed four IPs, so 0.pool.ntp.org must have resolved at some point earlier. 1.1.1.1 and 9.9.9.9 were reachable from the global zone of the host, and both hostnames resolved there the entire time without issue.

what appears to have happened is that one of the services was started on 192.168.0.130, which conflicts with the router's 192.168.0.130, and so networking to the outside world was fully wedged.

i think we'd be able to check this kind of thing here, though i see some similar validation in wicket here? the former seems like what i'd run into locally, but i'm not sure which workflows involve reconfiguration through wicket.
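
for illustration, a minimal sketch of the kind of check i mean: reject a config where any uplink port address falls inside an internal services pool range. this is not Omicron's actual code; `IpRange`, `port_address`, and `overlapping_addresses` are hypothetical names made up for this example, and it only handles IPv4.

```rust
use std::net::Ipv4Addr;

/// Hypothetical stand-in for one `[[internal_services_ip_pool_ranges]]` entry.
struct IpRange {
    first: Ipv4Addr,
    last: Ipv4Addr,
}

impl IpRange {
    /// True if `addr` lies within `first..=last`.
    fn contains(&self, addr: Ipv4Addr) -> bool {
        let a = u32::from(addr);
        u32::from(self.first) <= a && a <= u32::from(self.last)
    }
}

/// Strip the prefix length from "a.b.c.d/len" and parse the address part.
/// Returns None for anything that is not a valid IPv4 address.
fn port_address(cidr: &str) -> Option<Ipv4Addr> {
    cidr.split('/').next()?.parse().ok()
}

/// Return every uplink port address that falls inside a service pool range.
/// Unparseable entries are skipped here; a real check would report them.
fn overlapping_addresses(pool: &[IpRange], port_cidrs: &[&str]) -> Vec<Ipv4Addr> {
    port_cidrs
        .iter()
        .filter_map(|cidr| port_address(cidr))
        .filter(|addr| pool.iter().any(|r| r.contains(*addr)))
        .collect()
}

fn main() {
    // Values taken from the config-rss.toml excerpt above.
    let pool = [IpRange {
        first: Ipv4Addr::new(192, 168, 0, 128),
        last: Ipv4Addr::new(192, 168, 0, 140),
    }];
    let conflicts = overlapping_addresses(&pool, &["192.168.0.130/24"]);
    for addr in &conflicts {
        eprintln!("port address {addr} is inside internal_services_ip_pool_ranges");
    }
}
```

with the values from my config this flags 192.168.0.130, which is the overlap that wedged things. failing early with a message like that would have saved the debugging above.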