Open jordanhendricks opened 1 year ago
To make sure I didn't have some cruft left around, I did a full cleanup and a reboot. I saw the same failure when I tried to stand up omicron post-reboot.
The ntp zone thinks the clock is well synced:
root@oxz_ntp_517cf03b-7086-4a1c-b6e7-0054588608ab:~# dladm
LINK               CLASS  MTU   STATE  BRIDGE  OVER
vopte0             vnic   1500  up     --      ?
oxControlService4  vnic   9000  up     --      ?
root@oxz_ntp_517cf03b-7086-4a1c-b6e7-0054588608ab:~# chronyc tracking
Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Mon Sep 25 15:05:11 2023
System time     : 0.000000000 seconds fast of NTP time
Last offset     : +0.000000000 seconds
RMS offset      : 0.000000000 seconds
Frequency       : 0.117 ppm fast
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 0.000000000 seconds
Root dispersion : 0.000000000 seconds
Update interval : 0.0 seconds
Leap status     : Normal
It's a bit misleading: this output says the NTP zone is not synced, yet it is advertising itself as synced to downstream clients. It's basically running standalone (in "lost upstream connectivity" mode), so sled agent won't consider it synced.
The things that tell me it's in this mode are:
See https://github.com/oxidecomputer/omicron/blob/9c43c8cf2f7dd332289343f864ba32710c2f4b3e/sled-agent/src/services.rs#L2416-L2422 for the logic that determines if we're in sync based on the tracking information.
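As a rough illustration of this kind of check, here is a minimal Python sketch that parses human-readable `chronyc tracking` output and flags the standalone mode. It is not the sled-agent logic linked above (that code is Rust and may use different criteria). It assumes that a reference ID of 7F7F0101 (127.127.1.1, chrony's local reference) or of zero means no upstream source is selected; the function names are made up for the example.

```python
# Sketch only: flag a chrony daemon that is serving time from its local
# reference rather than from an upstream peer. Not the sled-agent logic.

LOCAL_REFID = 0x7F7F0101  # 127.127.1.1, chrony's local reference ID


def parse_tracking(text):
    """Parse `chronyc tracking` output into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def has_upstream_peer(fields):
    """Assumption: refid 0 or the local refid means no upstream source."""
    ref_id = int(fields["Reference ID"].split()[0], 16)
    return ref_id not in (0, LOCAL_REFID)


# Values taken from the output earlier in this thread.
SAMPLE = """\
Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Mon Sep 25 15:05:11 2023
Leap status     : Normal
"""

fields = parse_tracking(SAMPLE)
print(has_upstream_peer(fields), fields["Stratum"])
```

With the output from this issue, the check reports no upstream peer even though the leap status reads Normal.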
In this mode, there is presumably a communication problem with the outside world for DNS and/or NTP.
We found that the softnpu zone was talking over igb0, when it should've been talking over the fake etherstub:
sc0_1 vnic 1500 up -- igb0
It seems plausible this got created at some point by the create virtual hw script when PHYSICAL_LINK was not set properly to fake_external_stub0. Deleting a bunch of vnics by hand and trying the whole flow again seemed to bring the stack up beyond this point.
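A stale vnic like this could be spotted mechanically. The following is a minimal Python sketch, assuming link listings in the six-column LINK/CLASS/MTU/STATE/BRIDGE/OVER layout shown in this thread; the helper name and the `demo0` row are hypothetical, and a real check would restrict itself to the vnics that are supposed to sit on the external link, since other vnics legitimately ride on other links.

```python
# Sketch only: find vnics whose OVER column is not the expected link.
# Assumes the six-column LINK/CLASS/MTU/STATE/BRIDGE/OVER layout.


def unexpected_vnics(listing, expected_over, links=None):
    """Return (link, over) pairs for vnics not over `expected_over`.

    `links`, if given, restricts the check to those vnic names.
    """
    found = []
    for line in listing.splitlines():
        cols = line.split()
        if len(cols) != 6 or cols[0] == "LINK":
            continue  # skip the header and anything malformed
        link, cls, over = cols[0], cols[1], cols[5]
        if cls != "vnic":
            continue
        if links is not None and link not in links:
            continue
        if over in ("?", "--"):
            continue  # lower link not visible from here; nothing to compare
        if over != expected_over:
            found.append((link, over))
    return found


# The sc0_1 row is from this thread; demo0 is a made-up example row.
LISTING = """\
LINK   CLASS  MTU   STATE  BRIDGE  OVER
demo0  vnic   1500  up     --      fake_external_stub0
sc0_1  vnic   1500  up     --      igb0
"""

print(unexpected_vnics(LISTING, "fake_external_stub0"))
```

Any vnic it reports is a candidate for deletion before re-running the setup flow.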
We also found that I didn't have GATEWAY_MAC set. This isn't a documented step in the "how to run" doc. Maybe it should be?
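A small pre-flight check could flag both configuration problems from this thread before the virtual hardware script runs. This is a minimal Python sketch, not an existing omicron tool: `check_env` is a hypothetical helper, the expected PHYSICAL_LINK value is the one that applied to this particular setup, and the MAC pattern is a loose assumption that tolerates octets without zero padding.

```python
# Sketch only: a hypothetical pre-flight check for the two environment
# variables discussed in this thread. Not part of omicron.
import re

# Loose assumption: six colon-separated octets of one or two hex digits.
MAC_RE = re.compile(r"^[0-9a-fA-F]{1,2}(:[0-9a-fA-F]{1,2}){5}$")


def check_env(env, expected_link="fake_external_stub0"):
    """Return a list of human-readable problems found in `env`."""
    problems = []
    link = env.get("PHYSICAL_LINK", "")
    if link != expected_link:
        problems.append(
            f"PHYSICAL_LINK is {link!r}, expected {expected_link!r}"
        )
    mac = env.get("GATEWAY_MAC", "")
    if not mac:
        problems.append("GATEWAY_MAC is not set")
    elif not MAC_RE.match(mac):
        problems.append(f"GATEWAY_MAC {mac!r} does not look like a MAC address")
    return problems


# Example: the situation described above (wrong link, no gateway MAC).
for problem in check_env({"PHYSICAL_LINK": "igb0"}):
    print(problem)
```

Passing `os.environ` instead of a literal dict would check the current shell's settings.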
Do you mind if I update the name of this issue? I think this is totally a bug worth fixing, but kinda want to clarify that this is related to configuring hardware correctly, rather than an issue with RSS.
@smklein what do you think the bug is here to fix? I think the title accurately reflects the situation I was debugging, particularly because it describes the configuration as a development deployment rather than a rack.
For my part, I think the way forward is a combination of needing better observability into what RSS (and sled agent generally) is doing, a la #1881, and needing more rigorous deployment tools such that it's much harder to end up in this situation (e.g., #4130).
Your comment here seemed to be the root cause: https://github.com/oxidecomputer/omicron/issues/4138#issuecomment-1734223353
I agree that the other issues are worth fixing to improve visibility, but I wanted to clarify that this is more of a "bad config resulted in sadness, how can we prevent + flag that case?" issue rather than "RSS had a good config, but did the wrong thing" issue.
I tried to deploy current main on dunkin, a helios lab machine (dfb6853f).
At first, I hit issues due to chrony not being installed (#4094), but I installed that package, then restarted the baseline service.
After a few laps of trying to stand up omicron (including running an uninstall command from omicron-package and running the destroy virtual hw script between laps), sled agent is stuck trying to timesync.
These are the zones that came up:
The ntp zone thinks the clock is well synced:
On the sled agent side, I see: