RSS of single-node non-gimlet deployment got stuck trying to time sync

jordanhendricks commented 1 year ago

I tried to deploy current main on dunkin, a helios lab machine (dfb6853f).

At first, I hit issues due to chrony not being installed (#4094), but I installed that package, then restarted the baseline service.

After a few laps of trying to stand up omicron (running the uninstall command from omicron-package and the destroy virtual hw script between laps), sled agent is stuck waiting for time sync.

These are the zones that came up:

global
sidecar_softnpu
oxz_switch
oxz_internal_dns_f8491614-6fd1-40f5-88c7-6d433054ec43
oxz_internal_dns_9295a7fc-6942-4ee6-98dc-803429a63d98
oxz_internal_dns_7e1a3d47-8086-45a6-b88d-3700f8b9cff7
oxz_ntp_517cf03b-7086-4a1c-b6e7-0054588608ab

The ntp zone thinks the clock is well synced:

root@oxz_ntp_517cf03b-7086-4a1c-b6e7-0054588608ab:~# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
vopte0      vnic      1500   up       --         ?
oxControlService4 vnic 9000  up       --         ?
root@oxz_ntp_517cf03b-7086-4a1c-b6e7-0054588608ab:~# chronyc tracking
Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Mon Sep 25 15:05:11 2023
System time     : 0.000000000 seconds fast of NTP time
Last offset     : +0.000000000 seconds
RMS offset      : 0.000000000 seconds
Frequency       : 0.117 ppm fast
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 0.000000000 seconds
Root dispersion : 0.000000000 seconds
Update interval : 0.0 seconds
Leap status     : Normal

On the sled agent side, I see:

15:05:12.485Z INFO SledAgent (RSS): Timesync for [fd00:1122:3344:101::1]:12345 TimeSync { sync: false, ref_id: 2139029761, ip_addr: ::, stratum: 10, ref_time: 1695654311.4054208, correction: 0.0 }
    file = sled-agent/src/rack_setup/service.rs:476
15:05:12.485Z WARN SledAgent (RSS): Time is not yet synchronized
    error = "Time is synchronized on 0/1 sleds"

jordanhendricks commented 1 year ago

To make sure I didn't have some cruft left around, I did a full cleanup and a reboot. I saw the same failure when I tried to stand up omicron post-reboot.

citrus-it commented 1 year ago

The ntp zone thinks the clock is well synced:

root@oxz_ntp_517cf03b-7086-4a1c-b6e7-0054588608ab:~# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
vopte0      vnic      1500   up       --         ?
oxControlService4 vnic 9000  up       --         ?
root@oxz_ntp_517cf03b-7086-4a1c-b6e7-0054588608ab:~# chronyc tracking
Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Mon Sep 25 15:05:11 2023
System time     : 0.000000000 seconds fast of NTP time
Last offset     : +0.000000000 seconds
RMS offset      : 0.000000000 seconds
Frequency       : 0.117 ppm fast
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 0.000000000 seconds
Root dispersion : 0.000000000 seconds
Update interval : 0.0 seconds
Leap status     : Normal

It's a bit misleading: this output says that the NTP zone is not synced to any upstream, even though it is advertising itself as synced to downstream clients. It's basically running standalone (the mode used when upstream connectivity is lost), so sled agent won't consider it synced.

The things that tell me it's in this mode are:

stratum 10 - we use 10 when we have no upstream peers;
reference ID 7F7F0101, i.e. 127.127.1.1 (local);
0.000000000 seconds fast... this doesn't happen if we're really syncing to an upstream server.

See https://github.com/oxidecomputer/omicron/blob/9c43c8cf2f7dd332289343f864ba32710c2f4b3e/sled-agent/src/services.rs#L2416-L2422 for the logic that determines if we're in sync based on the tracking information.
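As a rough illustration of the indicators listed above, here is a minimal Rust sketch. It is not the code at the linked lines; the `Tracking` struct, the constant names, and both function names are hypothetical, and the predicate only encodes the stratum and reference-ID checks described in this comment. It also shows that the decimal `ref_id` in the sled agent log (2139029761) is the same value as chronyc's hex `7F7F0101`.

```rust
use std::net::Ipv4Addr;

/// Reference ID reported in this thread when chrony runs on its local clock
/// (0x7F7F0101, which reads as 127.127.1.1).
const LOCAL_REF_ID: u32 = 0x7F7F_0101;

/// Stratum that, per this thread, is used when there are no upstream peers.
const NO_UPSTREAM_STRATUM: u8 = 10;

/// Hypothetical subset of the fields printed in the RSS `TimeSync` log line.
pub struct Tracking {
    pub ref_id: u32,
    pub stratum: u8,
}

/// Render a reference ID as a dotted quad so it can be compared with the
/// hex value that `chronyc tracking` prints.
pub fn ref_id_as_ipv4(ref_id: u32) -> Ipv4Addr {
    Ipv4Addr::from(ref_id)
}

/// Returns true only when none of the "standalone" indicators apply:
/// the reference ID is set, it is not the local reference, and the
/// stratum is below the no-upstream value.
pub fn looks_synced_to_upstream(t: &Tracking) -> bool {
    t.ref_id != 0 && t.ref_id != LOCAL_REF_ID && t.stratum < NO_UPSTREAM_STRATUM
}
```

Feeding in the values from the log above (`ref_id: 2139029761`, `stratum: 10`) makes `looks_synced_to_upstream` return false, which matches the `sync: false` that RSS reported.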

In this mode, there is presumably a communication problem with the outside world for DNS and/or NTP.

jordanhendricks commented 1 year ago

We found that the softnpu zone was talking over igb0, when it should've been talking over the fake etherstub:

sc0_1       vnic      1500   up       --         igb0

It seems plausible this got created at some point by the create virtual hw script when PHYSICAL_LINK was not set properly to fake_external_stub0. Deleting a bunch of vnics by hand and trying the whole flow again seemed to bring the stack up beyond this point.
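One way this kind of leftover could be flagged before starting a deployment is sketched below in Rust. This is an assumption-laden illustration, not existing omicron tooling: `LinkRow`, `parse_link_rows`, and `misplaced_links` are hypothetical names, and the parser only understands the six-column `LINK CLASS MTU STATE BRIDGE OVER` table in the form pasted in this thread. The caller is assumed to supply the link-name prefix to check (for example `sc0_`) and the expected underlying link (`fake_external_stub0`).

```rust
/// One row of the six-column link table shown in this thread
/// (only the columns needed for the check are kept).
#[derive(Debug, PartialEq)]
pub struct LinkRow {
    pub link: String,
    pub over: String,
}

/// Parse rows of "LINK CLASS MTU STATE BRIDGE OVER" output, skipping the
/// header and any line without exactly six whitespace-separated fields.
pub fn parse_link_rows(output: &str) -> Vec<LinkRow> {
    output
        .lines()
        .map(|line| line.split_whitespace().collect::<Vec<_>>())
        .filter(|fields| fields.len() == 6 && fields[0] != "LINK")
        .map(|fields| LinkRow {
            link: fields[0].to_string(),
            over: fields[5].to_string(),
        })
        .collect()
}

/// Return the links whose name starts with `prefix` but that sit over
/// something other than `expected_over`.
pub fn misplaced_links<'a>(
    rows: &'a [LinkRow],
    prefix: &str,
    expected_over: &str,
) -> Vec<&'a LinkRow> {
    rows.iter()
        .filter(|row| row.link.starts_with(prefix) && row.over != expected_over)
        .collect()
}
```

Run against the line above, `misplaced_links(&rows, "sc0_", "fake_external_stub0")` would return the `sc0_1` row because its OVER column is `igb0`.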

We also found that I didn't have GATEWAY_MAC set. This isn't a documented step in the "how to run" doc. Maybe it should be?

smklein commented 1 year ago

Do you mind if I update the name of this issue? I think this is totally a bug worth fixing, but kinda want to clarify that this is related to configuring hardware correctly, rather than an issue with RSS.

jordanhendricks commented 1 year ago

@smklein what do you think the bug is here to fix? I think the title accurately reflects the situation I was debugging, particularly because it describes the configuration as a development deployment rather than a rack.

For my part, I think the way forward is a combination of needing better observability into what RSS (and sled agent generally) is doing, a la #1881, and needing more rigorous deployment tools such that it's much harder to end up in this situation (e.g., #4130).

smklein commented 1 year ago

Your comment here seemed to be the root cause: https://github.com/oxidecomputer/omicron/issues/4138#issuecomment-1734223353

I agree that the other issues are worth fixing to improve visibility, but I wanted to clarify that this is more of a "bad config resulted in sadness, how can we prevent + flag that case?" issue rather than "RSS had a good config, but did the wrong thing" issue.
