oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
248 stars 39 forks source link

Test flake: failure to sync NTP #4307

Closed jgallagher closed 12 months ago

jgallagher commented 12 months ago

https://github.com/oxidecomputer/omicron/pull/4305/checks?check_run_id=17908233198 failed on an innocuous change (bumping the SP version, which is not used in CI tests at all yet). The specific error that triggered the failure is a sequence of failures to log in, which eventually times out (~10 minutes):

...
2023-10-20 17:42:26.709172171 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:42:27.711043866 UTC: attempting to log into API
2023-10-20 17:42:42.743305427 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:42:43.744381066 UTC: attempting to log into API
2023-10-20 17:42:58.777480954 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:42:59.778226376 UTC: attempting to log into API
2023-10-20 17:43:14.810998417 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:43:15.811739399 UTC: attempting to log into API
2023-10-20 17:43:30.843854357 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:43:31.844650630 UTC: attempting to log into API
2023-10-20 17:43:46.878150046 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
Error: logging in

Caused by:
    timed out after 609.29703172s

No Nexus logs were created; from sled-agent's logs, it looks like we never got time synchronized. The first time sync log messages start at 17:33:

17:33:38.278Z INFO SledAgent (RSS): Waiting for rack time synchronization
    file = sled-agent/src/rack_setup/service.rs:468
17:33:38.311Z INFO SledAgent (RSS): Checking time synchronization for [fd00:1122:3344:101::1]:12345...
    file = sled-agent/src/rack_setup/service.rs:448
17:33:38.320Z INFO SledAgent (ServiceManager): chronyc command failed: Error running command in zone 'oxz_ntp_1f3dd88e-2680-4aed-ad69-acd5b518c0b2': Command [/usr/bin/chronyc -c tracking] executed and failed with status: exit status: 1  stdout: 506 Cannot talk to daemon
      stderr:
    file = sled-agent/src/services.rs:2473
17:33:38.323Z WARN SledAgent (RSS): Time is not yet synchronized
    error = "Time is synchronized on 0/1 sleds"
    file = sled-agent/src/rack_setup/service.rs:498
17:33:38.443Z INFO SledAgent (RSS): Checking time synchronization for [fd00:1122:3344:101::1]:12345...
    file = sled-agent/src/rack_setup/service.rs:448
17:33:38.453Z INFO SledAgent (RSS): Timesync for [fd00:1122:3344:101::1]:12345 TimeSync { sync: false, ref_id: 2139029761, ip_addr: ::, stratum: 10, ref_time: 1697823217.262026, correction: 0.0 }
    file = sled-agent/src/rack_setup/service.rs:476
17:33:38.453Z WARN SledAgent (RSS): Time is not yet synchronized
    error = "Time is synchronized on 0/1 sleds"
    file = sled-agent/src/rack_setup/service.rs:498

and the last time sync log message is just under 10 minutes later:

17:42:57.904Z INFO SledAgent (RSS): Checking time synchronization for [fd00:1122:3344:101::1]:12345...
    file = sled-agent/src/rack_setup/service.rs:448
17:42:57.918Z INFO SledAgent (RSS): Timesync for [fd00:1122:3344:101::1]:12345 TimeSync { sync: false, ref_id: 2139029761, ip_addr: ::, stratum: 10, ref_time: 1697823776.7655292, correction: 0.0 }
    file = sled-agent/src/rack_setup/service.rs:476
17:42:57.918Z WARN SledAgent (RSS): Time is not yet synchronized
    error = "Time is synchronized on 0/1 sleds"
    file = sled-agent/src/rack_setup/service.rs:498

The chrony tracking.log seems relatively unhelpful?

===================================================================================================================================
   Date (UTC) Time     IP Address   St   Freq ppm   Skew ppm     Offset L Co  Offset sd Rem. corr. Root delay Root disp. Max. error
===================================================================================================================================
2023-10-20 17:33:38 0.0.0.0          0     -0.000 1000000.000  0.000e+00 ?  0  0.000e+00  0.000e+00  1.000e+00  1.000e+00  1.500e+00
bnaecker commented 12 months ago

Hit the same flake a few times during CI runs for #4298, see here for the check run.

leftwo commented 12 months ago

Hit the same flake a few times during CI runs for #4306 , see here for the check run.

luqmana commented 12 months ago

From chat: this was due to a mismatch in the sidecar-lite version getting pulled in. #3986 (now in main) includes a change to create_virtual_hardware.sh which should fix this.