oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

panic in sync_switch_configuration.rs with "bgp config is present but announce set is not populated" #5361

Open sunshowers opened 6 months ago

sunshowers commented 6 months ago

During today's dogfood mupdate, we found a core dump on gc08 (rsync'd over to /staff/dock/rack2/mupdate-20240329/cores/sled-08/core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.5125.1711667683).

Based on timestamps, this corresponds to this message in the log file /pool/ext/8a199f12-4f5c-483a-8aca-f97856658a35/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1711677599:

thread 'tokio-runtime-worker' panicked at nexus/src/app/background/sync_switch_configuration.rs:735:26: 
bgp config is present but announce set is not populated
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Mar 28 23:14:43 Stopping because all processes in service exited. ]
[ Mar 28 23:14:43 Executing stop method (:kill). ]

The assertion is at nexus/src/app/background/sync_switch_configuration.rs:735.

cc @internet-diglett, to whom this code annotates, and @rcgoodfellow for the nearby TODO.

sunshowers commented 6 months ago

Note that this corresponded to some network flakiness that was going on around that time (2024-03-28T23:14:41.638474412Z).

rcgoodfellow commented 6 months ago

I believe CRDB may have been unavailable during this time?

rcgoodfellow commented 6 months ago

Looks like it. Just above the panic I see:

23:14:39.337Z ERRO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed to collect inventory
    background_task = service_zone_nat_tracker
    error = Service Unavailable: Failed to access DB connection: Timed out in bb8
    file = nexus/src/app/background/sync_service_zone_nat.rs:71
...
23:14:41.465Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed to read DNS config
    background_task = dns_config_internal
    current_generation = 1
    current_time_created = 2023-08-30 18:59:10.774294 UTC
    dns_group = internal
    error = Service Unavailable: Failed to access DB connection: Timed out in bb8
    file = nexus/src/app/background/dns_config.rs:72

internet-diglett commented 6 months ago

@sunshowers thanks for catching this, a few expects snuck through. Patching this now.
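
For readers following along, here is a minimal sketch of the general pattern for removing this kind of panic: a background task returns or logs the lookup error and waits for its next activation instead of calling `expect`. This is not the actual Omicron patch. `BgpConfig`, `LookupError`, `lookup_announce_set`, and `sync_once` are hypothetical names invented for illustration, and the database outage is simulated with a boolean flag.

```rust
use std::fmt;

/// Hypothetical stand-in for a BGP configuration record.
#[derive(Debug, Clone)]
struct BgpConfig {
    asn: u32,
    announce_set_id: u64,
}

/// Hypothetical error type for a failed lookup.
#[derive(Debug, PartialEq)]
enum LookupError {
    ServiceUnavailable(String),
}

impl fmt::Display for LookupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LookupError::ServiceUnavailable(msg) => {
                write!(f, "Service Unavailable: {msg}")
            }
        }
    }
}

/// Hypothetical lookup: fails when the (simulated) database is unreachable.
fn lookup_announce_set(
    db_available: bool,
    config: &BgpConfig,
) -> Result<Vec<String>, LookupError> {
    if !db_available {
        return Err(LookupError::ServiceUnavailable(
            "Timed out in bb8".to_string(),
        ));
    }
    Ok(vec![format!("prefix-for-set-{}", config.announce_set_id)])
}

/// One activation of a background task. Rather than
/// `lookup_announce_set(..).expect("...")`, which would take down the whole
/// process, the error is logged and the work is left for the next activation.
fn sync_once(db_available: bool, config: &BgpConfig) -> Option<Vec<String>> {
    match lookup_announce_set(db_available, config) {
        Ok(prefixes) => Some(prefixes),
        Err(e) => {
            eprintln!(
                "failed to load announce set for asn {}: {e}; will retry",
                config.asn
            );
            None
        }
    }
}

fn main() {
    let config = BgpConfig { asn: 65001, announce_set_id: 1 };
    // Database unavailable: the task skips this round instead of panicking.
    assert!(sync_once(false, &config).is_none());
    // Database available: the announce set is returned as usual.
    assert!(sync_once(true, &config).is_some());
}
```

The design point is that a transient failure such as a connection-pool timeout is an expected condition for a periodically activated task, so it is treated as a recoverable error rather than a violated invariant.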