oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

crucible zone in rack2 found offline after zone setup failure #6947

Closed davepacheco closed 2 weeks ago

davepacheco commented 1 month ago

I had a provision hang: dogfood instance c49820df-25ea-4c0f-aeb2-5e16980b6e84, saga 311f4acf-798d-435e-9f1c-d8ef75977aac, which is apparently stuck in regions_ensure (based on it being node 173 that has started but not finished and looking in the saga_dag). From the Nexus log, this went bad around here:

16:28:57.487Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): recording saga event
    event_type = Started
    node_id = 173
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
16:28:57.509Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Region requested, not yet created. Retrying in 295.586665ms
    dataset = 180d466d-eb36-4546-8922-e52c4c076823
    file = nexus/src/app/crucible.rs:230
    region = 33001801-f722-4323-bf54-280f22ba27a8
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:28:57.561Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Region requested, not yet created. Retrying in 331.554523ms
    dataset = e238116d-e5cc-43d4-9c8a-6f138ae8a15d
    file = nexus/src/app/crucible.rs:230
    region = c0ca39f8-7fe0-492e-a382-936c1f33d5be
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:28:57.813Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Region requested, not yet created. Retrying in 361.352408ms
    dataset = 180d466d-eb36-4546-8922-e52c4c076823
    file = nexus/src/app/crucible.rs:230
    region = 33001801-f722-4323-bf54-280f22ba27a8
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:29:12.498Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:10a::b]:32345/crucible/0/regions), retrying...
    file = common/src/progenitor_operation_retry.rs:121
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:29:12.498Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed external call (ProgenitorError(Communication Error: error sending request for url (http://[fd00:1122:3344:10a::b]:32345/crucible/0/regions))), will retry in 137.416942ms
    file = common/src/progenitor_operation_retry.rs:174
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:29:27.644Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:10a::b]:32345/crucible/0/regions), retrying...
    file = common/src/progenitor_operation_retry.rs:121
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create

That zone did not come up properly:

BRM42220016 # svcs -Zxv
svc:/oxide/zone-network-setup:default (Oxide Zone Network Setup)
  Zone: oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c
 State: maintenance since Wed Oct 23 20:07:19 2024
Reason: Start method failed repeatedly, last exited with status 1.
   See: http://illumos.org/msg/SMF-8000-KS
   See: /pool/ext/86c58ea3-1413-4af3-9aff-9c0a3d758459/crypt/zone/oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c/root/var/svc/log/oxide-zone-network-setup:default.log
Impact: 1 dependent service is not running:
        svc:/oxide/crucible/agent:default

That log file is rotated out and there are subsequent empty ones but here's the last non-empty one:

BRM42220016 # cat /pool/ext/d4c6bdc6-5e99-4f6c-b57a-9bfcb9a76be4/crypt/debug/oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c/oxide-zone-network-setup:default.log.1729714039
[ Oct 23 20:07:09 Enabled. ]
[ Oct 23 20:07:09 Rereading configuration. ]
[ Oct 23 20:07:14 Rereading configuration. ]
[ Oct 23 20:07:14 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService10 -s fd00:1122:3344:10a::b -g fd00:1122:3344:10a::1"). ]
note: configured to log to "/dev/stderr"
{"msg":"Ensuring IP interface exists on datalink","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.142871834Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907,"datalink":"oxControlService10"}
{"msg":"Setting MTU to 9000 for IPv6 and IPv4","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.240552047Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907,"datalink":"oxControlService10"}
{"msg":"Setting TCP recv_buf size to 1 MB","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.489292596Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907}
{"msg":"Setting TCP congestion control algorithm to cubic","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.612879477Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907}
{"msg":"Ensuring static and auto-configured addresses are set on the IP interface","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.694530786Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907,"static address":"fd00:1122:3344:10a::b","data link":"\"oxControlService10\""}
zone-setup: failed to ensure static address fd00:1122:3344:10a::b on datalink oxControlService10

Caused by:
    Command [/usr/sbin/ipadm create-addr -t -T addrconf oxControlService10/ll] executed and failed with status: exit status: 1  stdout: 
      stderr: ipadm: Could not create address: Addrconf already in progress
[ Oct 23 20:07:18 Method "start" exited with status 1. ]
[ Oct 23 20:07:18 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService10 -s fd00:1122:3344:10a::b -g fd00:1122:3344:10a::1"). ]
note: configured to log to "/dev/stderr"
{"msg":"Ensuring IP interface exists on datalink","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:18.731988794Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444,"datalink":"oxControlService10"}
{"msg":"Setting MTU to 9000 for IPv6 and IPv4","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:18.803297256Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444,"datalink":"oxControlService10"}
{"msg":"Setting TCP recv_buf size to 1 MB","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:18.908654527Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444}
{"msg":"Setting TCP congestion control algorithm to cubic","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:18.952228803Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444}
{"msg":"Ensuring static and auto-configured addresses are set on the IP interface","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.012102346Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444,"static address":"fd00:1122:3344:10a::b","data link":"\"oxControlService10\""}
zone-setup: failed to ensure static address fd00:1122:3344:10a::b on datalink oxControlService10

Caused by:
    Command [/usr/sbin/ipadm create-addr -t -T addrconf oxControlService10/ll] executed and failed with status: exit status: 1  stdout: 
      stderr: ipadm: Could not create address: Addrconf already in progress
[ Oct 23 20:07:19 Method "start" exited with status 1. ]
[ Oct 23 20:07:19 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService10 -s fd00:1122:3344:10a::b -g fd00:1122:3344:10a::1"). ]
note: configured to log to "/dev/stderr"
{"msg":"Ensuring IP interface exists on datalink","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.478810164Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568,"datalink":"oxControlService10"}
{"msg":"Setting MTU to 9000 for IPv6 and IPv4","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.520545249Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568,"datalink":"oxControlService10"}
{"msg":"Setting TCP recv_buf size to 1 MB","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.588973381Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568}
{"msg":"Setting TCP congestion control algorithm to cubic","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.617077438Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568}
{"msg":"Ensuring static and auto-configured addresses are set on the IP interface","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.659951889Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568,"static address":"fd00:1122:3344:10a::b","data link":"\"oxControlService10\""}
zone-setup: failed to ensure static address fd00:1122:3344:10a::b on datalink oxControlService10

Caused by:
    Command [/usr/sbin/ipadm create-addr -t -T addrconf oxControlService10/ll] executed and failed with status: exit status: 1  stdout: 
      stderr: ipadm: Could not create address: Addrconf already in progress
[ Oct 23 20:07:19 Method "start" exited with status 1. ]

It seems like there are a few issues here:

I'm not sure if these are issues:

bnaecker commented 1 month ago

After a lot of debugging with @rcgoodfellow and @ahl, we believe this is caused by https://www.illumos.org/issues/15782.

The crux of the issue is that:

We've put a core from in.ndpd which proves the invalid state at /staff/core/omicron-6947/core.7593. From there, we can see that the phyint.pi_autoconf flag is set to true on an interface with the right name:

> phyints::print [0] ! grep -E '_name|_autoconf'
    [0].pi_name = [ "oxControlService10" ]
    [0].pi_autoconf = 0x1 (B_TRUE)

That causes this code in ndpd_create_addrs() to find that there already exists an interface and that autoconfiguration is already in progress for it. That's not correct. At this point, we had deleted the IP interface object, but in.ndpd still believes it is there:

root@oxz_crucible_26fb3830:~# ipadm show-if
IFNAME     CLASS     STATE    CURRENT      PERSISTENT
lo0        VIRTUAL   ok       -m-v------46 ---
oxControlService10 IP down    bm--------46 ---
root@oxz_crucible_26fb3830:~# ipadm delete-if oxControlService10
root@oxz_crucible_26fb3830:~# gcore 7593
gcore: core.7593 dumped
root@oxz_crucible_26fb3830:~# mdb core.7593
Loading modules: [ libc.so.1 ld.so.1 ]
> num_of_phyint::print
mdb: failed to dereference symbol: unknown symbol name
> num_of_phyints::print
0x1
> phyints::print [0]
[0] = {
    [0].pi_next = 0
    [0].pi_prev = 0
    [0].pi_prefix_list = 0
    [0].pi_router_list = 0
    [0].pi_adv_prefix_list = 0
    [0].pi_index = 0x2
    [0].pi_name = [ "oxControlService10" ]
    [0].pi_sock = 0xffffffff
    [0].pi_ifaddr = ::
    [0].pi_flags = 0x21002000840
    [0].pi_mtu = 0
    [0].pi_token = ::
    [0].pi_token_length = 0
    [0].pi_stateless = 0x1 (B_TRUE)
    [0].pi_stateful = 0x1 (B_TRUE)
    [0].pi_tmp_token = ::
    [0].pi_dst_token = ::
    [0].pi_state = 0
    [0].pi_kernel_state = 0
    [0].pi_num_k_routers = 0
    [0].pi_reach_time_since_random = 0
    [0].pi_adv_time_left = 0
    [0].pi_adv_time_since_sent = 0
    [0].pi_adv_state = 0 (NO_ADV)
    [0].pi_adv_count = 0
    [0].pi_sol_time_left = 0
    [0].pi_sol_state = 0 (0)
    [0].pi_sol_count = 0
    [0].pi_config = [
        {
            cf_value = 0x1
            cf_notdefault = 0 (0)
        },
        {
            cf_value = 0
            cf_notdefault = 0 (0)
        },
...

Most of the state here is cleared, but the pi_autoconf flag is not. That seems to be reused in error when we try to recreate the address object.

Again, we're not sure how the system got into this state, but we are pretty confident that there is some race that prevents in.ndpd from completely resetting its state, in particular the pi_autoconf field.

rcgoodfellow commented 1 month ago

A simple repro for this issue is

#!/bin/bash

set -e

i=0

while true; do
    svcadm restart ndp
    ipadm create-addr -t -T addrconf vnic0/ll
    ipadm delete-if vnic0
    i=$((i+1))
    printf "\r$i"
done

This reproduces as of helios-2.0.22997