oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
239 stars 34 forks source link

RSS stuck for single-node non-gimlet deployment due to missing zpool? #4202

Open jordanhendricks opened 10 months ago

jordanhendricks commented 10 months ago

I don't quite understand what's going on here, but I want to teardown this deployment and try standing up omicron again so I'm not blocked, so I'm dumping some state for now.

I deployed omicron on dunkin, a lab machine. The stack never quite came up, with this message repeatedly showing up from sled agent:

    file = sled-agent/src/rack_setup/service.rs:627
18:07:10.200Z INFO SledAgent (RSS): Failed to handoff to nexus: Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "09f78537-e5f6-4452-82d2-280e443e9e69", "content-length": "174", "date": "Wed, 04 Oct 2023 18:07:09 GMT"}; value: Error { error_code: Some("ObjectNotFound"), message: "not found: zpool with id \"616b26df-e62a-4c68-b506-f4a923d8aaf7\"", request_id: "09f78537-e5f6-4452-82d2-280e443e9e69" }

That zpool exists on this system:

$ zpool status oxp_616b26df-e62a-4c68-b506-f4a923d8aaf7
  pool: oxp_616b26df-e62a-4c68-b506-f4a923d8aaf7
 state: ONLINE
  scan: none requested
config:

        NAME                                                                           STATE     READ WRITE CKSUM
        oxp_616b26df-e62a-4c68-b506-f4a923d8aaf7                                       ONLINE       0     0     0
          /home/jordan/omicron/tools/../oxp_616b26df-e62a-4c68-b506-f4a923d8aaf7.vdev  ONLINE       0     0     0

errors: No known data errors

I looked for evidence of this request on the 3 nexuses in my deployment, and couldn't find a matching request (I looked for request ID and also just grepped for that zpool uuid). That log line comes from here:

https://github.com/oxidecomputer/omicron/blob/a9104a0786f437e38d4945dd77470bc8fd4f974c/sled-agent/src/rack_setup/service.rs#L608-L635

jordanhendricks commented 10 months ago

Looking through the logs post hoc, it looks like the handoff did eventually complete, about a minute later. These were the messages in between the last error from above and the completion:

{"msg":"Received prefixes from ddmd","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:32.745866852Z","hostname":"dunkin","pid":4730,"DdmAdminClient":"[::1]:8000","component":"BootstrapAgentStartup","file":"ddm-admin-client/src/lib.rs:119","prefixes":"{\"fe80::d433:44ff:fe50:3ca8\": [PathVector { destination: Ipv6Prefix { addr: fd00:99::, len: 64 }, path: [\"oxz_switch\"] }]}"}                                   
{"msg":"accepted connection","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.942868135Z","hostname":"dunkin","pid":4730,"local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","file":"/home/jordan/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:769","remote_addr":"[fd00:1122:3344:101::c]:51363"}                                                            
{"msg":"Ensuring VPC firewall rules","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.948501454Z","hostname":"dunkin","pid":4730,"component":"PortManager","file":"illumos-utils/src/opte/port_manager.rs:393","rules":"[VpcFirewallRule { status: Enabled, direction: Inbound, targets: [NetworkInterface { id: ccdd720a-c447-4071-8608-55a709f5a4e4, kind: Service { id: cd2d9f6e-47c7-45d8-9a59-87bb81dcf67f }, name: Name(\"external-dns-cd2d9f6e-47c7-45d8-9a59-87bb81dcf67f\"), ip: 172.30.1.5, mac: MacAddr(MacAddr6([168, 64, 37, 255, 221, 180])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.1.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 1 }, NetworkInterface { id: f492fafb-57e1-4f06-8118-68f1400adf24, kind: Service { id: b0047890-c324-4074-94e4-4081355481d6 }, name: Name(\"external-dns-b0047890-c324-4074-94e4-4081355481d6\"), ip: 172.30.1.6, mac: MacAddr(MacAddr6([168, 64, 37, 255, 208, 46])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.1.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 1 }], filter_hosts: None, filter_ports: Some([L4PortRange { first: L4Port(53), last: L4Port(53) }]), filter_protocols: Some([Udp]), action: Allow, priority: VpcFirewallRulePriority(65534) }, VpcFirewallRule { status: Enabled, direction: Inbound, targets: [NetworkInterface { id: 769d8337-f5ba-4b32-852b-f5281f24e60f, kind: Service { id: 9c0ec56b-4e51-4350-bd7f-0af5c478f755 }, name: Name(\"nexus-9c0ec56b-4e51-4350-bd7f-0af5c478f755\"), ip: 172.30.2.5, mac: MacAddr(MacAddr6([168, 64, 37, 255, 225, 195])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.2.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 1 }, NetworkInterface { id: 9fa8a9e7-4a24-405f-8664-5bd7af0ba5fc, kind: Service { id: 52f2ebeb-0edd-46af-bc40-7f6cfe578e82 }, name: Name(\"nexus-52f2ebeb-0edd-46af-bc40-7f6cfe578e82\"), ip: 172.30.2.6, mac: MacAddr(MacAddr6([168, 64, 37, 255, 176, 25])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.2.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 1 }, NetworkInterface { id: bca51406-c1e2-4ff4-bedc-1a5b34803a81, kind: Service { id: 45e6a84a-e1f1-4baf-bcc7-002b0bb4edc1 }, name: Name(\"nexus-45e6a84a-e1f1-4baf-bcc7-002b0bb4edc1\"), ip: 172.30.2.7, mac: MacAddr(MacAddr6([168, 64, 37, 255, 136, 149])), subnet: V4(Ipv4Net(Ipv4Network { addr: 172.30.2.0, prefix: 24 })), vni: Vni(100), primary: true, slot: 1 }], filter_hosts: None, filter_ports: Some([L4PortRange { first: L4Port(80), last: L4Port(80) }, L4PortRange { first: L4Port(443), last: L4Port(443) }]), filter_protocols: Some([Tcp]), action: Allow, priority: VpcFirewallRulePriority(65534) }]","vni":"Vni(100)"}                                                                                                                                                           {"msg":"Setting OPTE firewall rules","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.948581756Z","hostname":"dunkin","pid":4730,"component":"PortManager","file":"illumos-utils/src/opte/port_manager.rs:412","rules":"[FirewallRule { direction: In, filters: Filters { hosts: Any, protocol: Proto(TCP), ports: PortList([80, 443]) }, action: Allow, priority: 65534 }]","port":"\"opte5\""}                           
{"msg":"Setting OPTE firewall rules","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.948622956Z","hostname":"dunkin","pid":4730,"component":"PortManager","file":"illumos-utils/src/opte/port_manager.rs:412","rules":"[FirewallRule { direction: In, filters: Filters { hosts: Any, protocol: Proto(TCP), ports: PortList([80, 443]) }, action: Allow, priority: 65534 }]","port":"\"opte1\""}                           
{"msg":"Setting OPTE firewall rules","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.948652637Z","hostname":"dunkin","pid":4730,"component":"PortManager","file":"illumos-utils/src/opte/port_manager.rs:412","rules":"[FirewallRule { direction: In, filters: Filters { hosts: Any, protocol: Proto(TCP), ports: PortList([80, 443]) }, action: Allow, priority: 65534 }]","port":"\"opte3\""}                           
{"msg":"Setting OPTE firewall rules","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.948679528Z","hostname":"dunkin","pid":4730,"component":"PortManager","file":"illumos-utils/src/opte/port_manager.rs:412","rules":"[FirewallRule { direction: In, filters: Filters { hosts: Any, protocol: Proto(UDP), ports: PortList([53]) }, action: Allow, priority: 65534 }]","port":"\"opte4\""}                                
{"msg":"Setting OPTE firewall rules","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.948762619Z","hostname":"dunkin","pid":4730,"component":"PortManager","file":"illumos-utils/src/opte/port_manager.rs:412","rules":"[]","port":"\"opte0\""}                                                                                                                                                                            
{"msg":"Setting OPTE firewall rules","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.94882059Z","hostname":"dunkin","pid":4730,"component":"PortManager","file":"illumos-utils/src/opte/port_manager.rs:412","rules":"[FirewallRule { direction: In, filters: Filters { hosts: Any, protocol: Proto(UDP), ports: PortList([53]) }, action: Allow, priority: 65534 }]","port":"\"opte2\""}                                 
{"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:35.948854711Z","hostname":"dunkin","pid":4730,"uri":"/vpc/001de000-074c-4000-8000-000000000000/firewall/rules","method":"PUT","req_id":"267a4bbd-12c7-4a95-a1de-2e75083b78e7","remote_addr":"[fd00:1122:3344:101::c]:51363","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","file":"/home/jordan/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:853","latency_us":5767,"response_code":"204"}                                                                                                           {"msg":"Handoff to Nexus is complete","v":0,"name":"SledAgent","level":30,"time":"2023-10-04T18:07:36.139567766Z","hostname":"dunkin","pid":4730,"component":"RSS","file":"sled-agent/src/rack_setup/service.rs:637"}

I'm not sure what the expected behavior here is with regard to the error and retry pattern, but #1881 feels increasingly critical to me to be able to debug standing up omicron.