oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Nexus handoff failed - (loopback) "address unavailable" #3616

Closed citrus-it closed 1 year ago

citrus-it commented 1 year ago

During dogfood mupdate today, the final nexus handoff failed:

10:29:10.097Z INFO SledAgent (RSS): Failed to handoff to nexus: Error Response:
status: 400 Bad Request; headers: {"content-type": "application/json",
"x-request-id": "035ad549-d9d7-4a0f-9aa4-e047b17bb160", "content-length": "128",
"date": "Fri, 14 Jul 2023 10:29:10 GMT"}; value: Error { error_code: Some("InvalidRequest"),
message: "address unavailable", request_id: "035ad549-d9d7-4a0f-9aa4-e047b17bb160" }

and the nexus logs contained:

10:41:09.003Z INFO 36ed3232-caf0-442b-990e-01d0c0506f1a (dropshot_internal): Early exit: Rack already initialized
    resource = Vpc { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000001, lookup_type: ById(001de000-5110-4000-8000-000000000001) }, key: 001de000-4401-4000-8000-000000000000, lookup_type: ById(001de000-4401-4000-8000-000000000000) }, key: 001de000-074c-4000-8000-000000000000, lookup_type: ById(001de000-074c-4000-8000-000000000000) }
    resource = Vpc { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000001, lookup_type: ById(001de000-5110-4000-8000-000000000001) }, key: 001de000-4401-4000-8000-000000000000, lookup_type: ById(001de000-4401-4000-8000-000000000000) }, key: 001de000-074c-4000-8000-000000000000, lookup_type: ById(001de000-074c-4000-8000-000000000000) }
    resource = Vpc { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000001, lookup_type: ById(001de000-5110-4000-8000-000000000001) }, key: 001de000-4401-4000-8000-000000000000, lookup_type: ById(001de000-4401-4000-8000-000000000000) }, key: 001de000-074c-4000-8000-000000000000, lookup_type: ById(001de000-074c-4000-8000-000000000000) }
    resource = VpcSubnet { parent: Vpc { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000001, lookup_type: ById(001de000-5110-4000-8000-000000000001) }, key: 001de000-4401-4000-8000-000000000000, lookup_type: ById(001de000-4401-4000-8000-000000000000) }, key: 001de000-074c-4000-8000-000000000000, lookup_type: ByName("oxide-services") }, key: 001de000-c470-4000-8000-000000000001, lookup_type: ByName("external-dns") }
    resource = VpcSubnet { parent: Vpc { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000001, lookup_type: ById(001de000-5110-4000-8000-000000000001) }, key: 001de000-4401-4000-8000-000000000000, lookup_type: ById(001de000-4401-4000-8000-000000000000) }, key: 001de000-074c-4000-8000-000000000000, lookup_type: ByName("oxide-services") }, key: 001de000-c470-4000-8000-000000000001, lookup_type: ByName("external-dns") }
    resource = VpcSubnet { parent: Vpc { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000001, lookup_type: ById(001de000-5110-4000-8000-000000000001) }, key: 001de000-4401-4000-8000-000000000000, lookup_type: ById(001de000-4401-4000-8000-000000000000) }, key: 001de000-074c-4000-8000-000000000000, lookup_type: ByName("oxide-services") }, key: 001de000-c470-4000-8000-000000000002, lookup_type: ByName("nexus") }
    resource = VpcSubnet { parent: Vpc { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000001, lookup_type: ById(001de000-5110-4000-8000-000000000001) }, key: 001de000-4401-4000-8000-000000000000, lookup_type: ById(001de000-4401-4000-8000-000000000000) }, key: 001de000-074c-4000-8000-000000000000, lookup_type: ByName("oxide-services") }, key: 001de000-c470-4000-8000-000000000002, lookup_type: ByName("nexus") }
    resource = AddressLot { parent: Fleet, key: a77d92ed-7ff7-4794-9411-257076800abe, lookup_type: ByName("initial-infra") }
    resource = LoopbackAddress { parent: Fleet, key: 440dc262-4948-4e3f-980c-c30527e582bc, lookup_type: ByCompositeId("address = V6(Ipv6Network { addr: fd00:99::1, prefix: 64 }), rack_id = 0482465f-ee67-48a7-a18f-874879408e14, switch_location = \"switch0\"") }
    error_message_external = address unavailable
    error_message_internal = address unavailable
    response_code = 400

The "address unavailable" error appears to come from the attempt to assign the anycast address fd00:99::1 to switch1 when it is already in use on switch0. Sharing an anycast address between the two switches is legitimate, but the current reservation logic treats the second assignment as a conflict:

root@[fd00:1122:3344:108::3]:32221/omicron> select first_address, last_address from address_lot_block;
  first_address | last_address
----------------+----------------
  172.20.15.21  | 172.20.15.22
  fd00:99::1    | fd00:99::ffff
(2 rows)

Time: 2ms total (execution 2ms / network 0ms)

root@[fd00:1122:3344:108::3]:32221/omicron> select first_address, last_address from address_lot_rsvd_block;
  first_address | last_address
----------------+---------------
  fd00:99::1    | fd00:99::1
  172.20.15.21  | 172.20.15.21

I manually deleted the fd00:99::1 row from the reserved block table (address_lot_rsvd_block), after which RSS completed.
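
To illustrate the behaviour described above, here is a minimal Rust sketch of a reservation check that lets an anycast address be held by more than one switch while still rejecting a second reservation of a unicast address. It is not Omicron's code and not necessarily what the eventual fix does: `Reservations`, `ReserveError`, `reserve`, and the explicit `anycast` flag are all hypothetical names introduced for this example.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::net::IpAddr;

/// Hypothetical in-memory stand-in for the reserved-address table.
/// Maps an address to (is_anycast, switches currently holding it).
#[derive(Default)]
struct Reservations {
    held: HashMap<IpAddr, (bool, Vec<String>)>,
}

/// Hypothetical error type mirroring the "address unavailable" message.
#[derive(Debug, PartialEq)]
enum ReserveError {
    AddressUnavailable,
}

impl Reservations {
    /// Reserve `addr` for `switch`. An address that was first reserved as
    /// anycast may be reserved again as anycast by another switch; any other
    /// repeat reservation is rejected.
    fn reserve(
        &mut self,
        addr: IpAddr,
        switch: &str,
        anycast: bool,
    ) -> Result<(), ReserveError> {
        match self.held.entry(addr) {
            // First reservation of this address always succeeds.
            Entry::Vacant(slot) => {
                slot.insert((anycast, vec![switch.to_string()]));
                Ok(())
            }
            Entry::Occupied(mut slot) => {
                let (is_anycast, switches) = slot.get_mut();
                if *is_anycast && anycast {
                    // Shared anycast address: record the extra holder once.
                    if !switches.iter().any(|s| s == switch) {
                        switches.push(switch.to_string());
                    }
                    Ok(())
                } else {
                    // Unicast address already taken: this is a real conflict.
                    Err(ReserveError::AddressUnavailable)
                }
            }
        }
    }
}
```

With this sketch, reserving fd00:99::1 as anycast for "switch0" and then for "switch1" succeeds both times, whereas reserving 172.20.15.21 as unicast a second time returns `ReserveError::AddressUnavailable`.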

internet-diglett commented 1 year ago

Closed by https://github.com/oxidecomputer/omicron/pull/3626