oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
239 stars 36 forks source link

Nexus busted after its initial startup raced with another Nexus populating "system" VpcRouter #5980

Open davepacheco opened 2 months ago

davepacheco commented 2 months ago

Here's the CI failure: https://github.com/oxidecomputer/omicron/pull/5962/checks?check_run_id=26826135569

This PR is non-trivial and it's conceivable that my changes introduced this but I don't yet see how. The helios-deploy job failed in the usual way (timing out after 600s trying to log in):

2024-06-28 22:19:52.051692811 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: error:0A000419:SSL routines:ssl3_read_bytes:tlsv1 alert access denied:ssl/record/rec_layer_s3.c:1605:SSL alert number 49
Error: logging in

Caused by:
    timed out after 600.598279588s

That TLS error can reflect that there are no certificates, and so a problem with the "external endpoints" task. I went to the Nexus logs to look for errors and found that one Nexus failed some of the first-time-setup steps:

2024-06-28T22:10:18.588Z    ERRO    nexus (DataLoader): gave up trying to populate built-in PopulateBuiltinVpcs
error_message = InternalError { internal_message: "Unknown diesel error creating VpcRouter called \\"system\\": Record not found" }
file = nexus/src/populate.rs:126
2024-06-28T22:10:18.588Z    ERRO    nexus: populate failed
file = nexus/src/app/mod.rs:528
2024-06-28T22:10:18.588Z    ERRO    nexus: saga request channel closed!
file = nexus/src/app/mod.rs:544

One possibility here is that before my PR, we would still have run the background tasks that eventually configure TLS, whereas now we won't. But at best the old code would have been papering over this problem.

davepacheco commented 2 months ago

The Unknown diesel error creating VpcRouter called \\"system\\ error is coming from here: https://github.com/oxidecomputer/omicron/blob/e514bc8262f404f1d28b7b5e3ef39712dccb81f2/nexus/db-queries/src/db/error.rs#L248

and the cause is a NotFound error that's raised when we use something like get_result to execute the query, which we are doing here: https://github.com/oxidecomputer/omicron/blob/e514bc8262f404f1d28b7b5e3ef39712dccb81f2/nexus/db-queries/src/db/datastore/vpc.rs#L1090

The stack is:

It's a little surprising for an INSERT to return NotFound. But we are expecting a row because we're using get_result_async(). And we're expecting that row because we're using INSERT ... RETURNING ..., which returns the newly-inserted row. That we got 0 rows here tells us that we didn't insert it. That can only happen for this query if there was another row with the same id. This is a totally normal and expected condition if some other Nexus already went through the populate step. Now, we already have tests for populators being idempotent, so my guess is that normally if this has already been inserted, then we won't make it this far into load_builtin_vpcs(), and what actually happened here was a race. I'll look for more evidence of that.

davepacheco commented 2 months ago

Yeah, I think we have Nexus instances racing. Here's the point in three different Nexus logs where we started this populator:

The first of these Nexus instances went on to the next populate step, so it must have succeeded. The second and third ones both gave up.

The timeline is:

2024-06-28T22:10:18.480Z Nexus 1: "attempting to create built-in VPCs"
2024-06-28T22:10:18.504Z Nexus 2: "attempting to create built-in VPCs"
2024-06-28T22:10:18.510Z Nexus 3: "attempting to create built-in VPCs"
2024-06-28T22:10:18.584Z Nexus 2 reports the "not found" error
2024-06-28T22:10:18.588Z Nexus 3 reports the "not found" error
2024-06-28T22:10:18.664Z Nexus 1: "attempting to create built-in VPC firewall rules", which implies it had finished creating the built-in VPCs

This is not a smoking gun (I don't think that's possible without more fine-grained logging from the VPC populate steps) but it's consistent with the explanation that these raced with each other.

In this case, prior to #5962, the impact would be:

I think there are three potential takeaways here:

davepacheco commented 2 months ago

I found another example where we do the ON CONFLICT SET id = id for this exact reason (I think! There's not much explanation in the comment.): https://github.com/oxidecomputer/omicron/blob/653964be121425c55955a6e4f58d9e1b8cda83da/nexus/db-queries/src/db/datastore/rack.rs#L236-L237