oxidecomputer / omicron

Omicron: Oxide control plane

bad time installing Omicron over Omicron on Gimlets #1880

Open davepacheco opened 1 year ago

davepacheco commented 1 year ago

This ticket describes several issues that a bunch of us (@smklein, @leftwo, @jmpesp, @bnaecker, @Nieuwejaar, and I) debugged today. We'll file separate tickets for the bugs and enhancements that come out of it, and then I expect we'll close this one. I just wanted a place to record the whole activity. I'm sure I've got some of the details here wrong -- please correct me! I didn't take good enough notes. On the plus side, there's a video recording of most of this process.

Summary: From initial conditions of a working Omicron on gimlet-sn05, @Nieuwejaar reinstalled Omicron. Omicron failed to come up because it failed to wipe the database. The proximate cause looks to us like a CockroachDB bug, which I've filed. There were a few follow-on bugs after this. We also identified some enhancements that would make this easier to debug in the future.

Initial symptoms

Nils reported that after having Omicron running on gimlet-sn05, he tried to install Omicron again and it failed to come up. Sled Agent logged:

{"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2000-01-02T03:10:13.792417073Z","hostname":"gimlet-sn05","pid":133494,"uri":"/filesystem","method":"PUT","req_id":"20dcbaaf-bf96-4e8c-9e83-5d759e2268e8","remote_addr":"[fd00:1122:3344:101::1]:53770","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","error_message_external":"Internal Server Error","error_message_internal":"Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1  stdout: CREATE DATABASE\nCREATE ROLE\n  stderr: ERROR: job-row-insert: duplicate key value violates unique constraint \"primary\"\nSQLSTATE: 23505\nDETAIL: Key (id)=(32769) already exists.\nCONSTRAINT: primary\nERROR: job-row-insert: duplicate key value violates unique constraint \"primary\"\nSQLSTATE: 23505\nDETAIL: Key (id)=(32769) already exists.\nCONSTRAINT: primary\nFailed running \"sql\"\n","response_code":"500"}

The same entry, formatted:

[2000-01-02T03:10:13.792417073Z]  INFO: SledAgent/dropshot (SledAgent)/133494 on gimlet-sn05: request completed (req_id=20dcbaaf-bf96-4e8c-9e83-5d759e2268e8, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:53770, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1  stdout: CREATE DATABASE
    CREATE ROLE
      stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    Failed running "sql"

Initial investigation: RSS/Sled Agent stuck in a loop

This command is part of the dbwipe.sql script that we run while setting up a zone. It's surprising that this would ever fail, since it's intended to be run against an arbitrary database that may or may not have been set up by Omicron already. I believe we even have tests verifying that it's idempotent regardless of whether the database has already been set up. I used omicron-dev db-run and omicron-dev db-wipe to set up a transient CockroachDB cluster and verify that I could wipe it several times (without repopulating it).
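For reference, that clean-slate check was along these lines (a sketch; exactly how db-wipe is pointed at the transient cluster is an assumption -- db-run prints the connection URL it is listening on):

    # terminal 1: start a transient CockroachDB cluster (db-run prints its connection URL)
    cargo run --bin=omicron-dev db-run
    # terminal 2: wipe it repeatedly without repopulating in between; this should be idempotent
    cargo run --bin=omicron-dev db-wipe    # assumed to take the URL printed by db-run
    cargo run --bin=omicron-dev db-wipe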

It's not clear to me how the error message above has anything to do with the SQL we were trying to run. It appears to be complaining about some internal id, not something that was in the query.

Upon logging into gimlet-sn05, we found that there was no CockroachDB zone present. This is where my writeup diverges from the sequence of steps we actually took: it's clearer to present a reconstruction that we put together from the chat log and the sled agent log. I've filtered the raw sled agent log with:

grep -v 'failed to notify nexus' sled-agent.log | grep -v 'contacting server nexus' | grep -v 'failed to notify nexus about datasets'

to eliminate a bunch of messages irrelevant to this issue. My excerpts below are formatted with bunyan -o short.
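Concretely, the excerpts below came from a pipeline along these lines (the filters are the same ones shown above; only the output formatting is added):

    grep -v 'failed to notify nexus' sled-agent.log \
      | grep -v 'contacting server nexus' \
      | grep -v 'failed to notify nexus about datasets' \
      | bunyan -o short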

Nils reported the problem around 8:10am PT, which I calculate to be about 03:27Z on the test system. Then he did an "uninstall" and "install" around 8:27am PT, or 03:44Z on this system. We see the latter in the sled agent log; the former must have been rotated to a previous log file. That's okay, because we really only care about what happened after this second event. Sure enough, we have:

[ Jan  2 03:42:33 Enabled. ]
[ Jan  2 03:42:33 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Jan  2 03:42:33 Method "start" exited with status 0. ]

As Omicron starts up, it eventually goes to set up CockroachDB:

03:44:30.195Z  INFO SledAgent/RSS: creating new filesystem: DatasetEnsureBody { id: 4d08fc19-3d5f-4f6b-9c48-925f8eac7255, zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221 }
03:44:30.204Z  INFO SledAgent/StorageManager: add_dataset: NewFilesystemRequest { zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221, responder: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } }
03:44:30.207Z  INFO SledAgent/StorageManager: Ensuring dataset oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb exists
03:44:30.307Z  INFO SledAgent/StorageManager: Ensuring zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb is running
03:44:30.345Z  INFO SledAgent/StorageManager: Zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb was not found
03:44:30.419Z  INFO SledAgent/StorageManager: Configuring new Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:30.490Z  INFO SledAgent/StorageManager: Installing Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:44.692Z  INFO SledAgent/StorageManager: Zone booting
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:53.389Z  INFO SledAgent/StorageManager: Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::2, prefix: 64 }))
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:54.336Z  INFO SledAgent/StorageManager: start_zone: Loading CRDB manifest
03:44:55.060Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/listen_addr: [fd00:1122:3344:101::2]:32221
03:44:55.170Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/store
03:44:55.275Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/join_addrs
03:44:55.384Z  INFO SledAgent/StorageManager: start_zone: refreshing manifest
03:44:55.487Z  INFO SledAgent/StorageManager: start_zone: enabling CRDB service
03:44:55.577Z  INFO SledAgent/StorageManager: start_zone: awaiting liveness of CRDB
03:44:55.599Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:55.955Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:56.596Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:57.223Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:59.661Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:45:04.015Z  INFO SledAgent/StorageManager: CRDB is online
03:45:04.030Z  INFO SledAgent/StorageManager: Formatting CRDB
03:45:09.661Z  INFO SledAgent/StorageManager: halt_and_remove_logged: Previous zone state: Running
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.674Z  INFO SledAgent/StorageManager: Stopped and uninstalled zone
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.704Z  INFO SledAgent/dropshot (SledAgent): request completed (req_id=b12c1590-f9c9-43e8-a9d0-761ae0b93e0a, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:50697, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1  stdout: CREATE DATABASE
    CREATE ROLE
      stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    Failed running "sql"

03:45:09.727Z  WARN SledAgent/RSS: failed to create filesystem
    error: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a", "content-length": "124", "date": "Sun, 02 Jan 2000 03:45:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a" }

Here, RSS decides to create a dataset for CockroachDB. Sled Agent receives the request and decides to create a zone for it. The zone boots, we wait for CockroachDB to be up, and then we send the "wipe" SQL and get this error.

We go through the whole sequence again starting at 03:45:10.070Z, then 03:45:49.558Z, then 03:46:29.764Z, then 03:47:16.045Z, and so on. I think we're in an exponential backoff. Each time, RSS decides to create a dataset, Sled Agent creates the zone, we get this error from CockroachDB, and then Sled Agent tears down the zone. This explains why there was no zone present when we logged in.

Attempting to reproduce the problem elsewhere

This felt like the sort of bug where we really wanted to interact with a live, broken database to investigate its state.

As mentioned above, we had already tried to reproduce this problem from a clean slate via a sequence of "populate" and "wipe" operations against the database. That didn't work. We thought maybe there was some broken on-disk state, so @jmpesp did a "zfs send" of the underlying ZFS dataset over to atrium, where he spun up an instance of CockroachDB to see if we could reproduce the problem. We did not reproduce it, though he did see some other strange behavior that I'll let him comment on, since I don't have the details.

Much later, I tried the same thing. Starting from a point where we had a CockroachDB zone running on gimlet-sn21 in a state that could fairly reliably generate this error, I shut down CockroachDB. I brought it up and verified that I still saw the error. Then I shut it down again and took a snapshot. I sent this over to my test machine, ivanova, and used cargo run --bin=omicron-dev db-run --store-dir /rpool/crdb-90803-repro/ --no-populate to start up a CockroachDB instance on it. I was not able to reproduce the error, even running the exact same SQL that did reproduce the problem over on gimlet-sn21. This seems like a significant data point but I don't yet know what it means.
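For anyone who wants to repeat that transplant, it was roughly this (a hedged sketch: the snapshot name, the source dataset, and the receive target are placeholders; the final omicron-dev invocation is the one quoted above):

    # on the gimlet, with CockroachDB shut down
    zfs snapshot oxp_<pool-uuid>/cockroachdb@repro
    zfs send oxp_<pool-uuid>/cockroachdb@repro | ssh ivanova zfs recv rpool/crdb-90803-repro
    # on ivanova, start CockroachDB against the received data without repopulating it
    cargo run --bin=omicron-dev db-run --store-dir /rpool/crdb-90803-repro/ --no-populate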

Attempting to reproduce the problem back on gimlet-sn05

I don't remember exactly why, but we decided to restart sled-agent on gimlet-sn05 because we believed there was nothing more we could learn from the current state and we thought this would trigger the same code path again. Then if we could pause things at the point where Sled Agent tried to halt the busted zone, we could log in and investigate. (I can't remember if I have the timeline right here -- but if this is when it was, then there was no CockroachDB zone running at this point and I'm not sure we realized this retry process was in exponential backoff. I think maybe I thought it had given up.) From this SMF log entry:

[ Jan  2 04:53:06 Method "start" exited with status 0. ]

I think we did this around 04:53Z. When we did, we ran into a different problem: the StorageManager inside Sled Agent would find the CockroachDB ZFS dataset and attempt to load information about it from /var/oxide, but there was nothing there:

04:54:04.634Z  INFO SledAgent/StorageManager: StorageWorker loading fs cockroachdb on zpool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
04:54:04.681Z  INFO SledAgent/StorageManager: Loading Dataset from /var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml
04:54:04.696Z  WARN SledAgent/StorageManager: StorageWorker Failed to load dataset: Failed to perform I/O: read config for pool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset DatasetName { pool_name: "oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b", dataset_name: "cockroachdb" } from "/var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml": No such file or directory (os error 2)

We'll file a separate bug with more details on this. The net result was that Sled Agent was not able to set up the CockroachDB zone. We'd expect that RSS would have been telling it to create that zone at this point; having found the existing dataset, Sled Agent would have tried to use the one it already had, and in that path it would not run dbwipe.sql again. So this didn't trigger the code path we wanted.
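The mismatch is easy to see on a live system by comparing the datasets that exist against the per-dataset config files StorageManager expects (a sketch, using the pool UUID from the log above):

    # the CockroachDB dataset is present on the pool...
    zfs list -r oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
    # ...but the TOML that StorageManager tries to load is not
    ls /var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/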

A bigger hammer

Again, we're still trying to catch a live system in a state where we can poke around and reproduce the SQL error. At this point, Nils observed that rerunning omicron-package install seemed to trigger the problem somewhat reliably, so we decided to do that again. This would blow away a lot more state than what we'd been doing so far, but what we'd been doing so far wasn't working.

I used this DTrace enabling to try to stop every "zoneadm" command run against the zone in question:

dtrace -w -n 'exec-success/execname == "zoneadm" && strstr(curpsinfo->pr_psargs, "oxz_cockroach") != 0/{ trace(pid); stop(); system("pargs %d", pid); exit(0); }'
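The same enabling spread across lines, with comments (functionally identical to the one-liner above; -w is required because stop() and system() are destructive actions):

    dtrace -w -n '
        exec-success                                            /* a process finished exec(2)... */
        /execname == "zoneadm" &&
         strstr(curpsinfo->pr_psargs, "oxz_cockroach") != 0/    /* ...and it is zoneadm operating on the cockroach zone */
        {
            trace(pid);                 /* record the pid */
            stop();                     /* freeze zoneadm so the zone is not torn down underneath us */
            system("pargs %d", pid);    /* print its full arguments */
            exit(0);                    /* one-shot: exit after the first match */
        }'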

Some notes:

We got this far and were able to dig in a little deeper. We wanted to get Nils unblocked, though, so we used the same omicron-package install step to reproduce this on gimlet-sn21 and released gimlet-sn05 back to Nils.

The behavior we found at this point seemed like it couldn't be correct CockroachDB behavior so we filed cockroachdb/cockroach#90803, which summarizes what we saw. This is when I tried the steps mentioned above trying to reproduce on ivanova.

We didn't dig too much further. We had the idea that maybe what's going on is unrelated to our query: perhaps this query triggers an internal CockroachDB job, and something about that mechanism is causing it to generate a bogus id. We poked a bit with "show jobs" and the "jobs" table in the "system" database. I wanted to explore a bit more with the web console, but we ran into cockroachdb/cockroach#79657.

Issues found

Bugs (possibly debatable):

And debugging enhancements:

davepacheco commented 1 year ago

Here is the sled-agent.log from gimlet-sn05.

leftwo commented 1 year ago

I attempted to reproduce this on folgers (a lab system, not a gimlet) using the same Omicron that is installed on sn21, and have not been able to. folgers is running:

root@folgers:~# uname -a
SunOS folgers 5.11 xde-0-g03536bce00 i86pc i386 i86pc

with:

root@folgers:~# /root/alan/cpuid | grep Name   
Processor Name: AMD EPYC 7413 24-Core Processor
davepacheco commented 1 year ago

There was also some question about whether oxidecomputer/stlouis#185 was in play here. As I understand it: the kernel on sn21 did not have this fix, but @wesolows had applied a workaround to the running system, and the zpools were recreated after that point, so we don't believe that's related. (As far as I know, this could have been in play on sn05, because it predates the fix and I don't know whether the workaround was applied there.)

leftwo commented 1 year ago

Meeting recording: https://drive.google.com/file/d/1_Ds2lbqD-lYc4oR2svn2GXFl-wQNr11x/view?usp=sharing
Meeting chat log: https://drive.google.com/file/d/1R0GcWNFEnGTGHfmD6K6yU8tXxPL-qMLp/view?usp=sharing

leftwo commented 1 year ago

I was able to reproduce https://github.com/cockroachdb/cockroach/issues/90803 on sn21 after updating the ramdisk with the fix for https://github.com/oxidecomputer/stlouis/issues/185. We were pretty sure this was not another case of that, but now we are even more sure!

My Omicron version was 4be42b45142f58799c6a5444370bbdd75c0340de.

To reproduce (on gimlet-sn21), after already copying over your omicron install files:

1. ./omicron-package uninstall
2. zpool status -- make a note of which pools you have that start with oxp and which disk they are attached to.
3. Remove any oxp_ zpools. Note: don't remove rpool or any other non-omicron pools. If in doubt, ask me.
4. Re-create the zpools you just destroyed (a sketch of the zpool commands follows below).
5. ./omicron-package install
6. Wait for all zones to come online. Then, install again: ./omicron-package install
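A hedged sketch of steps 2-4 (the pool names and disks are whatever your own zpool status output shows; nothing here is specific to sn21):

    zpool status                            # note each oxp_* pool and the disk backing it
    zpool destroy oxp_<pool-uuid>           # repeat for every oxp_* pool; leave rpool and other non-omicron pools alone
    zpool create oxp_<pool-uuid> <disk>     # recreate each pool on the same disk it was on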

leftwo commented 1 year ago

I did some 900 loops of ./omicron-package install on folgers and never saw the cockroach issue like I do on gimlet-sn21.
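For reference, a loop along these lines would do it (a sketch, not necessarily the exact invocation used; how you check for the failure between iterations is up to you):

    i=0
    while [ "$i" -lt 900 ]; do
        ./omicron-package install
        # check the sled agent log / cockroachdb zone here for the dbwipe failure
        i=$((i + 1))
    done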

davepacheco commented 1 year ago

I put a copy of (what I believe is) the ZFS snapshot of the busted CockroachDB zone into catacomb:/data/staff/core/issues/omicron-1880.