oxidecomputer / omicron

Omicron: Oxide control plane

Failure to grant database privileges in CockroachDB on startup #1304

Open bnaecker opened 2 years ago

bnaecker commented 2 years ago

I saw an internal server error installing the control plane this evening. I built and installed Omicron using omicron-package, from a local commit merging #1298 into the current main, which is at 154a4a6cd623497cbe893c2cd3ea1c241516bc21. From a fresh reboot I:

Looking at the sled agent logs, I saw:

[2022-06-29T03:33:29.08339844Z]  INFO: SledAgent/StorageManager/6346 on feldspar: CRDB is online
[2022-06-29T03:33:29.084297094Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB
[2022-06-29T03:33:35.983551762Z]  INFO: SledAgent/StorageManager/6346 on feldspar: halt_and_remove_logged: Previous zone state: Running
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:35.986214134Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Stopped and uninstalled zone
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:35.993636362Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 327.794992ms
[2022-06-29T03:33:35.994011147Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus, will retry in 1.07985318s
[2022-06-29T03:33:35.994340186Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 471.061865ms
[2022-06-29T03:33:35.994666535Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus about datasets, will retry in 488.023497ms
[2022-06-29T03:33:35.995017057Z]  WARN: SledAgent/StorageManager/6346 on feldspar: failed to notify nexus, will retry in 925.598032ms
[2022-06-29T03:33:35.995311707Z]  INFO: SledAgent/dropshot (SledAgent)/6346 on feldspar: request completed (req_id=ed6c07d9-c91a-44ec-aa51-d7948d830391, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:37061, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbinit.sql] executed and failed with status: exit status: 1. Stdout: CREATE DATABASE
    CREATE ROLE
    GRANT
    CREATE TABLE
    CREATE TABLE
    CREATE INDEX
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE TABLE
    CREATE INDEX
    CREATE TYPE
    CREATE TABLE
    CREATE INDEX
    CREATE INDEX
    CREATE TABLE
    , Stderr: NOTICE: GRANT INSERT, SELECT, UPDATE, DELETE ON DATABASE is deprecated.
    This statement was automatically converted to USE omicron; ALTER DEFAULT PRIVILEGES GRANT INSERT, SELECT, UPDATE, DELETE ON TABLES TO omicron;
    Please use ALTER DEFAULT PRIVILEGES going forward
    ERROR: driver: bad connection
    ERROR: driver: bad connection
    Failed running "sql"

[2022-06-29T03:33:35.996129256Z]  WARN: SledAgent/RSS/6346 on feldspar: failed to create filesystem (error="Error Response")
[2022-06-29T03:33:36.64487338Z]  INFO: SledAgent/RSS/6346 on feldspar: creating new filesystem: DatasetEnsureBody { id: 4d08fc19-3d5f-4f6b-9c48-925f8eac7255, zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221 }
[2022-06-29T03:33:36.647081411Z]  INFO: SledAgent/StorageManager/6346 on feldspar: add_dataset: NewFilesystemRequest { zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221, responder: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } }
[2022-06-29T03:33:36.647918134Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Ensuring dataset oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb exists
[2022-06-29T03:33:36.66757923Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Ensuring zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb is running
[2022-06-29T03:33:36.679000179Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb was not found
[2022-06-29T03:33:36.695774353Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Configuring new Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:36.715740835Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Installing Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:39.927704567Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Zone booting
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:41.969687215Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::2, prefix: 64 }))
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
[2022-06-29T03:33:42.536102317Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: Loading CRDB manifest
[2022-06-29T03:33:42.905934849Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/listen_addr: [fd00:1122:3344:101::2]:32221
[2022-06-29T03:33:42.94067554Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/store
[2022-06-29T03:33:42.971693725Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: setting CRDB's config/join_addrs
[2022-06-29T03:33:42.998374003Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: refreshing manifest
[2022-06-29T03:33:43.023893849Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: enabling CRDB service
[2022-06-29T03:33:43.047899489Z]  INFO: SledAgent/StorageManager/6346 on feldspar: start_zone: awaiting liveness of CRDB
[2022-06-29T03:33:43.05261359Z]  WARN: SledAgent/6346 on feldspar: failed to contact nexus, will retry in 1.316873734s
[2022-06-29T03:33:43.052927673Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:43.251247719Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:43.520264543Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:44.370186624Z]  INFO: SledAgent/6346 on feldspar: contacting server nexus, registering sled: fb0f7546-4d46-40ca-9d56-cbb810684ca7
[2022-06-29T03:33:44.472806569Z]  WARN: SledAgent/StorageManager/6346 on feldspar: cockroachdb not yet alive
[2022-06-29T03:33:45.822259473Z]  INFO: SledAgent/StorageManager/6346 on feldspar: CRDB is online
[2022-06-29T03:33:45.823833209Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB
[2022-06-29T03:33:58.271956658Z]  INFO: SledAgent/StorageManager/6346 on feldspar: Formatting CRDB - Completed

The relevant bit there is the long list of database permissions. The sled agent starts up CRDB, then runs dbinit.sql to initialize the database and schema. That includes a statement granting permissions to operate on the tables in the database. That syntax appears outdated as of https://github.com/cockroachdb/cockroach/issues/73065. The sled agent apparently gets a non-zero exit code from the subprocess running that SQL file, since it tears down the zone and tries again. The second attempt succeeds. I'm not sure why there would be a difference.

We should probably adopt the recommended action, which is to use ALTER DEFAULT PRIVILEGES instead.
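
For reference, the deprecation notice in the error output above already spells out the conversion. The original statement in dbinit.sql is presumably along the lines of the first form below (a sketch reconstructed from the notice, not an exact quote of dbinit.sql), and the second form is the conversion CockroachDB reports applying:

    -- Deprecated form, roughly what dbinit.sql runs today (per the notice above):
    GRANT INSERT, SELECT, UPDATE, DELETE ON DATABASE omicron TO omicron;

    -- Recommended replacement, as given verbatim in the notice:
    USE omicron;
    ALTER DEFAULT PRIVILEGES GRANT INSERT, SELECT, UPDATE, DELETE ON TABLES TO omicron;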

smklein commented 2 years ago

A similar change broke the CLI GitHub Action, because CRDB was not pinned to a specific version: https://github.com/oxidecomputer/cli/pull/203

While I think we can - and should - update the SQL statement, I don't think we updated the revision of CRDB being used by Omicron recently. Do you have any idea why this failure would occur, if we're still using a months-old version of cockroach?

smklein commented 2 years ago

FWIW, on Linux I upgraded my rev of CRDB to 22.1.2 and saw this issue. But this required me to explicitly modify ./tools/cockroach_version and the corresponding checksum.

davepacheco commented 2 years ago

The error message implies that the syntax change was handled by Cockroach without issue. It looks to me like the real problem is:

    ERROR: driver: bad connection
    ERROR: driver: bad connection
    Failed running "sql"

davepacheco commented 2 years ago

@bnaecker is it possible that cockroach died during this phase? You might check the stdout/stderr and the log files it writes to. Alternatively, you could look at the SMF log for CockroachDB to see if SMF restarted it due to processes exiting.
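
Something like the following from the global zone would find the service and its log file, assuming the zone still exists (these are stock illumos commands; the exact FMRI is discovered with svcs rather than assumed):

    # List services in the zone and find the CockroachDB one:
    zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b svcs -a | grep -i cockroach
    # Print the path of that service's SMF log file (substitute the FMRI found above):
    zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b svcs -L '<fmri>'
    # svcs -xv will also note restarts or a service left in maintenance:
    zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b svcs -xv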

bnaecker commented 2 years ago

@davepacheco I noticed that on re-reading the error message. It does seem like two distinct failures, since the GRANT usage is deprecated, not incorrect. I'm not sure the SMF log still exists, but I'll try to find it.

bnaecker commented 2 years ago

@davepacheco I'm actually not sure we have Cockroach's logfiles at this point. When sled-agent noticed this failure, it tore down the zone and tried again. I think that removes any extant state of the zone, including whatever log files Cockroach was writing, right?

davepacheco commented 2 years ago

Reopening this because I don't think #1305 fixed the real issue here.

davepacheco commented 2 years ago

> @davepacheco I'm actually not sure we have Cockroach's logfiles at this point. When sled-agent noticed this failure, it tore down the zone and tried again. I think that removes any extant state of the zone, including whatever log files Cockroach was writing, right?

If it destroys the ZFS dataset that was used for the CockroachDB files, then yes, it would have destroyed the evidence we'd want for further investigation here.

I'd suggest that if zone setup fails and we want to retry it (which I'm not sure is a good idea, but might be), we should at least archive the on-disk state so we can debug problems like this (which will of course happen in the field too). We may also want to save core files of any processes running in the zone, or at least the ones of interest to us. @smklein I don't know enough about how this works today -- would that be a reasonable issue to file?
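
One possible shape for that, as a sketch only (not something sled-agent does today; the dataset and zone names are taken from the log above):

    # Before tearing down the failed zone, snapshot its CockroachDB dataset so the
    # data directory and logs survive for debugging:
    zfs snapshot oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb@failed-setup
    # Save cores from any cockroach processes still running in the zone:
    for pid in $(pgrep -z oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b cockroach); do
        gcore -o /var/tmp/cockroach.core "$pid"
    done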