oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
244 stars 38 forks source link

test failure: "Failed to find a type oid for ..." #1219

Open davepacheco opened 2 years ago

davepacheco commented 2 years ago

While trying to repro #1130, I ran into a failure I hadn't seen before and can't find a ticket for:

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.13782/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.17780.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.13782/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.17780.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'Failed to notify Nexus about new Dataset: Error Response', /home/dap/omicron/sled-agent/src/sim/storage.rs:240:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)

failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 5.93s

Log file:

$ bunyan "/dangerzone/omicron_tmp/try_repro.13782/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.17780.0.log"
[2022-06-15T22:22:04.900085563Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA
[2022-06-15T22:22:04.900451226Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach: copying from seed directory (/home/dap/omicron/target/debug/build/nexus-test-utils-308521ed0d0eed98/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA/data)
[2022-06-15T22:22:04.907769128Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA/data,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/try_repro.13782/.tmpSGA0LA/listen-url
[2022-06-15T22:22:06.627376916Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach pid: 17785
[2022-06-15T22:22:06.628304134Z]  INFO: test_disk_create_disk_that_already_exists_fails/17780 on ivanova: cockroach listen URL: postgresql://root@127.0.0.1:64766/omicron?sslmode=disable
...
[2022-06-15T22:22:10.628406846Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/17780 on ivanova: Created Simulated Crucible Server (kind=storage, server=b6d65341-167c-41df-9b5c-41cded99c229, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, address=[::1]:61800)
[2022-06-15T22:22:10.628731621Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/17780 on ivanova: client request (sled_id=b6d65341-167c-41df-9b5c-41cded99c229, body=Some(Body), method=PUT)
    uri: http://127.0.0.1:36197/zpools/2cfecc89-e67a-4b10-b50c-ec9c130187c2/dataset/7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c
[2022-06-15T22:22:10.629066446Z] TRACE: 5eb2887c-0c0e-4a74-87fc-60a737cec391/dropshot_internal/17780 on ivanova: incoming request (req_id=65465297-194a-47b9-97a7-a49519f3fb06, method=PUT, remote_addr=127.0.0.1:52635, local_addr=127.0.0.1:36197)
    uri: /zpools/2cfecc89-e67a-4b10-b50c-ec9c130187c2/dataset/7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c
[2022-06-15T22:22:10.629428382Z]  INFO: 5eb2887c-0c0e-4a74-87fc-60a737cec391/ServerContext/17780 on ivanova: upserting dataset (address=[::1]:61800, dataset_id=7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c, zpool_id=2cfecc89-e67a-4b10-b50c-ec9c130187c2)
[2022-06-15T22:22:10.792935278Z]  INFO: 5eb2887c-0c0e-4a74-87fc-60a737cec391/dropshot_internal/17780 on ivanova: request completed (req_id=65465297-194a-47b9-97a7-a49519f3fb06, method=PUT, remote_addr=127.0.0.1:52635, local_addr=127.0.0.1:36197, error_message_external="Internal Server Error", response_code=500)
    uri: /zpools/2cfecc89-e67a-4b10-b50c-ec9c130187c2/dataset/7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c
    --
    error_message_internal: Unknown diesel error creating Dataset called "7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c": Failed to find a type oid for `dataset_kind`
[2022-06-15T22:22:10.793844962Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/17780 on ivanova: client response (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
    result: Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(36197), path: "/zpools/2cfecc89-e67a-4b10-b50c-ec9c130187c2/dataset/7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c", query: None, fragment: None }, status: 500, headers: {"content-type": "application/json", "x-request-id": "65465297-194a-47b9-97a7-a49519f3fb06", "content-length": "124", "date": "Wed, 15 Jun 2022 22:22:10 GMT"} })
davepacheco commented 2 years ago

The key error seems to be:

Unknown diesel error creating Dataset called "7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c": Failed to find a type oid for `dataset_kind`
smklein commented 2 years ago

This looks like a failure while setting up the sled agent server.

The string Failed to notify Nexus about new Dataset is unique, and comes from sim::Storage::insert_dataset:

https://github.com/oxidecomputer/omicron/blob/0b118032d8d6852d204a73fc6924589d5a08a7e3/sled-agent/src/sim/storage.rs#L223-L241

This function is public, but it has only one callsite:

https://github.com/oxidecomputer/omicron/blob/0b118032d8d6852d204a73fc6924589d5a08a7e3/sled-agent/src/sim/sled_agent.rs#L120-L127

Which is almost certainly being invoked by: https://github.com/oxidecomputer/omicron/blob/0b118032d8d6852d204a73fc6924589d5a08a7e3/nexus/test-utils/src/resource_helpers.rs#L350-L376

The caller side seems "correct" here - this code path is doing what it's supposed to do, AFAICT.

However, that error message implies to me that CRDB is either:

smklein commented 2 years ago

Here's the diesel error:

https://github.com/diesel-rs/diesel/blob/df7ca0ae54a20e4c0d59c13e00d1c07667d61723/diesel/src/pg/backend.rs#L27-L64

dataset_kind is indeed a custom type, initialized here:

https://github.com/oxidecomputer/omicron/blob/0b118032d8d6852d204a73fc6924589d5a08a7e3/common/src/sql/dbinit.sql#L135-L139

And managed in Rust as this type:

https://github.com/oxidecomputer/omicron/blob/0b118032d8d6852d204a73fc6924589d5a08a7e3/nexus/src/db/model/dataset_kind.rs#L10-L23

smklein commented 2 years ago

Due to the "copy-from-seed-db" behavior, which should be initialized with the dataset_kind type already created, I'm not really able to hypothesize a scenario where the request to nexus races with initialization of CRDB (and thus, risks using dataset_kind before CREATE TYPE was invoked).

davepacheco commented 2 years ago

I ran into something similar again:

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'Nexus never loaded users: Internal Error: unexpected database error: Failed to find a type oid for `identity_type`', /home/dap/omicron/nexus/test-utils/src/lib.rs:113:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.6246/.tmplxWdo6

failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 3.25s
$ bunyan "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
[2022-06-19T09:26:25.97193573Z]  INFO: test_disk_create_disk_that_already_exists_fails/7698 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.6246/.tmplxWdo6
...
[2022-06-19T09:26:28.857351398Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: attempting to create built-in role assignments
[2022-06-19T09:26:28.858183736Z] TRACE: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: authorize begin (resource=Database, action=Query)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-06-19T09:26:28.858950249Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: roles (roles="RoleSet { roles: {} }")
[2022-06-19T09:26:28.865577011Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: authorize result (result=Ok(()), resource=Database, action=Query)
    actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-06-19T09:26:29.190035668Z]  WARN: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: failed to list sagas (will retry after 238.566087ms): Internal Error: Unknown diesel error accessing SagaDbg ById(2bc180c4-f5bf-489e-a75c-02be4911b6d0): Failed to find a type oid for `saga_state`
[2022-06-19T09:26:29.190623132Z] ERROR: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: gave up trying to populate built-in PopulateBuiltinRoleAssignments
    error_message: InternalError { internal_message: "unexpected database error: Failed to find a type oid for `identity_type`" }
smklein commented 2 years ago

interesting data point, that the "not found" oid changed (dataset_kind vs saga_state)

smklein commented 2 years ago

Both are created ENUM types, however

smklein commented 2 years ago

And identity_type too, which is also an ENUM

davepacheco commented 2 years ago

Yet another one:

ATTEMPT 693

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'failed to make "create" request: expected status code 201 Created, found 500 Internal Server Error', /home/dap/omicron/nexus/test-utils/src/resource_helpers.rs:59:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.7754/.tmpSFxLri
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)

failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 8.52s
$ bunyan "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
[2022-06-21T18:48:22.968901968Z]  INFO: test_disk_create_disk_that_already_exists_fails/27364 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.7754/.tmpSFxLri
...
_external/27364 on ivanova: request completed (req_id=2ff0999f-343a-4b6d-a1c2-f1b3f9d3e6cf, uri=/organizations/test-org/projects, method=POST, remote_addr=127.0.0.1:55876, local_addr=127.0.0.1:63954, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Unknown diesel error creating VpcRouter called "system": Failed to find a type oid for `vpc_router_kind`
[2022-06-21T18:48:31.405038669Z]  INFO: test_disk_create_disk_that_already_exists_fails/external client test context/27364 on ivanova: client received response (status=500)

It seems to be more common lately. That's surprising because I haven't updated my omicron commit. I'm on 095dfb698f9cc6838f968f6bb78df1b1a1fbacfc, which is nearly 236f2ac5bffcac799a12d0cedaa6e0faebd3bb5a (the tip of "main" on May 27). I have recently changed to a custom CockroachDB build, but that was only Friday (after this bug was filed) and all that's changed is using malloc(3c) for the memory allocator.

davepacheco commented 1 year ago

It's conceivable that this is caused by the same underlying illumos issue described in #1146. I'm not sure how we can know this unless we (1) see this again using a cockroach binary without that fix and grab a core file from CockroachDB, or (2) see this again at all once the bug is fixed.