Open davepacheco opened 2 years ago
The key error seems to be:
Unknown diesel error creating Dataset called "7b3fdf75-02ea-43d8-a29e-5f6e8d590a8c": Failed to find a type oid for `dataset_kind`
This looks like a failure while setting up the sled agent server.
The string Failed to notify Nexus about new Dataset
is unique, and comes from sim::Storage::insert_dataset
:
This function is public, but it has only one callsite:
Which is almost certainly being invoked by: https://github.com/oxidecomputer/omicron/blob/0b118032d8d6852d204a73fc6924589d5a08a7e3/nexus/test-utils/src/resource_helpers.rs#L350-L376
The caller side seems "correct" here - this code path is doing what it's supposed to do, AFAICT.
However, that error message implies to me that CRDB is either:
Here's the diesel error:
dataset_kind
is indeed a custom type, initialized here:
And managed in Rust as this type:
Due to the "copy-from-seed-db" behavior, which should be initialized with the dataset_kind
type already created, I'm not really able to hypothesize a scenario where the request to nexus races with initialization of CRDB (and thus, risks using dataset_kind
before CREATE TYPE
was invoked).
I ran into something similar again:
running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED
failures:
---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'Nexus never loaded users: Internal Error: unexpected database error: Failed to find a type oid for `identity_type`', /home/dap/omicron/nexus/test-utils/src/lib.rs:113:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.6246/.tmplxWdo6
failures:
integration_tests::disks::test_disk_create_disk_that_already_exists_fails
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 3.25s
$ bunyan "/dangerzone/omicron_tmp/try_repro.6246/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.7698.0.log"
[2022-06-19T09:26:25.97193573Z] INFO: test_disk_create_disk_that_already_exists_fails/7698 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.6246/.tmplxWdo6
...
[2022-06-19T09:26:28.857351398Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: attempting to create built-in role assignments
[2022-06-19T09:26:28.858183736Z] TRACE: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: authorize begin (resource=Database, action=Query)
actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-06-19T09:26:28.858950249Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: roles (roles="RoleSet { roles: {} }")
[2022-06-19T09:26:28.865577011Z] DEBUG: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: authorize result (result=Ok(()), resource=Database, action=Query)
actor: Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
[2022-06-19T09:26:29.190035668Z] WARN: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: failed to list sagas (will retry after 238.566087ms): Internal Error: Unknown diesel error accessing SagaDbg ById(2bc180c4-f5bf-489e-a75c-02be4911b6d0): Failed to find a type oid for `saga_state`
[2022-06-19T09:26:29.190623132Z] ERROR: 2bc180c4-f5bf-489e-a75c-02be4911b6d0/ServerContext/7698 on ivanova: gave up trying to populate built-in PopulateBuiltinRoleAssignments
error_message: InternalError { internal_message: "unexpected database error: Failed to find a type oid for `identity_type`" }
interesting data point, that the "not found" oid changed (dataset_kind
vs saga_state
)
Both are created ENUM
types, however
And identity_type
too, which is also an ENUM
Yet another one:
ATTEMPT 693
running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED
failures:
---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'failed to make "create" request: expected status code 201 Created, found 500 Internal Server Error', /home/dap/omicron/nexus/test-utils/src/resource_helpers.rs:59:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.7754/.tmpSFxLri
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
failures:
integration_tests::disks::test_disk_create_disk_that_already_exists_fails
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 8.52s
$ bunyan "/dangerzone/omicron_tmp/try_repro.7754/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.27364.0.log"
[2022-06-21T18:48:22.968901968Z] INFO: test_disk_create_disk_that_already_exists_fails/27364 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.7754/.tmpSFxLri
...
_external/27364 on ivanova: request completed (req_id=2ff0999f-343a-4b6d-a1c2-f1b3f9d3e6cf, uri=/organizations/test-org/projects, method=POST, remote_addr=127.0.0.1:55876, local_addr=127.0.0.1:63954, error_message_external="Internal Server Error", response_code=500)
error_message_internal: Unknown diesel error creating VpcRouter called "system": Failed to find a type oid for `vpc_router_kind`
[2022-06-21T18:48:31.405038669Z] INFO: test_disk_create_disk_that_already_exists_fails/external client test context/27364 on ivanova: client received response (status=500)
It seems to be more common lately. That's surprising because I haven't updated my omicron commit. I'm on 095dfb698f9cc6838f968f6bb78df1b1a1fbacfc, which is nearly 236f2ac5bffcac799a12d0cedaa6e0faebd3bb5a (the tip of "main" on May 27). I have recently changed to a custom CockroachDB build, but that was only Friday (after this bug was filed) and all that's changed is using malloc(3c) for the memory allocator.
It's conceivable that this is caused by the same underlying illumos issue described in #1146. I'm not sure how we can know this unless we (1) see this again using a cockroach binary without that fix and grab a core file from CockroachDB, or (2) see this again at all once the bug is fixed.
While trying to repro #1130, I ran into a failure I hadn't seen before and can't find a ticket for:
Log file: