oxidecomputer / omicron

Omicron: Oxide control plane

test hung after CockroachDB died during initialization #1224

Open davepacheco opened 2 years ago

davepacheco commented 2 years ago

In #1223, I found the test hung after CockroachDB died on SIGSEGV.

For context, I was running the try_repro.sh script to try to reproduce #1130. That script runs (among other things) ./target/debug/deps/test_all-d586ea57740e3382 test_disk_create_disk_that_already_exists_fails in a loop. This was on commit 095dfb698f9cc6838f968f6bb78df1b1a1fbacfc, which is not on "main" but is close to 236f2ac5bffcac799a12d0cedaa6e0faebd3bb5a on "main".

When I found the problem, the test was hung, having emitted only this at the end:

ATTEMPT 246

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails has been running for over 60 seconds

The punchline is that we were stuck during test setup waiting for Oximeter to register with Nexus. The registration happens in an infinite backoff loop (that is, an unbounded number of tries, not an unbounded delay), and every attempt was failing because the database was down.
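
To make the failure mode concrete, here is a minimal sketch of that kind of loop (not the actual Omicron code; register_with_nexus is a hypothetical stand-in for the real registration request):

use std::time::Duration;

// Hypothetical stand-in for Oximeter's registration request to Nexus.
// In this hang, it kept returning an error because the database was down.
async fn register_with_nexus() -> Result<(), String> {
    Err("503 Service Unavailable".to_string())
}

// "Infinite backoff" here means an unbounded number of tries with a bounded
// per-try delay. If the failure is permanent (CockroachDB is gone), this
// loop never returns and test setup hangs forever.
async fn register_until_success() {
    let mut delay = Duration::from_millis(250);
    loop {
        if register_with_nexus().await.is_ok() {
            return;
        }
        tokio::time::sleep(delay).await;
        // Exponential backoff with a cap: the cap bounds the delay between
        // attempts, not the number of attempts.
        delay = std::cmp::min(delay * 2, Duration::from_secs(30));
    }
}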

davepacheco commented 2 years ago

The entire temp directory (including the test log, CockroachDB data directory, Clickhouse directory, etc.) is attached to #1223. Here are the processes that were running:

21113  /bin/bash ./try_repro.sh
  26037  ../target/debug/deps/test_all-d586ea57740e3382 test_disk_create_disk_that_alrea
    26042  <defunct>
    26046  clickhouse server --log-file /dangerzone/omicron_tmp/try_repro.21113/.tmpUduhcD

pid 26042 is the CockroachDB process that exited on SIGSEGV. So why is the test program hung?

The test we're running starts like this:

#[nexus_test]
async fn test_disk_create_disk_that_already_exists_fails(
    cptestctx: &ControlPlaneTestContext,
) {
    let client = &cptestctx.external_client;
    DiskTest::new(&cptestctx).await;
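
For context, the attribute on that test is roughly equivalent to the following (a rough sketch only, based on my read; the real expansion comes from the nexus-test-utils-macros crate and differs in detail):

// Sketch of what #[nexus_test] conceptually expands to; signatures are
// approximate, not copied from the macro crate.
#[tokio::test]
async fn test_disk_create_disk_that_already_exists_fails() {
    // Bring up the whole control plane for the test: CockroachDB,
    // Clickhouse, Nexus, a simulated sled agent, Oximeter, ...
    let cptestctx = nexus_test_utils::test_setup(
        "test_disk_create_disk_that_already_exists_fails",
    )
    .await;

    // ... the original test body runs here against &cptestctx ...

    cptestctx.teardown().await;
}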

As the sketch suggests, the #[nexus_test] macro causes us to invoke nexus_test_utils::test_setup, and I believe we're hung inside there. I'll get to that in a minute. The last entries leading up to the problem are these:

[2022-06-16T22:45:19.62853269Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/ServerContext/26037 on ivanova: created 1 silo user role assignments
[2022-06-16T22:45:19.629129481Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: setting up sled agent server (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.676782022Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: created simulated sled agent (server=b6d65341-167c-41df-9b5c-41cded99c229, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, sim_mode=Explicit)
[2022-06-16T22:45:19.681422368Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/disks/{disk_id}, method=PUT)
[2022-06-16T22:45:19.681904141Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/disks/{disk_id}/poke, method=POST)
[2022-06-16T22:45:19.682314018Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/instances/{instance_id}, method=PUT)
[2022-06-16T22:45:19.682701844Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/instances/{instance_id}/poke, method=POST)
[2022-06-16T22:45:19.683034514Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/update, method=POST)
[2022-06-16T22:45:19.683373236Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: listening (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.68366514Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: successfully registered DTrace USDT probes (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.683941303Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: contacting server nexus (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.684179965Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: client request (sled_id=b6d65341-167c-41df-9b5c-41cded99c229, body=Some(Body), method=POST)
    uri: http://127.0.0.1:51560/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229
[2022-06-16T22:45:19.690898932Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: accepted connection (local_addr=127.0.0.1:51560, remote_addr=127.0.0.1:49049)
[2022-06-16T22:45:19.69231185Z] TRACE: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: incoming request (req_id=786bab2b-6b7e-4480-854d-fd403ed6283f, uri=/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229, method=POST, remote_addr=127.0.0.1:49049, local_addr=127.0.0.1:51560)
[2022-06-16T22:45:19.692820174Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/ServerContext/26037 on ivanova: registered sled agent (sled_uuid=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.736664225Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: request completed (req_id=786bab2b-6b7e-4480-854d-fd403ed6283f, uri=/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229, method=POST, remote_addr=127.0.0.1:49049, local_addr=127.0.0.1:51560, response_code=204)
[2022-06-16T22:45:19.737492786Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: client response (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
    result: Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(51560), path: "/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229", query: None, fragment: None }, status: 204, headers: {"x-request-id": "786bab2b-6b7e-4480-854d-fd403ed6283f", "date": "Thu, 16 Jun 2022 22:45:19 GMT"} })
[2022-06-16T22:45:19.942180224Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: accepted connection (local_addr=127.0.0.1:51560, remote_addr=127.0.0.1:39145)
[2022-06-16T22:45:19.942593397Z] TRACE: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: incoming request (req_id=a335d9db-2db1-4cfc-93de-31431b047b83, uri=/metrics/collectors, method=POST, remote_addr=127.0.0.1:39145, local_addr=127.0.0.1:51560)
[2022-06-16T22:45:50.727619201Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: request completed (req_id=a335d9db-2db1-4cfc-93de-31431b047b83, uri=/metrics/collectors, method=POST, remote_addr=127.0.0.1:39145, local_addr=127.0.0.1:51560, error_message_external="Service Unavailable", error_message_internal="Timeout accessing connection pool", response_code=503)

After that, all we have are repeats of those last two entries: incoming metric collector requests followed by 503 errors. The log goes on until 2022-06-17T14:56:04.528359809Z with no other entries of note.

How does this timing interact with the CockroachDB exit? CockroachDB reported that it was starting to dump stderr at 22:45:17.402:

$ grep 'stderr capture' /dangerzone/omicron_tmp/try_repro.21113/.tmptw5KEL/data/logs/cockroach-stderr.ivanova.dap.2022-06-16T22_45_17Z.026042.log 
I220616 22:45:17.402378 1 util/log/flags.go:201  [-] 1  stderr capture started

but the file's modification time is 22:45:20.716:

$ TZ=utc ls -lErt /dangerzone/omicron_tmp/try_repro.21113/.tmptw5KEL/data/logs/cockroach-stderr.ivanova.dap.2022-06-16T22_45_17Z.026042.log 
-rw-r-----   1 dap      staff     287541 2022-06-16 22:45:20.716442675 +0000 /dangerzone/omicron_tmp/try_repro.21113/.tmptw5KEL/data/logs/cockroach-stderr.ivanova.dap.2022-06-16T22_45_17Z.026042.log

so it took over 3 seconds to finish that dump. (Interestingly, the test log shows successful requests that would have hit the database starting as late as 22:45:19.69231185Z, well after CockroachDB identified the fatal failure, so the database was still functioning for a few seconds after that.)

Anyway, this is certainly around the time that the test stopped making forward progress. Now where exactly were we? The last entry that I can pin to a particular point in setup is this one:

[2022-06-16T22:45:19.737492786Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: client response (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
    result: Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(51560), path: "/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229", query: None, fragment: None }, status: 204, headers: {"x-request-id": "786bab2b-6b7e-4480-854d-fd403ed6283f", "date": "Thu, 16 Jun 2022 22:45:19 GMT"} })

That's the client in the simulated sled agent reporting a successful request to Nexus to register the sled agent. That would happen within:

omicron_sled_agent::sim::Server::start()
nexus_test_utils::start_sled_agent()
nexus_test_utils::test_setup_with_config()
nexus_test_utils::test_setup()
#[nexus_test] macro

which is about here:

https://github.com/oxidecomputer/omicron/blob/236f2ac5bffcac799a12d0cedaa6e0faebd3bb5a/sled-agent/src/sim/server.rs#L73

Now, in the config created by the caller, there are no zpools configured, so I think we pop back up many frames and go on to start the Oximeter server and producer. I don't see the "starting oximeter server" log message, but I think that's because the Oximeter config uses its own log config rather than the test suite's. I'll file a separate bug for this.

At this point, I expect our stack to look like this:

oximeter::collector::Oximeter::new()
nexus_test_utils::start_oximeter()
nexus_test_utils::test_setup_with_config()
nexus_test_utils::test_setup()
#[nexus_test] macro

Inside new(), we make the repeated attempts to register Oximeter with Nexus: https://github.com/oxidecomputer/omicron/blob/236f2ac5bffcac799a12d0cedaa6e0faebd3bb5a/oximeter/collector/src/lib.rs#L422-L454

and that's the request we see failing over and over in the test log with a 503 because the database is down.

The only real fix here is to time out the test if it takes this long. (It's been 6 hours in this state.) I'm not sure how best to do that: we could have each of the nexus_test_utils::start_... functions time out its attempt to start the corresponding server (Oximeter in this case), or we could have some top-level test timeout. On the other hand, for a problem like this, it's possible we'll want core files and other data (pfiles, pldd, pargs, pargs -e output) for all the processes involved. Unless we're going to save all that, it might be better to let the test hang so a person can come in and debug it, as I just did.
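
If we do add timeouts, one simple shape (a sketch using plain tokio; start_oximeter_for_test is a hypothetical stand-in for nexus_test_utils::start_oximeter, and the five-minute budget is arbitrary) would be to wrap each setup step in tokio::time::timeout so a dead dependency becomes a loud failure instead of a silent hang:

use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-in for the real nexus_test_utils::start_oximeter(...),
// which takes log/Nexus/Clickhouse arguments.
async fn start_oximeter_for_test() {
    // ...
}

// Wrap one setup step in a timeout. The 5-minute budget is arbitrary and
// not something the test suite defines today.
async fn setup_step_with_timeout() {
    if timeout(Duration::from_secs(300), start_oximeter_for_test())
        .await
        .is_err()
    {
        panic!("timed out starting Oximeter; is CockroachDB still running?");
    }
}

The downside is exactly the one above: a panic tears the processes down before anyone can grab core files or pfiles output.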

Note this would be a lot faster to debug if we had the Oximeter log because it would show very explicitly that it was trying to register itself, that it failed, and that it was planning to retry. I'll file a separate bug about that.

davepacheco commented 1 year ago

I've put some data (pfiles output and core file) from the test process on catacomb:

dap@catacomb /data/staff/core/issues $ ls -l omicron-1224
total 192851
-rw-rw-r--+  1 dap      staff    323863373 Jun 17 16:58 core.26037
-rw-rw-r--+  1 dap      staff       3013 Jun 17 16:58 pfiles.26037
-rw-rw-r--+  1 dap      staff       2229 Jun 17 16:59 pfiles.core
dap@catacomb /data/staff/core/issues $