oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Some tests in omicron-nexus::test_all stall out and hang #4779

Open sunshowers opened 9 months ago

sunshowers commented 9 months ago

Some example buildomat runs:

On main: https://github.com/oxidecomputer/omicron/runs/20227366854

omicron-nexus::test_all integration_tests::projects::test_project_deletion_with_vpc
omicron-nexus::test_all integration_tests::silos::test_saml_idp_metadata_data_valid

On a PR: https://github.com/oxidecomputer/omicron/pull/4773/checks?check_run_id=20231856893

omicron-nexus::test_all integration_tests::console_api::test_session_me_groups

rcgoodfellow commented 9 months ago

I have what feels like a similar issue, but with different tests hanging. In this run, the following tests hung.

omicron-nexus::test_all integration_tests::rack::test_rack_initialization
omicron-nexus app::update::tests::test_update_deployments

Here is a link to the CI job. In this run, the illumos integration tests passed, but the Linux ones did not. This also comes from a commit that only changed an unrelated buildomat script, and the previous commit on that branch passed CI on the Linux build-and-test job.

smklein commented 9 months ago

Just observed this failure here: https://github.com/oxidecomputer/omicron/pull/4783/checks?check_run_id=20286750269

omicron-nexus::test_all integration_tests::images::test_make_disk_from_image appears to have been running non-stop for almost two hours. Looks like this happened on Ubuntu-22.04.

Reading what's going on from just the Nexus log is getting kinda difficult - the rendered log is 25,000 lines long, and ~20,000 of those lines are from BackgroundTasks.

sunshowers commented 9 months ago

https://github.com/oxidecomputer/omicron/pull/4780 should hopefully land very soon, just waiting on the deploy job.

smklein commented 9 months ago

Couple more cases that just showed up: https://github.com/oxidecomputer/omicron/pull/4783

[5411](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5411)  2024-01-09T11:12:47.656Z       Canceling due to signal: 2 tests still running
[5412](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5412)  2024-01-09T11:12:47.665Z         SIGTERM [6351.638s] omicron-nexus::test_all integration_tests::oximeter::test_oximeter_reregistration
[5413](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5413)  2024-01-09T11:12:47.669Z         SIGTERM [6220.087s] omicron-nexus::test_all integration_tests::role_assignments::test_role_assignments_silo

(These also look wholly unrelated to the PR, but show the same symptom. Something must have merged recently that causes this timeout to trigger much more frequently across such a broad set of tests.)

sunshowers commented 9 months ago

Here's a post-#4780 failure: https://github.com/oxidecomputer/omicron/runs/20299512003

Link to failing stdout/stderr. Sadly this doesn't appear to be helpful.

sunshowers commented 9 months ago

While looking at the logs I found this interesting issue:

2024-01-09T12:32:00.369Z    WARN    test_reject_creating_disk_from_snapshot (clickhouse-client): failed to read version
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    error = Telemetry database unavailable: error sending request for url (http://[::1]:465/?output_format_json_quote_64bit_integers=0): error trying to connect: tcp connect error: Connection refused (os error 111)
    file = oximeter/db/src/client.rs:868
    id = dcf208b1-e86c-4472-a0ac-8874804fe091
2024-01-09T12:32:00.371Z    WARN    test_reject_creating_disk_from_snapshot (oximeter): failed to create ClickHouse client
    error = Database(DatabaseUnavailable("error sending request for url (http://[::1]:465/?output_format_json_quote_64bit_integers=0): error trying to connect: tcp connect error: Connection refused (os error 111)"))
    file = oximeter/collector/src/lib.rs:213
    retry_after = 167.267012453s

The port number seems suspiciously low -- it's definitely below 1024 which is strange.

Wondering if #4755 is responsible. I'm not sure exactly why, but looking at commits on main, it looks like test_all started flaking out after that commit landed.

bnaecker commented 9 months ago

That's an interesting correlation, and I also don't see why that commit would cause it. For background: ClickHouse is started with a port of 0, so it gets some available port from the OS, and then we fish the actual port it bound out of the log file. That machinery has all worked pretty well for a while now, but it's possible moving the log files around introduced some problem. It also seems possible to get a port < 1024 when run with elevated privileges, so perhaps this is a red herring if the tests are run that way.
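
As a minimal sketch of that port-0 mechanism (not the actual test-harness code): the OS picks a free port at bind time, and the chosen port is only knowable afterwards, which is why the harness has to recover it from ClickHouse's log.

```rust
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // Port 0 asks the OS to pick any free ephemeral port; the real port is
    // only known after the bind succeeds, which is why the test harness has
    // to fish ClickHouse's actual port out of its log file after startup.
    let listener = TcpListener::bind("[::1]:0")?;
    let assigned = listener.local_addr()?.port();
    println!("OS assigned port {assigned}");
    Ok(())
}
```

For what it's worth, an unprivileged bind like this normally lands in the ephemeral range (32768-60999 by default on Linux), so a port like 465 would be surprising coming out of this path.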

sunshowers commented 9 months ago

We sadly don't have logs for clickhouse itself as part of the failed test uploads -- we should probably work to upload them.

bnaecker commented 9 months ago

Hmm, I'm not sure why we don't have those ClickHouse log files. It sure looks like TMPDIR is set to /var/tmp/omicron_tmp in .github/buildomat/build-and-test.sh, and the ClickHouseInstance will use Utf8TempDir for storage. Is it possible that the buildomat rules for uploading artifacts miss those? IIRC, the temporary directories for ClickHouse usually look like /tmp/.tmppEavQr, so perhaps the leading . causes them to be missed.

bnaecker commented 9 months ago

I don't think that's it. Buildomat uses glob to find files matching the output rules, and that crate seems to find tempdirs created with a leading . just fine.
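
A quick way to sanity-check that is a small sketch using the glob crate with an illustrative pattern (the real buildomat output rules live in the job config, not here):

```rust
use glob::glob;

fn main() {
    // glob's default MatchOptions set `require_literal_leading_dot: false`,
    // so `*` also matches dot-prefixed tempdirs like `.tmppEavQr`.
    let pattern = "/var/tmp/omicron_tmp/*/*.log"; // illustrative pattern only
    for entry in glob(pattern).expect("pattern is valid") {
        match entry {
            Ok(path) => println!("matched: {}", path.display()),
            Err(e) => eprintln!("unreadable entry: {e}"),
        }
    }
}
```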

sunshowers commented 9 months ago

It looks like it's not always a port below 1024; in this example it was port 3981.

We also don't have success logs to confirm that the ClickHouse issue is actually correlated.

However, looking at the code that generates this warning, it does seem like it would get stuck here. I believe it gets invoked via start_oximeter, which doesn't have an associated timeout. So the next steps are:

  1. Gather ClickHouse logs and ensure they're uploaded: https://github.com/oxidecomputer/omicron/pull/4796
  2. Add a timeout to start_oximeter: https://github.com/oxidecomputer/omicron/pull/4789 (a rough sketch of that kind of wrapper is shown below)
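
To make item 2 concrete, here is a minimal, hedged sketch of such a timeout wrapper; the names and the 60-second bound are placeholders, not the actual omicron API:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Placeholder for the real helper, which starts an oximeter collector against
// the test ClickHouse instance and retries until the database is reachable.
async fn start_oximeter() -> Result<(), String> {
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), String> {
    // Without a bound here, a ClickHouse that never becomes reachable keeps
    // the retry loop going forever, and the whole test appears to hang.
    match timeout(Duration::from_secs(60), start_oximeter()).await {
        Ok(result) => result,
        Err(_) => Err("oximeter did not start within 60 seconds".to_string()),
    }
}
```
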
sunshowers commented 9 months ago

So I haven't been able to reproduce the flakiness with #4796 in any of the 5 runs that happened on the PR, or on main (https://github.com/oxidecomputer/omicron/runs/20441805293). That is really weird; in a well-behaved system, 4796 would not have any impact on this.

So how could this happen?

My current reasoning is: 4796 does two things.

  1. Renames clickhouse temp dirs from $TMPDIR/clickhouse-... to $TMPDIR/<process-name>.<test-name>.<pid>.<id>-clickhouse-... (see the naming sketch after this list).
  2. Preserves ClickHouse logs on test failures.
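
A rough illustration of the naming in change 1, using the tempfile crate directly with made-up process and test names (the actual harness code is different):

```rust
use tempfile::Builder;

fn main() -> std::io::Result<()> {
    // Encode which test owns the directory into its name, so a stray
    // clickhouse data dir can be traced back to a specific test run.
    let prefix = format!(
        "{}.{}.{}-clickhouse-",
        "omicron-nexus",             // process/binary name (illustrative)
        "test_make_disk_from_image", // test name (illustrative)
        std::process::id(),
    );
    let dir = Builder::new().prefix(&prefix).tempdir()?;
    println!("created {}", dir.path().display());
    Ok(())
}
```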

If 4796 does not actually address the flakiness, we continue to dig, now with better logs.

If 4796 does fix the flakiness, then the fix surely can't be change 2, because that code only kicks in after a test has already failed. So the only thing it can be is change 1.

How could change 1 matter in practice? The best explanation I can imagine is something removing those directories while the clickhouse process is running. That may also suggest why #4755 caused the flakiness: it moved the cwd to inside the temp dir, and processes generally don't like having their cwd deleted.
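
A self-contained demo of that failure mode (purely illustrative, not omicron code): once a process's working directory is deleted, relative-path operations start failing even though the process keeps running.

```rust
use std::{env, fs};

fn main() -> std::io::Result<()> {
    // Simulate an external cleaner removing a process's working directory.
    let dir = env::temp_dir().join("cwd-deletion-demo");
    fs::create_dir_all(&dir)?;
    env::set_current_dir(&dir)?;
    fs::remove_dir(&dir)?;

    // The process survives, but relative-path operations now fail with
    // ENOENT on Linux, which is the kind of trouble clickhouse would hit
    // if its data dir (and cwd) vanished underneath it.
    match fs::File::create("data.bin") {
        Ok(_) => println!("unexpectedly succeeded"),
        Err(e) => println!("relative create failed after cwd removal: {e}"),
    }
    Ok(())
}
```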

Assuming this hypothesis, what could that process be? I can imagine one of two things:

  1. Something checked into Omicron. But then why would the flakiness only happen on Linux and not on Helios?
  2. Some kind of tempfile cleaner. But why would that cleaner spare the cwd when it's named one way but not the other?

This is all really strange.

sunshowers commented 8 months ago

So it does look like 4796 fixed the flakiness...

sunshowers commented 8 months ago

We discussed this in the Control Plane sync today and here's what we decided:

I was hoping to timebox digging into this to last week. That has been successful in the sense that the flakiness has been addressed, but we haven't root-caused the issue. However, I have a bigger company priority (automated update) to work on, so I'm going to put this on the back burner for now while working on more pressing issues.

If and when we come back to this in the future, a few places to start looking:

  1. Trying to revert change 1 in 4796 (renaming dirs) and seeing if we get any useful logs from clickhouse thanks to change 2.
  2. If it is true that something is wiping out the cwd of the clickhouse process at the wrong time, writing an eBPF script that tracks unlinks in /var/tmp/omicron_tmp and records which process did them.

bnaecker commented 1 week ago

@sunshowers I'm pretty confident the underlying issue around discovery of the correct ports will be resolved by #6655. But it seems like this issue is also about the failure to upload the log files. What's your opinion? Would you like to keep it open to track that aspect, or should we rename this issue or open a new one?

sunshowers commented 1 week ago

Are ClickHouse log files currently uploaded for failing tests? Sorry it's been a while!

bnaecker commented 1 week ago

> Sorry it's been a while!

It sure has!

> Are ClickHouse log files currently uploaded for failing tests?

I think so. At least, one of my recent PRs that had a failed test job does appear to have uploaded the files.