sunshowers opened 9 months ago
I have what feels like a similar issue, but with different tests hanging. In this run, the following tests hung.
- `omicron-nexus::test_all integration_tests::rack::test_rack_initialization`
- `omicron-nexus app::update::tests::test_update_deployments`
Here is a link to the CI job. In this run, the illumos integration tests passed, but the Linux ones did not. This also comes from a commit that only changed an unrelated buildomat script, and the previous commit on that branch passed CI on the Linux build-and-test job.
Just observed this failure here: https://github.com/oxidecomputer/omicron/pull/4783/checks?check_run_id=20286750269
`omicron-nexus::test_all integration_tests::images::test_make_disk_from_image` appears to have been running non-stop for almost two hours. Looks like this happened on Ubuntu-22.04.
Reading what's going on from just the Nexus log is getting kind of difficult -- the rendered log is about 25,000 lines long, and roughly 20,000 of those lines come from `BackgroundTasks`.
https://github.com/oxidecomputer/omicron/pull/4780 should hopefully land very soon, just waiting on the deploy job.
A couple more cases just showed up: https://github.com/oxidecomputer/omicron/pull/4783
[5411](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5411) 2024-01-09T11:12:47.656Z Canceling due to signal: 2 tests still running
[5412](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5412) 2024-01-09T11:12:47.665Z SIGTERM [6351.638s] omicron-nexus::test_all integration_tests::oximeter::test_oximeter_reregistration
[5413](https://buildomat.eng.oxide.computer/wg/0/details/01HKPQ0KFEGHYZHDSCY6RC9JMB/1BHKCrXA6UrkcAbAz5lm6dq48ZkaSiQTlDYpWR0cbxtrF0bK/01HKPQ1NSF26P9RMQR6KSEPM7V#S5413) 2024-01-09T11:12:47.669Z SIGTERM [6220.087s] omicron-nexus::test_all integration_tests::role_assignments::test_role_assignments_silo
(These also look wholly unrelated to the PR, but show the same symptom. Something must have merged recently that makes this timeout trigger much more frequently across such a broad set of tests.)
Here's a post-#4780 failure: https://github.com/oxidecomputer/omicron/runs/20299512003
Link to failing stdout/stderr. Sadly this doesn't appear to be helpful.
While looking at the logs I found this interesting issue:
```
2024-01-09T12:32:00.369Z WARN test_reject_creating_disk_from_snapshot (clickhouse-client): failed to read version
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    error = Telemetry database unavailable: error sending request for url (http://[::1]:465/?output_format_json_quote_64bit_integers=0): error trying to connect: tcp connect error: Connection refused (os error 111)
    file = oximeter/db/src/client.rs:868
    id = dcf208b1-e86c-4472-a0ac-8874804fe091
2024-01-09T12:32:00.371Z WARN test_reject_creating_disk_from_snapshot (oximeter): failed to create ClickHouse client
    error = Database(DatabaseUnavailable("error sending request for url (http://[::1]:465/?output_format_json_quote_64bit_integers=0): error trying to connect: tcp connect error: Connection refused (os error 111)"))
    file = oximeter/collector/src/lib.rs:213
    retry_after = 167.267012453s
```
The port number seems suspiciously low -- it's below 1024, which is strange.
Wondering if #4755 is responsible. I'm not sure exactly why, but looking at the commits on main, it looks like `test_all` started flaking out after that commit landed.
That's an interesting correlation, and I also don't see why that commit would cause it. For background, ClickHouse is started with a port of 0; the OS assigns it an available port; and then we fish the actual port it bound out of the log file. That machinery has all worked pretty well for a while now, but it's possible that moving the log files around introduced some problem. It also seems possible to get a port below 1024 when running with elevated privileges, so perhaps this is a red herring if the tests are run that way.
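(For illustration only -- this is a rough sketch of that machinery, not the actual helper in omicron's test-support code, and the exact log-line format is an assumption:)

```rust
use std::path::Path;

/// Sketch: scan the ClickHouse server log for the HTTP port it actually bound
/// after being started with port 0. The "Listening for http://" line wording
/// is an assumption here, not necessarily what omicron matches on.
fn find_bound_http_port(log: &Path) -> Option<u16> {
    let contents = std::fs::read_to_string(log).ok()?;
    for line in contents.lines() {
        if let Some(idx) = line.find("Listening for http://") {
            // Take whatever follows the last ':' on the line and try to
            // parse it as a port number.
            if let Ok(port) = line[idx..].rsplit(':').next()?.trim().parse::<u16>() {
                return Some(port);
            }
        }
    }
    None
}
```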
We sadly don't have logs for clickhouse itself as part of the failed test uploads -- we should probably work to upload them.
Hmm, I'm not sure why we don't have those ClickHouse log files. It sure looks like `TMPDIR` is set to `/var/tmp/omicron_tmp` in `.github/buildomat/build-and-test.sh`, and the `ClickHouseInstance` will use `Utf8TempDir` for storage. Is it possible that the buildomat rules for uploading artifacts miss those? IIRC, the temporary directories for ClickHouse usually look like `/tmp/.tmppEavQr`, so perhaps the leading `.` causes them to be missed.
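(If it helps, here's a tiny sketch -- assuming `Utf8TempDir` is the camino-tempfile type, which is my recollection -- showing where the directory lands and what it's named:)

```rust
// Sketch only: assumes Utf8TempDir is camino-tempfile's type.
use camino_tempfile::Utf8TempDir;

fn main() -> std::io::Result<()> {
    // tempfile honors TMPDIR on Unix, so with TMPDIR=/var/tmp/omicron_tmp
    // this should be created under that directory.
    let dir = Utf8TempDir::new()?;
    // The default name looks like ".tmpXXXXXX" -- note the leading dot.
    println!("created {}", dir.path());
    Ok(())
}
```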
I don't think that's it. Buildomat uses `glob` to find files matching the output rules, and that crate seems to find tempdirs created with a leading `.` just fine.
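(For what it's worth, a quick way to check -- a sketch assuming the `glob` crate's default `MatchOptions`, where `require_literal_leading_dot` is `false`:)

```rust
use glob::glob;

fn main() {
    // With the glob crate's defaults, `*` also matches entries that start
    // with a dot, so a hidden tempdir like /var/tmp/omicron_tmp/.tmppEavQr
    // should still show up. (The path here is just an example.)
    for entry in glob("/var/tmp/omicron_tmp/*").expect("valid pattern") {
        match entry {
            Ok(path) => println!("{}", path.display()),
            Err(e) => eprintln!("{}", e),
        }
    }
}
```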
It looks like it's not always a port below 1024 -- in this example it was port 3981. We also don't have logs from successful runs to confirm that the ClickHouse issue is actually correlated with the hangs.
However, looking at the code which generates this warning, it does seem like it would get stuck here. I believe it gets invoked via `start_oximeter`, which doesn't have an associated timeout. So the next steps are:

- `start_oximeter`: https://github.com/oxidecomputer/omicron/pull/4789
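(Purely as an illustration of the timeout piece -- this is a sketch, not the real helper; `start_oximeter_somehow` and the 60-second deadline are made-up stand-ins -- wrapping the startup future in `tokio::time::timeout` turns an indefinite hang into a prompt test failure:)

```rust
use std::time::Duration;
use tokio::time::timeout;

// Sketch only: `start_oximeter_somehow` stands in for the real
// startup/retry logic, and the 60-second deadline is arbitrary.
async fn start_oximeter_somehow() -> anyhow::Result<()> {
    // ... retry connecting to ClickHouse, register the collector, etc. ...
    Ok(())
}

async fn start_oximeter_with_deadline() -> anyhow::Result<()> {
    // Without an outer deadline, a retry loop that never succeeds hangs the
    // test until the CI-level timeout kills it; with one, it fails quickly.
    match timeout(Duration::from_secs(60), start_oximeter_somehow()).await {
        Ok(result) => result,
        Err(_) => Err(anyhow::anyhow!("oximeter did not start within 60 seconds")),
    }
}
```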
So I haven't been able to reproduce the flakiness with #4796, in any of the 5 runs that happened on the PR, or on main (https://github.com/oxidecomputer/omicron/runs/20441805293). That is really weird -- in a well-behaved system, 4796 would not have any impact on this.
So how could this happen?
My current reasoning is: 4796 does two things.

1. It renames the ClickHouse temporary directories from `$TMPDIR/clickhouse-...` to `$TMPDIR/<process-name>.<test-name>.<pid>.<id>-clickhouse-...`.
2. It improves the logs we collect when a test fails.

If 4796 does not actually address the flakiness, we continue to dig, now with better logs.
If 4796 does fix the flakiness -- then it surely can't be 2, because that code only kicks in after the test has failed. So the only thing it can be is 1.
What could 1 be fixing in practice? The best thing I can imagine is something that removes those directories while the clickhouse process is running. That may suggest a reason why #4755 caused the flakiness (it moved the cwd to inside the temp dir, and processes generally don't like having their cwd deleted).
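(As a small illustration of that last point -- just a sketch, not anything from the test suite -- a process whose working directory gets deleted will see basic operations like `getcwd` start failing:)

```rust
use std::{env, fs};

fn main() -> std::io::Result<()> {
    // Create a directory, make it our cwd, then delete it out from under
    // ourselves -- roughly what would happen to clickhouse if something
    // removed its temp dir while it was still running.
    let dir = env::temp_dir().join("cwd-deletion-demo");
    fs::create_dir_all(&dir)?;
    env::set_current_dir(&dir)?;
    fs::remove_dir(&dir)?;

    // On Linux/illumos this now fails with ENOENT.
    match env::current_dir() {
        Ok(cwd) => println!("cwd still resolves: {}", cwd.display()),
        Err(e) => println!("getcwd failed after cwd was removed: {}", e),
    }
    Ok(())
}
```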
Assuming this hypothetical, what could this process be? I can imagine one of two things:
This is all really strange.
So it does look like 4796 fixed the flakiness...
We discussed this in the Control Plane sync today and here's what we decided:
I was hoping to timebox digging into this to last week. That has been successful in the sense that the flakiness has been addressed, but we haven't root-caused the issue. However, I have a bigger company priority (automated update) to work on, so I'm going to put this on the back burner for now and focus on more pressing issues.
If and when we come back to this in the future, a few places to start looking:

- Deletions within `/var/tmp/omicron_tmp`, based on the process that did them.

@sunshowers I'm pretty confident the underlying issue around discovery of the correct ports will be resolved by #6655. But it seems like this issue is also about the failure to upload the log files. What's your opinion? Would you like to keep it open to track that aspect of it, or should we rename this issue or open a new one?
Are ClickHouse log files currently uploaded for failing tests? Sorry it's been a while!
> Sorry it's been a while!

It sure has!

> Are ClickHouse log files currently uploaded for failing tests?
I think so. At least, one of my recent PRs that had a failed test job does appear to have uploaded the files.
Some example buildomat runs:
On main: https://github.com/oxidecomputer/omicron/runs/20227366854
On a PR: https://github.com/oxidecomputer/omicron/pull/4773/checks?check_run_id=20231856893