oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

Oximeter collector needs a way to lookup ClickHouse more often #6407

Closed bnaecker closed 1 month ago

bnaecker commented 3 months ago

Today, the oximeter collector looks up ClickHouse in DNS on startup, and uses that address throughout:

https://github.com/oxidecomputer/omicron/blob/d24003b74d232d2ff643a6fc9b5ee10cb5f2b056/oximeter/collector/src/agent.rs#L394-L406

That means the insertion task will always use the same address, even if we move the ClickHouse zones around. We should pass the resolver into that task, or have another way for it to know that it should re-resolve ClickHouse again. It's also important to note this is low priority. We can work around it pretty easily by just restarting the oximeter SMF service, which will cause it to resolve ClickHouse again. That may lose a small amount of data buffered in the collector, but it's pretty minor.

jgallagher commented 3 months ago

Can we use qorb for this once it lands? (This sounds quite similar to #3763 which led to it.)

bnaecker commented 2 months ago

I started on this today, and I think it'll need a bit of a more invasive change than I was hoping. I wanted to keep the existing reqwest::Client for now, and set the DNS resolver it uses to the one from the internal_dns crate. Unfortunately, I don't think that works directly, because reqwest does not use the port numbers returned from an SRV record to make requests -- it seems to basically throw the port away, resolve the AAAA that the SRV record points to, and then use the default for the URI schema (HTTP, in this case). I'm not sure why it does that.

To use qorb, we need two things: a resolver, which knows how to lookup backends, and a connector, that knows how to connect to them. The resolver is very easy, that's basically just a small wrapper around the existing Resolver in the internal_dns crate. But the connector is less clear. There's no way I can see to pass reqwest::Client an actual TCP stream to use. Even if there were, then we're obviously not using any of the existing connection pooling in that crate at all, which seems wasteful. We could instead switch to using a lower-level HTTP library like hyper, which is very low-level and does directly operate over a TCP socket. That might be the best approach, but it's certainly a lot of additional complexity over reqwest.

An alternative is to eschew the connection pooling for the HTTP interface at all, and instead move all the internals over to the new TCP connection. Integrating qorb with that new connection type is much more straightforward, since it's really just some framing over a TCP stream. That does put resolving this PR behind all that work though. That's not terrible, but is more than I originally anticipated.