Closed bnaecker closed 1 month ago
Can we use qorb for this once it lands? (This sounds quite similar to #3763 which led to it.)
I started on this today, and I think it'll need a bit of a more invasive change than I was hoping. I wanted to keep the existing reqwest::Client
for now, and set the DNS resolver it uses to the one from the internal_dns
crate. Unfortunately, I don't think that works directly, because reqwest
does not use the port numbers returned from an SRV record to make requests -- it seems to basically throw the port away, resolve the AAAA that the SRV record points to, and then use the default for the URI schema (HTTP, in this case). I'm not sure why it does that.
To use qorb
, we need two things: a resolver, which knows how to lookup backends, and a connector, that knows how to connect to them. The resolver is very easy, that's basically just a small wrapper around the existing Resolver
in the internal_dns
crate. But the connector is less clear. There's no way I can see to pass reqwest::Client
an actual TCP stream to use. Even if there were, then we're obviously not using any of the existing connection pooling in that crate at all, which seems wasteful. We could instead switch to using a lower-level HTTP library like hyper
, which is very low-level and does directly operate over a TCP socket. That might be the best approach, but it's certainly a lot of additional complexity over reqwest
.
An alternative is to eschew the connection pooling for the HTTP interface at all, and instead move all the internals over to the new TCP connection. Integrating qorb
with that new connection type is much more straightforward, since it's really just some framing over a TCP stream. That does put resolving this PR behind all that work though. That's not terrible, but is more than I originally anticipated.
Today, the
oximeter
collector looks up ClickHouse in DNS on startup, and uses that address throughout:https://github.com/oxidecomputer/omicron/blob/d24003b74d232d2ff643a6fc9b5ee10cb5f2b056/oximeter/collector/src/agent.rs#L394-L406
That means the insertion task will always use the same address, even if we move the ClickHouse zones around. We should pass the resolver into that task, or have another way for it to know that it should re-resolve ClickHouse again. It's also important to note this is low priority. We can work around it pretty easily by just restarting the
oximeter
SMF service, which will cause it to resolve ClickHouse again. That may lose a small amount of data buffered in the collector, but it's pretty minor.