oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
250 stars 39 forks source link

Need to initialize a fresh ClickHouse database without restarting Oximeter #6826

Open plotnick opened 2 weeks ago

plotnick commented 2 weeks ago

With #6800, we can now expunge a running ClickHouse zone and the reconfigurator will replace it. With #6745, qorb can automatically look up and connect to that new zone, but insertions still fail because the oximeter database does not yet exist:

21:19:19.250Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB: Error interacting with telemetry database: Query failed: Code: 81. DB::Exception: Database oximeter does not exist. (UNKNOWN_DATABASE) (version 23.8.7.1)

I don't know if a client can initialize a non-existent database in ClickHouse. If it can, we may be able to solve this on the client side; if not, we'll need some server-side initialization scheme. See also this comment, where this issue came up previously.

plotnick commented 1 week ago

Update: ClickHouse can indeed add new databases from the client, and indeed we already do so when initializing the Oximeter client. The question now is how and at what level to detect an unitialized database; i.e., entirely in the client, or in cooperation with qorb.

jgallagher commented 1 week ago

It looks like this happens here on main: https://github.com/oxidecomputer/omicron/blob/3a3eb52f54c5b321d4b8f1b52bc525e1cb617fff/oximeter/collector/src/agent.rs#L418-L433

6745 changes this slightly: instead of Client::new(db_address, ...) it becomes Client::new_with_pool(resolver, ...), which means we grab a connection out of the Qorb pool, then perform this db check, but only on initial startup. A Qorb pool can choose to implement on_acquire(). I wonder if instead of "acquire a client from the pool, check the database, then proceed assuming it never changes" it makes sense to make it "upon acquiring any client from the pool check the database, then use the client ..."? cc @smklein

I don't know what the implications are of "any time qorb gets a new connection, potentially try to create the db schema". That seems like the kind of thing that could go sideways, maybe?

bnaecker commented 1 week ago

This initialization is currently only done by the schema updater tool. We could do it elsewhere, that’s just the only place it happens today. The initialization SQL is idempotent, so could be run every time a connection is made.

But there are a few problems with that. First, it takes a while, especially with the current HTTP based client, since each statement has to be run in a separate request. It’s currently a few seconds. That should be faster when we move to the native client, but it will still take at least a few hundred milliseconds.

More importantly the client has to know if it should create a single replica or replicated database. Right now we just have the former, but I gather the plan is to support both for a while, so we’d need the connection pool to also know which set of SQL statements should be used in that hypothetical acquisition method.

Lastly, we don’t run that automatically today for the same reason we don’t do it for CRDB. Changes to the schema are expected to be run explicitly by an operator during mupdate. Maybe it would make sense to include information in the blueprint system about which version the database is running and whether it’s replicated. Then I think we could be more confident about running the initialization SQL more automatically.

Anyway, most of that is context, not trying to be prescriptive about next steps. Hope it helps!

davepacheco commented 1 week ago

Would it make sense to make this part of Reconfigurator's action to add the new zone?

bnaecker commented 1 week ago

@davepacheco I think that makes sense, yeah. If we have information about the database we need to create (replicated or single-node) in the blueprint, then we can run it as part of creating the zone. In the long-run, having clickhouse-admin do that as part of starting up the actual ClickHouse server makes sense. I'm not sure if that's how a the discretionary ClickHouse zones will be launched. But either way, it seems pretty straightforward to run a few lines of code using the oximeter_db::Client.