oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

Revisit the use of `CAST` and `COALESCE` in CockroachDB queries #1166

Open bnaecker opened 2 years ago

bnaecker commented 2 years ago

While working on #1162, I ran into some unexpected behavior around CRDB's implementation of CAST and COALESCE. The documentation indicates that any arguments after the first non-NULL argument are not evaluated. That doesn't always appear to be the case.

Why does this matter? We've been using some tailored queries, especially when inserting network interfaces, that are designed to fail in detectable ways. For example this query is used to validate that an instance doesn't have two NICs in multiple VPCs. It does that by creating queries like:

CAST(IF(<instance is in one vpc>, 'uuid-as-a-string', 'garbage') AS UUID)

That is, we cast the string version of the VPC's UUID to an actual UUID type if our check passes, or try and fail to convert the string 'garbage'. That check relies on COALESCE, returning one value if the check passes, and another if it fails.

That's been fine so far, where the alternative is usually a literal value, but it falls apart if we really only want to generate those values conditional on the earlier arguments being NULL. That doesn't appear to be tenable, despite CRDB's documentation. I'll file a CRDB issue to track this, but we can verify that the documentation is incorrect or misleading with a simple query:

select coalesce((select 1), (select max(x) from generate_series(1, 1000000) as x);

That runs in about 1.7 seconds on my machine. Based on the explain-analyze output, it really does appear to be evaluating the second query even though the first returns non-NULL. This is "just" a performance concern thus far. But, during the updates to handle #1162, I tried to use this mechanism in another way, where that subquery is only valid if some preconditions have been met. Using it that way can't work, if all arguments are unconditionally evaluated.

In any case, I'd like to at least revisit or audit how we're currently using these kinds of mechanisms. It seems likely that the performance is impacted, and it's possible, if unlikely, that correctness is in play too.

bnaecker commented 2 years ago

Corresponding CRDB issue filed here.