Push more OxQL table operations into the database #6480

bnaecker commented 3 months ago

Today, OxQL queries execute partially in the ClickHouse database, and partially in the Rust library interface. For example, given the query:

get sled_data_link:bytes_sent | filter sled_id == "<UUID>" | align mean_within(1m)

The filtering portion of this runs in the database, while the alignment runs in Rust. The library does a lot of work to push as many filtering operations as possible into the database, so that we ultimately select the smallest set of data we need. The alignment operation here is much more difficult to do in ClickHouse, especially for cumulative counters, but it's pretty straightforward in Rust.
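
As a rough sketch of what that filter pushdown looks like: a field filter like the one above becomes a query against the field tables to find the matching timeseries keys. The table and column names below follow the oximeter field-table layout, but the exact query the library generates will differ:

SELECT timeseries_key
FROM fields_uuid
WHERE (timeseries_name = 'sled_data_link:bytes_sent')
    AND (field_name = 'sled_id')
    AND (field_value = '<UUID>')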

However, this comes with some obvious tradeoffs. First, we are writing our own numerical code, rather than relying on the more robust implementations in ClickHouse. Second, we may need to pay the network and memory cost of transferring much more data than the query will ultimately return. Using the example above, suppose the underlying data were sampled at 1Hz and we wanted to average into 1-day windows. We'd be paying to transfer nearly 100,000 times as much data from the database as the query finally spits out (24 h * 3600 samples / h = 86,400 samples per output point).

We should invest the time to do those kinds of operations in the database to the extent possible. For gauges, this is actually easier than I remember. Take the following query, looking at the last day of the new hardware_component:temperature data for one timeseries, averaged into 1 hour bins:

get hardware_component:temperature | filter chassis_id == "<UUID>" && timestamp > @now() - 1d | align mean_within(1h)

In this query, the alignment / time averaging part can be done in a few ways. One is window functions, but I think the more robust approach is using groupArray(), like so:

SELECT
    toStartOfInterval(timestamp, toIntervalHour(1)) AS interval_start,
    arrayAvg(groupArray(datum))
FROM measurements_f32
WHERE (timeseries_name = 'hardware_component:temperature') AND (timeseries_key = 3019922429078673785)
GROUP BY interval_start
ORDER BY interval_start ASC
LIMIT 10

That's just for one timeseries, but we can apply it for any number of them without a lot of trouble. It'll take some work to get this pushed into the DB, but I think it should be feasible without a huge headache.
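
For example, here's a sketch of the multi-timeseries version (untested, but the shape should hold): grouping by the key as well computes the per-interval average for every matching timeseries in one pass:

SELECT
    timeseries_key,
    toStartOfInterval(timestamp, toIntervalHour(1)) AS interval_start,
    arrayAvg(groupArray(datum)) AS datum
FROM measurements_f32
WHERE timeseries_name = 'hardware_component:temperature'
GROUP BY timeseries_key, interval_start
ORDER BY timeseries_key ASC, interval_start ASC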

Cumulative counters

This all goes to shit when you consider the cumulative counters. The big problem with these is that the samples are not independent: each one includes all the counts accumulated by the samples before it. If we average two adjacent samples, we're more heavily weighting the data from the first sample, since the second sample already contains it. (For example, with cumulative values 10 and 30, the average of 20 is 10 + (30 - 10) / 2: the first period is counted in full, but the second only by half.) This is the whole reason we construct deltas from the cumulative counters in OxQL, by taking adjacent differences.

So what's the problem? The "adjacent difference" operation is really difficult to express in SQL, for a whole host of reasons. Usually, tables don't have any meaningful ordering beyond what you impose with an ORDER BY clause, so the relationship between rows isn't well-defined. Even if we could correctly express the ordering, there are other issues. One is that we only want to take the adjacent difference within a single timeseries, and I don't see a way to prevent the diff from operating between the first row of one timeseries and the last row of the previous one.

There are two ways forward that I can see now: use arrays, or restructure the cumulative data to store deltas directly.

Storing deltas is attractive in a lot of ways. It pushes the complexity to insertion, which almost always operates over smaller amounts of data. It completely sidesteps the problem of computing deltas at query time, which we do in Rust today because of how tricky it is to express in SQL. There is one big downside, though: if we select a chunk of data that does not include the first sample, we don't know the absolute values of the data points. We may be able to get around that by storing both the cumulative values (the data we currently store) and the deltas, paying the extra storage cost.
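
As a sketch of what that might look like (hypothetical table and column names; the real oximeter schema differs in its details):

CREATE TABLE measurements_cumulativeu64_with_deltas
(
    timeseries_name String,
    timeseries_key UInt64,
    start_time DateTime64(9, 'UTC'),
    timestamp DateTime64(9, 'UTC'),
    datum UInt64,   -- cumulative value, as we store today
    delta UInt64    -- datum minus the previous sample's datum, computed at insertion
)
ENGINE = MergeTree()
ORDER BY (timeseries_name, timeseries_key, start_time, timestamp)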

Arrays seem like a good way around this, though they're not a panacea. For one thing, they are definitely trickier to work with. I think we'd need a combination of groupArray() + arrayDifference() + arrayJoin() here to compute the diff within each timeseries and then reconstitute the result as a full table for extraction, as sketched below. Arrays are also limited to 1 million elements, though we can always bump that in the patches we already carry for our version of ClickHouse.
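
A sketch of that combination, using the ARRAY JOIN clause to zip the timestamps back up with the computed deltas. The cumulative table name here is assumed; groupArray()'s ordering has to be imposed by the inner select (and would need verifying under parallel execution); and arrayDifference() emits 0 for the first element, so the first sample of each timeseries would need special handling:

SELECT
    timeseries_key,
    ts AS timestamp,
    delta
FROM
(
    SELECT
        timeseries_key,
        groupArray(timestamp) AS timestamps,
        arrayDifference(groupArray(datum)) AS deltas
    FROM
    (
        SELECT timeseries_key, timestamp, datum
        FROM measurements_cumulativeu64
        WHERE timeseries_name = 'sled_data_link:bytes_sent'
        ORDER BY timeseries_key ASC, timestamp ASC
    )
    GROUP BY timeseries_key
)
ARRAY JOIN timestamps AS ts, deltas AS delta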

Summary

In any case, I might try to implement the averaging in the database only for gauge timeseries now, where we don't need to worry about computing differences. That would help build up some infrastructure for marking which operations have already been done in the database vs. which still need to be done in Rust. I can also investigate changing the data storage mechanism in parallel.

bnaecker commented 3 months ago

One is window functions, but I think the more robust approach is using groupArray()

I should have said "more interpretable" approach. But for completeness, the above array-based query is equivalent to:

SELECT
    interval_start,
    any(avg)
FROM
(
    SELECT
        toStartOfInterval(timestamp, toIntervalHour(1)) AS interval_start,
        avg(datum) OVER (PARTITION BY interval_start) AS avg
    FROM measurements_f32
    WHERE (timeseries_name = 'hardware_component:temperature') AND (timeseries_key = 3019922429078673785)
)
GROUP BY interval_start
ORDER BY interval_start ASC
LIMIT 10

EDIT: Some quick benchmarking initially seemed to show a performance difference between these two, but I did that benchmarking incorrectly; they are in fact largely the same in both time and memory consumption.

bnaecker commented 2 months ago

It just occurred to me that resolving this issue might help with pagination.

I explicitly punted on pagination for OxQL queries. It's not clear exactly how to do that, since the data points, the things you'd expect the pagination limit to apply to, are nested pretty deeply in the returned object. We did, however, get one of the benefits of pagination, limiting the total amount of data processed on the server, by implementing a crude overall limit on the number of samples that can be fetched from the database. The logic and reasoning for that limit are described here:

https://github.com/oxidecomputer/omicron/blob/1b43a0ad22c7435d3e9efe3be7996d604631fad9/oximeter/db/src/client/oxql.rs#L74-L100

We might be able to implement pagination with this trick, but it would really help matters if we could do the entire query in the database. Instead of hard-coding the sample limit, we could take it from the user request and run the query limiting the overall number of records to that value. We could then construct a pagination key from the timestamp of the last value we fetched, and add that to the next round through the query, effectively giving us cursor-based pagination. I don't know how well that would work. I don't think it's entirely reliant on being able to push the query into the DB, but it would probably help.
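
For concreteness, a sketch of one page of that flow, using ClickHouse query parameters. The parameter names here are made up, and this elides the field-filtering part of the query:

SELECT timeseries_key, timestamp, datum
FROM measurements_f32
WHERE (timeseries_name = 'hardware_component:temperature')
    AND (timestamp > {cursor:DateTime64(9)})  -- last timestamp of the previous page
ORDER BY timestamp ASC
LIMIT {page_size:UInt64}

The timestamp of the final row returned would become the cursor for the next page.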

bnaecker commented 1 month ago

We actually don't need a window function at all for gauges; we can do a group-by that aggregates rows with the same start time:

SELECT
    timeseries_key,
    toStartOfInterval(timestamp, INTERVAL 1 SECOND) AS timestamp,
    avg(datum) AS datum
FROM measurements_f32
WHERE
    (timeseries_name = 'hardware_component:temperature') AND (timeseries_key = 3019922429078673785)
GROUP BY timeseries_key, timestamp