oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Resource utilization metrics could report samples even when there are no changes #5289

Open bnaecker opened 8 months ago

bnaecker commented 8 months ago

Nexus currently keeps track of the resources provisioned: CPUs, memory, and disks. Those go into the utilization views in the web console:

[Screenshot: utilization view in the web console, 2024-03-19]

As you can see there, the data is sampled irregularly. In particular, Nexus generates samples only when there are changes to the data. That is reasonable from an efficiency perspective, since it only reports deltas. However, it makes querying and the graph shown above more painful. It's impossible to know, for example, if any particular time range contains any data. That leads the console implementation to do grubby things like find the latest sample before and after the requested time range.

An alternative would be to report data even for intervals in which there are no changes. This greatly simplifies querying, graphing, and understanding the data, at the obvious expense of transferring more data. The size here is non-trivial, being linear in the number of distinct "virtual collections", which I believe means non-deleted projects, silos, and fleets.

One subtlety here is that we may still wish to report each change, not the total change in the sample period. For example, if a user provisions two new VMs in a sample period, do we report two samples, or one with the sum of the new provisioned resources? I'd expect we want the former, to avoid missing individual changes. So the reported samples would really be:

smklein commented 8 months ago

The original implementation pre-dated RPWs, anyway, so this should be much easier to do nowadays. The periodic querying could definitely be done as a paginated walk over all "fleets / silos / projects", summing up each group in-memory and reporting the value. By performing a paginated walk, that should avoid any full-table scans.
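A rough sketch of what that paginated walk might look like, not taken from the Omicron codebase. Everything here is hypothetical: `Usage`, `Provision`, `fetch_page`, and `sum_by_collection` are illustrative names, and a `BTreeMap` stands in for the database table. The idea is keyset pagination: each page starts just after the last key seen, so no single query has to scan the whole table.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Unbounded};

/// Hypothetical resource totals for one virtual collection.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Usage {
    pub cpus: u64,
    pub ram_gib: u64,
}

/// Hypothetical row: one provisioned resource and the collection it belongs to.
#[derive(Clone, Copy, Debug)]
pub struct Provision {
    pub collection_id: u32,
    pub usage: Usage,
}

/// Stand-in for a paginated database query: returns up to `limit` rows
/// with a key strictly greater than `marker`, in key order.
pub fn fetch_page(
    table: &BTreeMap<u64, Provision>,
    marker: Option<u64>,
    limit: usize,
) -> Vec<(u64, Provision)> {
    let lower = match marker {
        Some(m) => Excluded(m),
        None => Unbounded,
    };
    table
        .range((lower, Unbounded))
        .take(limit)
        .map(|(key, row)| (*key, *row))
        .collect()
}

/// Walk the table one page at a time, summing usage per collection in memory.
pub fn sum_by_collection(
    table: &BTreeMap<u64, Provision>,
    page_size: usize,
) -> BTreeMap<u32, Usage> {
    assert!(page_size > 0, "page size must be non-zero");
    let mut totals: BTreeMap<u32, Usage> = BTreeMap::new();
    let mut marker = None;
    loop {
        let page = fetch_page(table, marker, page_size);
        // An empty page means the walk is complete.
        let Some((last_key, _)) = page.last() else { break };
        marker = Some(*last_key);
        for (_, row) in &page {
            let total = totals.entry(row.collection_id).or_default();
            total.cpus += row.usage.cpus;
            total.ram_gib += row.usage.ram_gib;
        }
    }
    totals
}
```

The resulting map holds one total per collection, which a periodic task could then report as one sample per collection on every interval, whether or not anything changed.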

bnaecker commented 8 months ago

Yeah, that could work. I think we need to maintain the samples in-memory anyway until they're fetched, so another option is to use each new provision operation as an "event" causing us to store a sample. I.e., each time we call into nexus_db_queries::provisioning::Producer::append_all_samples(), we store each new sample. When they're collected, we keep the last one, and report it on the next sample interval if there have been no such events.
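To make the event-driven option concrete, here is a minimal sketch. It does not reflect the real `Producer` type: `Sample`, `SampleBuffer`, `record`, and `collect` are hypothetical names, and stamping the repeated sample with the collection time is an assumption, not something decided in this thread.

```rust
/// Hypothetical sample: a timestamp plus one provisioned quantity.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Sample {
    pub timestamp: u64,
    pub cpus: u64,
}

/// Buffers one sample per provisioning event and remembers the most
/// recently reported sample so it can be repeated on quiet intervals.
#[derive(Default)]
pub struct SampleBuffer {
    pending: Vec<Sample>,
    last_reported: Option<Sample>,
}

impl SampleBuffer {
    /// Store a sample for each provisioning event.
    pub fn record(&mut self, sample: Sample) {
        self.pending.push(sample);
    }

    /// Called once per collection interval, at time `now`.
    /// Returns every buffered sample; if there were no events, repeats
    /// the last reported value (assumption: re-stamped with `now`).
    pub fn collect(&mut self, now: u64) -> Vec<Sample> {
        let mut out = std::mem::take(&mut self.pending);
        if out.is_empty() {
            if let Some(last) = self.last_reported {
                out.push(Sample { timestamp: now, ..last });
            }
        }
        // Remember the newest sample for the next quiet interval.
        if let Some(last) = out.last() {
            self.last_reported = Some(*last);
        }
        out
    }
}
```

With this shape, every individual change is still reported, and an interval with no changes yields exactly one sample carrying the previous value. Before the first event nothing is reported, since there is no value to repeat.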

I dunno what makes the most sense, we'll see when we get in there.