oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
243 stars 36 forks source link

Want sled identifiers on most (all?) timeseries #5267

Open bnaecker opened 6 months ago

bnaecker commented 6 months ago

Most timeseries today do not include information about the sled from which they're derived. They should! The physical data link stats are one example. They include the sled's UUID and serial right now. But serial numbers are only unique within a part number, and so we'd probably want to add the part and revision to these as well.

Note that this is probably not restricted just to physical or host-specific timeseries. For example, it would be very useful to understand Crucible virtual disk write statistics broken down by the host sled. Ditto for vCPU data, or VNIC usage (both guest and host).

Implementation note

The oximeter derive-macros for implementing Target and Metric could benefit from a bit more care. I've experimented previously with adding units through attributes, for example. One improvement that could help implement this current issue is something like #[serde(flatten)] or #[diesel(embed)], i.e., a way to include some other struct and its fields in this one. We could then make a set of consistent, widely-used targets in a place like oximeter-instruments, for including common fields. So something like:

/// Shared identifiers for a sled
struct Sled {
    part: String,
    revision: String,
    serial: String,
    id: Uuid,
}

#[derive(oximeter::Target)]
struct SomeTarget {
    #[oximeter(embed)]
    sled: Sled,
}

One option here, though it's a bit magical, might be to prefix the field names of the embedded struct with the field name in the containing struct. So the fields for SomeTarget would be "sled_part", "sled_revision", etc.

leftwo commented 3 weeks ago

I have seen customers request access to more information about where their instances are located, and it seems like having the sled ID where stats were collected is related to this.

Would adding (or removing) it in the future cause problems with data already collected? Like, could we change our minds?

bnaecker commented 3 weeks ago

could we change our minds?

Yes, we can change our minds. Today, the way to do that is to expunge the old schema and any data during an offline upgrade. That is a short-term solution I put in place to make it easier to explore these kinds of choices.

In the mid- to long-term, I'm working to support more careful, online updates to the schema. Right now, it's not entirely clear how to do that, since the choices are (1) have multiple versions of the schema or (2) perform some kind of data migration to put old data into the new format. Each has significant downsides, and I'm not sure I have enough information to choose right now.

bnaecker commented 3 weeks ago

I think I'm coming down on the side of keeping these fields out of the timeseries for now, with the option to add them later if we need. It might be useful to know the physical resources backing any virtual ones, but I think it's going to be highly dependent on the customer and situation. In the meantime, having them now feels like a way to leak information that the rest of the Omicron API has no way to expose -- you can't get the ID of the sled hosting any particular instance today through the various instance-viewing endpoints, for example.

In an ideal world, we would be able to report these fields, but filter them out from the results as needed depending on the user viewing the data. In practice, that seems plausible but difficult to do with the model we have today, and it's not clear if any customers (or us) need it at this point. In the meantime, we have other ways of mapping a virtual-to-physical resources, notably omdb, so we can get this data even if it is (1) difficult and (2) not necessarily a complete historical record.