oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Oximeter collector API needs to be paginated #740

Open bnaecker opened 2 years ago

bnaecker commented 2 years ago

The current implementation of the Dropshot endpoint that the oximeter collector uses to scrape metric data is not paginated. It should be, for at least two reasons: (1) to bound the size of any single response, and (2) to limit the data lost if oximeter fails partway through a collection.

The former is self-explanatory, but the latter deserves some expansion. Right now, oximeter collects all available data from a metric producer. All consumers currently use the oximeter_producer::ProducerRegistry type to do this. When the server gets a request from oximeter, that type pulls all available data from its registered producers, and sends it over the network. At that point, it's gone from the producer.

oximeter does some work on the data, including figuring out which timeseries each sample belongs to, constructing the database records, and inserting them in batches. Should oximeter crash between the time the data leaves the producer and the time it's inserted into the database, that data will be gone for good. Pagination could help prevent this, if it's designed correctly.

The idea (credit to @ahl) would be to have oximeter collect a limited set of samples. The producer packages them up and sends them in the response, but does not drop them. Instead, they are moved to a holding area, along with the page token (or some other identifier) included in the response sent to oximeter.

In the happy case, oximeter inserts the data, and then comes back for more, until all the available data is consumed. The producer uses the next page token (or a sequence number in the request) as an ACK from oximeter that it has collected and handled the last page of data.

In the sad case, oximeter will start making requests to the producer that either lack a page token or repeat the same page token. The producer will again serve the last batch of data, which it has kept on hand.
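A minimal sketch of that exchange, where all names (`PagedProducer`, `Sample`, and so on) are hypothetical rather than the actual oximeter_producer API: presenting the previous page's token acts as the implicit ACK, while a missing or stale token re-serves the held page.

```rust
use std::collections::VecDeque;

type Sample = u64; // stand-in for a real metric sample

struct PagedProducer {
    queue: VecDeque<Sample>,              // samples not yet handed to oximeter
    pending: Option<(u64, Vec<Sample>)>,  // last served page, keyed by its token
    next_token: u64,
}

impl PagedProducer {
    fn new(samples: Vec<Sample>) -> Self {
        Self { queue: samples.into(), pending: None, next_token: 1 }
    }

    /// Serve one page of at most `limit` samples. `token` is the page token
    /// from the previous response; echoing it acts as an implicit ACK.
    fn collect(&mut self, token: Option<u64>, limit: usize) -> (u64, Vec<Sample>) {
        if let Some((t, page)) = self.pending.take() {
            if token != Some(t) {
                // Sad case: no token, or a stale one. Put the held page
                // back and re-serve it unchanged.
                self.pending = Some((t, page.clone()));
                return (t, page);
            }
            // Happy case: the token matched, so the previous page was
            // handled and is now dropped for good.
        }
        let n = limit.min(self.queue.len());
        let page: Vec<Sample> = self.queue.drain(..n).collect();
        let tok = self.next_token;
        self.next_token += 1;
        self.pending = Some((tok, page.clone()));
        (tok, page)
    }
}
```

The held page survives any number of retries: the producer only discards data once the collector proves, by echoing the token, that it has moved past it.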

Batching

There's a wrinkle here, which is that oximeter currently batches up data for insertion into ClickHouse. This is just for efficiency: the documentation is pretty clear (and we have confirmation from friends of Oxide) that data is best inserted in large batches. ("Large" varies, but is on the order of thousands of records, with the goal being only a handful of inserts per second.)

oximeter would need (I think) to change this strategy. In particular, we can't consider data acknowledged when it's in the batch queue, only when it's in the database. That means the producer would need to keep as much data as fits in a batch, not just the last page. I don't think that's a huge deal, since batches are currently inserted every 5 seconds or 1000 records, whichever comes first. But it does mean we might need a way for the producer to hold on to multiple pages, or a way for oximeter to acknowledge multiple pages.
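For concreteness, the flush policy described above ("every 5 seconds or 1000 records, whichever comes first") amounts to a simple predicate; the constants and the function name are illustrative, not the actual oximeter batching code:

```rust
use std::time::Duration;

/// Flush the batch when it reaches MAX_RECORDS, or when the oldest queued
/// record has been waiting MAX_AGE, whichever comes first.
fn should_flush(batched: usize, oldest_age: Duration) -> bool {
    const MAX_RECORDS: usize = 1000;
    const MAX_AGE: Duration = Duration::from_secs(5);
    batched >= MAX_RECORDS || oldest_age >= MAX_AGE
}
```

The important consequence for the producer is the window this opens: everything sitting in the batch queue is still unacknowledged, so the producer must hold on to all of it, not just the most recent page.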

This sort of suggests a "two-pointer" approach. The page token serves as the front-running pointer, tracking the last data oximeter collected from the producer; each page of samples carries sequence numbers. oximeter then separately acks the set of sequence numbers, or just the last one, after inserting the data into the database. The producer then drops that data.

Note that unordered IDs might be better than sequence numbers. If the producer dies, I don't believe we want it starting over from zero, since oximeter's counter would be somewhere much further along. Although it's feasible oximeter could simply accept whatever sequence numbers it gets. I'm not sure.
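A rough sketch of that two-pointer holding area, using sequence numbers for concreteness (an unordered-ID variant would be structurally similar); all names here are hypothetical:

```rust
use std::collections::VecDeque;

type Seq = u64;

/// Two-pointer holding area: `collect` advances the front-running pointer by
/// handing out sequence-numbered samples without dropping them; `ack`
/// advances the trailing pointer, discarding everything the collector has
/// confirmed as inserted into the database.
struct HoldingArea {
    held: VecDeque<(Seq, u64)>, // (sequence number, sample) awaiting ack
    next_seq: Seq,
    cursor: usize, // index of the first un-collected sample within `held`
}

impl HoldingArea {
    fn new() -> Self {
        Self { held: VecDeque::new(), next_seq: 0, cursor: 0 }
    }

    fn produce(&mut self, sample: u64) {
        self.held.push_back((self.next_seq, sample));
        self.next_seq += 1;
    }

    /// Hand out up to `limit` samples past the collect pointer, keeping them.
    fn collect(&mut self, limit: usize) -> Vec<(Seq, u64)> {
        let page: Vec<_> =
            self.held.iter().skip(self.cursor).take(limit).cloned().collect();
        self.cursor += page.len();
        page
    }

    /// ACK everything up to and including `seq`: drop it for good.
    fn ack(&mut self, seq: Seq) {
        while matches!(self.held.front(), Some((s, _)) if *s <= seq) {
            self.held.pop_front();
            self.cursor = self.cursor.saturating_sub(1);
        }
    }
}
```

Because the two pointers move independently, the producer can serve several pages ahead of the last ack, which is exactly what the batching strategy above requires.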

bnaecker commented 2 years ago

Fleshing out the second part of this issue. The oximeter_producer::Server needs a richer API. In particular, we need an endpoint for collecting a paginated set of samples, as noted above, and an endpoint for acking and/or nacking that page as written to the database. The server can then actually drop the samples. I don't think sequence numbers are the right ID here, since either side may die and reset such a counter; using the page token, or some other identifier like a UUID, is probably better.

If oximeter dies, it will make a request to the producer server that does not include a page token or identifier, which the producer uses as a sentinel to re-serve the collected-but-unacked data. If the producer dies, there are two cases: either oximeter successfully inserted the data, or it didn't (oximeter itself died, the database rolled, or something else).

If the data was successfully inserted, the request to ack the data can be safely ignored by the new producer. The next request for data will have a page token, but the producer can just ignore that and serve the latest data. If the data was not inserted and oximeter is nack-ing it to the producer, we'd lose the data. This sounds pretty unlikely, since both oximeter and the producer would need to roll at the same time. That's not impossible though, and should we start to hit that, we'll need to investigate a durable storage solution local to the producer if possible.
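Putting this comment together, the request/response shapes might look roughly like the following. Every name is illustrative (a real implementation would use Dropshot endpoints and real UUIDs, not a counter-derived string), and the key property is the one described above: an ACK for an unknown page ID is safely ignored, which covers the restarted-producer case.

```rust
/// A page of samples served under a fresh opaque ID.
struct CollectResponse {
    page_id: String,
    samples: Vec<u64>,
}

/// The collector's verdict on a previously served page.
enum Outcome {
    Ack,  // inserted into the database; the producer may drop the page
    Nack, // not inserted; the producer must keep it for re-serving
}

struct Producer {
    queue: Vec<u64>,
    pending: Option<(String, Vec<u64>)>,
    counter: u64, // stand-in for a real UUID generator
}

impl Producer {
    fn new(queue: Vec<u64>) -> Self {
        Self { queue, pending: None, counter: 0 }
    }

    /// Serve a page. An unresolved page is re-served unchanged.
    fn collect(&mut self, limit: usize) -> CollectResponse {
        if let Some((id, samples)) = &self.pending {
            return CollectResponse { page_id: id.clone(), samples: samples.clone() };
        }
        let n = limit.min(self.queue.len());
        let samples: Vec<u64> = self.queue.drain(..n).collect();
        self.counter += 1;
        let id = format!("page-{}", self.counter); // would be a UUID in practice
        self.pending = Some((id.clone(), samples.clone()));
        CollectResponse { page_id: id, samples }
    }

    /// Resolve a served page: an ACK drops it; a NACK, or an unknown ID
    /// (e.g. the producer restarted), leaves it to be re-served.
    fn resolve(&mut self, page_id: &str, outcome: Outcome) {
        let matches_pending = self
            .pending
            .as_ref()
            .map_or(false, |(id, _)| id.as_str() == page_id);
        if matches_pending && matches!(outcome, Outcome::Ack) {
            self.pending = None;
        }
    }
}
```

This keeps exactly one page of exposure at the producer; combining it with the two-pointer holding area would let it carry a full batch's worth of unacked pages instead.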

This still leaves open the question about what happens to collected, but un-acked, data if the producer dies. The