oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Metrics: hardware_component:poll_error_count query returns timeseries not found error in some cases #6709

Open askfongjojo opened 2 months ago

askfongjojo commented 2 months ago

A query against the component polling error metrics returns the following HTTP 400 error on rack2:

$ oxide experimental timeseries query --query 'get hardware_component:poll_error_count'
error
Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "9692e4d9-0633-470b-b90b-6652d8e75ece", "content-length": "170", "date": "Fri, 27 Sep 2024 14:29:56 GMT"}; value: Error { error_code: Some("InvalidRequest"), message: "Timeseries not found for: hardware_component:poll_error_count", request_id: "9692e4d9-0633-470b-b90b-6652d8e75ece" }

The problem was also seen on a racklette but not on rack3 (where the query returned one row of data). It appears that when there are no polling errors, the producer skips the sample altogether instead of inserting a zero count.

bnaecker commented 2 months ago

Thanks for filing this! The behavior here is one of the reasons I want to move the schema for all timeseries into CRDB and then populate them in ClickHouse from there. Today, the oximeter collector derives a schema from every sample it collects and inserts those schemas into ClickHouse. As you pointed out, if a producer never generates a sample, there will never be a schema for it! I'd rather it not be up to the producer, since we know at build time which schemas are available, even if those tables remain empty forever.
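To make the failure mode concrete, here is a minimal, self-contained sketch of the two approaches discussed above. It is not the oximeter implementation: `Sample`, `SchemaStore`, `produce`, `record_sample`, and `register` are hypothetical names invented for illustration, and the store is an in-memory map standing in for ClickHouse. It only assumes what the thread describes: a schema appears when a sample is collected, and a producer may emit nothing when the error count is zero.

```rust
use std::collections::BTreeMap;

// Name taken from the query in this issue.
const NAME: &str = "hardware_component:poll_error_count";

/// Hypothetical, simplified stand-in for a collected sample.
struct Sample {
    timeseries_name: String,
    value: u64,
}

/// Hypothetical in-memory stand-in for the schema and data tables.
#[derive(Default)]
struct SchemaStore {
    // A key exists only if a schema is known for that timeseries.
    tables: BTreeMap<String, Vec<u64>>,
}

impl SchemaStore {
    /// Sample-derived schema: the schema is created as a side effect
    /// of collecting the first sample.
    fn record_sample(&mut self, sample: &Sample) {
        self.tables
            .entry(sample.timeseries_name.clone())
            .or_default()
            .push(sample.value);
    }

    /// Up-front registration: the schema exists even if no sample
    /// is ever collected, so the table may stay empty.
    fn register(&mut self, name: &str) {
        self.tables.entry(name.to_string()).or_default();
    }

    /// Querying an unknown timeseries is an error; querying a known
    /// but empty one returns zero rows.
    fn query(&self, name: &str) -> Result<&[u64], String> {
        self.tables
            .get(name)
            .map(|rows| rows.as_slice())
            .ok_or_else(|| format!("Timeseries not found for: {}", name))
    }
}

/// Assumed producer behavior: emit nothing when there are no errors.
fn produce(error_count: u64) -> Option<Sample> {
    if error_count == 0 {
        None
    } else {
        Some(Sample {
            timeseries_name: NAME.to_string(),
            value: error_count,
        })
    }
}
```

In this sketch, a store fed only by `produce(0)` returns the "Timeseries not found" error for the query, while a store on which `register` was called first returns an empty result. That matches the difference between the racks that failed and rack3, which had one row of data.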