open-telemetry / opentelemetry-rust

The Rust OpenTelemetry implementation
https://opentelemetry.io
Apache License 2.0

Reset ObservableGauge for all cached attributes sets to 0 #1221

Closed Matthias247 closed 1 year ago

Matthias247 commented 1 year ago

I'm using an ObservableGauge to track errors that the entities monitored by the application might have encountered. The application scans the state of those entities at periodic intervals, aggregates the errors, and then updates gauges according to the latest state. It uses code along these lines:

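// Reuse one attribute vector; the last entry is the "error" label,
// which is overwritten for each reported error.
let mut attrs: Vec<KeyValue> = attributes.to_vec();
attrs.push(KeyValue::new("error", "".to_string()));
let mut total_errs = 0;

// Report one gauge value per distinct error string seen in this scan.
for (error, &count) in m.errors_encountered.iter() {
    total_errs += count;
    attrs.last_mut().unwrap().value = error.to_string().into();
    self.errors_gauge
        .observe(otel_cx, count as u64, &attrs);
}

// Report the sum of all errors under the label value "any".
attrs.last_mut().unwrap().value = "any".to_string().into();
self.errors_gauge
    .observe(otel_cx, total_errs as u64, &attrs);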

That works fine for all errors that have been encountered recently. However, I noticed that once a scan no longer reports a certain error, the opentelemetry-rust/prometheus stack still reports the old value for it. It would need to be set to 0 explicitly.

Is there any mechanism in opentelemetry-rust that allows resetting a gauge (for all combinations of attributes) to 0 before updating it again?

If there were a well-defined set of errors, I could obviously set the values that are not part of errors_encountered to 0 manually. But since those errors are dynamic strings received from another application, that isn't easily possible.
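For illustration, here is a minimal sketch of what manual zeroing could look like if remembering previously seen error strings is acceptable. It uses only the standard library; `ErrorLabelTracker` and `values_to_observe` are hypothetical names, not part of opentelemetry-rust, and the returned map would still have to be fed into the gauge by the caller.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper (not an opentelemetry-rust type): remembers every
/// error label that was ever reported, so labels missing from the latest
/// scan can be reported as 0 explicitly.
#[derive(Default)]
struct ErrorLabelTracker {
    seen: BTreeSet<String>,
}

impl ErrorLabelTracker {
    /// Returns the values to observe for this scan: the current counts,
    /// plus an explicit 0 for every previously seen label that is absent now.
    fn values_to_observe(&mut self, current: &BTreeMap<String, u64>) -> BTreeMap<String, u64> {
        // Remember all labels from this scan.
        self.seen.extend(current.keys().cloned());

        // Start from 0 for every known label, then overlay the current counts.
        let mut out: BTreeMap<String, u64> =
            self.seen.iter().map(|label| (label.clone(), 0)).collect();
        for (label, count) in current {
            out.insert(label.clone(), *count);
        }
        out
    }
}
```

Note that the set of remembered labels only grows in this sketch, so with truly unbounded dynamic error strings it would need some form of eviction.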

Maybe the right answer here is also "you are doing it wrong and shouldn't use gauges for it", which is certainly debatable :) But for this particular problem, where the exact number of entities in a certain state should be determined independently of the report frequency and without metric math, they seem much easier to use.

Matthias247 commented 1 year ago

I think this might be the equivalent method in prometheus directly? https://docs.rs/prometheus/latest/prometheus/core/struct.MetricVec.html#method.reset

Matthias247 commented 1 year ago

I did some experimenting and I'm now wondering whether this behavior changed in 0.20 to match what I expect.

In 0.19 I built the following unit-test:

#[test]
fn test_logging_setup() {
    let metrics_controller = metrics::controllers::basic(metrics::processors::factory(
        metrics::selectors::simple::histogram([1.0, 10.0]),
        aggregation::cumulative_temporality_selector(),
    ))
    .with_collect_period(std::time::Duration::from_secs(0))
    .build();

    let metrics_exporter = Arc::new(opentelemetry_prometheus::exporter(metrics_controller).init());

    let meter = metrics_exporter.meter_provider().unwrap().meter("myservice");
    let x = meter.u64_observable_gauge("mygauge").init();

    let state = KeyValue::new("state", "mystate");
    let p1 = vec![state.clone(), KeyValue::new("error", "ErrA")];
    let p2 = vec![state.clone(), KeyValue::new("error", "ErrB")];
    let p3 = vec![state.clone(), KeyValue::new("error", "ErrC")];

    let counter = Arc::new(AtomicUsize::new(0));

    meter.register_callback(move |cx| {
        let count = counter.fetch_add(1, Ordering::SeqCst);
        println!("Collection {}", count);
        if count % 2 == 0 {
            x.observe(&cx, 1, &p1);
        } else {
            x.observe(&cx, 1, &p2);
        }
        if count % 3 == 1 {
            x.observe(&cx, 1, &p3);
        }

    }).unwrap();

    for _ in 0..10 {
        let mut buffer = vec![];
        let encoder = TextEncoder::new();
        let metric_families = metrics_exporter.registry().gather();
        encoder.encode(&metric_families, &mut buffer).unwrap();
        println!("{}", String::from_utf8(buffer).unwrap());
    }

    panic!("failed");
}

This produces the following output (collapsed; only the gauge lines are extracted for brevity):

```
Collection 0
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 1
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 2
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 3
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 4
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 5
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 6
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 7
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 8
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
Collection 9
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
```

As the output shows, every scrape from Collection 1 onward encodes all 3 metrics, including those that were not observed in that collection:

mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

With 0.20, the equivalent code snippet seems to be this:

#[test]
fn test_logging_setup() {
    let prometheus_registry = prometheus::Registry::new();
    let metrics_exporter = opentelemetry_prometheus::exporter()
        .with_registry(prometheus_registry.clone())
        .build().unwrap();
    let meter_provider = metrics::MeterProvider::builder()
        .with_reader(metrics_exporter)
        .build();

    let meter = meter_provider.meter("myservice");
    let x = meter.u64_observable_gauge("mygauge").init();

    let state = KeyValue::new("state", "mystate");
    let p1 = vec![state.clone(), KeyValue::new("error", "ErrA")];
    let p2 = vec![state.clone(), KeyValue::new("error", "ErrB")];
    let p3 = vec![state.clone(), KeyValue::new("error", "ErrC")];

    let counter = Arc::new(AtomicUsize::new(0));

    meter.register_callback(&[x.as_any()], move |observer| {
        let count = counter.fetch_add(1, Ordering::SeqCst);
        println!("Collection {}", count);
        if count % 2 == 0 {
            observer.observe_u64(&x, 1, &p1);
        } else {
            observer.observe_u64(&x, 1, &p2);
        }
        if count % 3 == 1 {
            observer.observe_u64(&x, 1, &p3);
        }

    }).unwrap();

    for _ in 0..10 {
        let mut buffer = vec![];
        let encoder = TextEncoder::new();
        let metric_families = prometheus_registry.gather();
        encoder.encode(&metric_families, &mut buffer).unwrap();
        println!("{}", String::from_utf8(buffer).unwrap());
    }

    panic!("failed");
}

This produces the following output (collapsed; only the gauge lines are extracted for brevity):

```
Collection 0
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1
Collection 1
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1
mygauge{error="ErrC",state="mystate",otel_scope_name="myservice"} 1
Collection 2
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1
Collection 3
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1
Collection 4
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1
mygauge{error="ErrC",state="mystate",otel_scope_name="myservice"} 1
Collection 5
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1
Collection 6
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1
Collection 7
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1
mygauge{error="ErrC",state="mystate",otel_scope_name="myservice"} 1
Collection 8
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1
Collection 9
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1
```

Here the exported metrics match what I expect: only the gauge values for attributes that were submitted in the last callback are retained. I'm not sure what Prometheus will actually make of it (whether it resets non-submitted values to 0 or not), but I will figure it out.
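The difference between the two outputs can be modeled with a small standard-library sketch. It only mirrors the `count % 2` / `count % 3` logic of the test callback and is not how either SDK version is implemented; the function names are made up for illustration. One function keeps every attribute set seen so far (matching the 0.19 output), the other exports only what the latest callback observed (matching the 0.20 output).

```rust
use std::collections::BTreeSet;

/// "error" label values the test callback observes for a given collection
/// number, mirroring the `count % 2` / `count % 3` branches of the test.
fn observed_in_callback(count: usize) -> BTreeSet<&'static str> {
    let mut set = BTreeSet::new();
    if count % 2 == 0 {
        set.insert("ErrA");
    } else {
        set.insert("ErrB");
    }
    if count % 3 == 1 {
        set.insert("ErrC");
    }
    set
}

/// Model of the 0.19 output: every attribute set seen so far stays exported.
fn exported_with_caching(collections: usize) -> Vec<BTreeSet<&'static str>> {
    let mut cache = BTreeSet::new();
    (0..collections)
        .map(|count| {
            cache.extend(observed_in_callback(count));
            cache.clone()
        })
        .collect()
}

/// Model of the 0.20 output: only the latest callback's observations are exported.
fn exported_without_caching(collections: usize) -> Vec<BTreeSet<&'static str>> {
    (0..collections).map(observed_in_callback).collect()
}
```

Under this model, the cached variant exports all three series from Collection 1 onward, while the uncached variant exports, for example, only ErrB and ErrC in Collection 1.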

Is this change in behavior expected, and was it a bugfix in 0.20? Or was the setup code for 0.19, which I mostly copied from examples, instructing the library to behave this way? The description of #1000 doesn't seem to explain such a behavior change.

cijothomas commented 1 year ago

0.20 is a near-complete rewrite of the Metrics API/SDK, as the older implementation was based on an old, experimental version of the spec itself. The new one matches the stable spec, so if you are getting what you need with the new one, that is awesome news!

TommyCpp commented 1 year ago

It seems to be related to https://github.com/open-telemetry/opentelemetry-rust/issues/955. I think the issue has been patched in v0.20, so I will close this out for now. Feel free to reopen if you have other questions/comments.