Closed clux closed 3 years ago
Well, I've done the propagation..one way. It was hard to retrofit the whole cold/hot shard stuff with an extra optional struct which I don't know how to do atomic style for.
Also not cared for protobuf beyond what was necessary.
But can now do:
let mut exemplar_labels = HashMap::new();
exemplar_labels.insert("traceID".into(), trace_id);
let ex = Exemplar::new_with_labels(duration, exemplar_labels);
reconcile_duration_histogram
.with_label_values(&[])
.observe_with_exemplar(duration, ex);
and text encoder outputs:
# HELP foo_controller_reconcile_duration_seconds The duration of reconcile to complete in seconds
# TYPE foo_controller_reconcile_duration_seconds histogram
foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 7 # {traceID="53b9b11e6aa06934c5f1bcedc772130d"} 37
foo_controller_reconcile_duration_seconds_bucket{le="0.1"} 8
foo_controller_reconcile_duration_seconds_bucket{le="0.25"} 8
have had to subvert the histogramcore sharding a bit, wasn't sure it made any sense to do a hot/cold path for exemplars when you have to sync write the whole struct, but maybe i am wrong.
for now it's sitting straight as a vector of optional exemplars (same length as bucket len) and writes to them are sync, which probably makes the existing sync mechanism useless. not sure how to make this better. could feature flag exemplars?
Is there anything left to do on this PR? Anything I can do to help?
Well, I guess the main thing is that no matter how much I wrangle prometheus, it does not accept my output, even though it looks like it should conform to the spec. I have a demo app that is outputting:
foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 # {trace_id="6c1011d34f17c4ca1177b40ddfb5184b"} 0.006
foo_controller_reconcile_duration_seconds_bucket{le="0.1"} 1
foo_controller_reconcile_duration_seconds_bucket{le="0.25"} 1
in its /metrics
but prometheus fails to parse it with a:
expected timestamp or new record, got "MNAME"
I have a demo implementation in https://github.com/clux/controller-rs/pull/12, and it's definitely the exemplar output that is failing, so not really sure what to do :-(
Exemplars must have the timestamp after the example value, just like the error says.
@bobrik : The error is asking for a histogram timestamp or a new record, not an exemplar timestamp (which is optional by the spec). These two are both valid:
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1' | promtool check metrics
foo_controller_reconcile_duration_seconds_bucket no help text
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 1623574027' | promtool check metrics
foo_controller_reconcile_duration_seconds_bucket no help text
But none of these are:
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 1623574027 # {trace_id="69d991d2ef39b15e71e2f42eb02a9b55"} 0.006"' | promtool check metrics
error while linting: text format parsing error in line 1: spurious string after timestamp: " # {trace_id=\"69d991d2ef39b15e71e2f42eb02a9b55\"} 0.006\""
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 1623574027 # {trace_id="69d991d2ef39b15e71e2f42eb02a9b55"} 0.006 1623574027"' | promtool check metrics
error while linting: text format parsing error in line 1: spurious string after timestamp: " # {trace_id=\"69d991d2ef39b15e71e2f42eb02a9b55\"} 0.006 1623574027\""
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 1623574027 # ' | promtool check metrics
error while linting: text format parsing error in line 1: spurious string after timestamp: " # "
and without both optional timestamps (orig error):
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 # {trace_id="69d991d2ef39b15e71e2f42eb02a9b55"} 0.006"' | promtool check metrics
error while linting: text format parsing error in line 1: expected integer as timestamp, got "#"
Prometheus has 2 metrics format: Prometheus text format, and OpenMetrics.
Only OpenMetrics supports examplars. There are a few other differences between the Text Format and OpenMetrics.
Thanks a lot @roidelapluie ! That solves the mystery.
I can get this client to work with openmetrics format, but it required a different exposition ending (# EOF\n
), and forces a user requirement on the new content-type header, and possibly a different timestamp format (which I started tackling in here, but haven't changed to/verified anywhere).
So while this does seem parseable by prometheus if we skip optional exemplar timestamps and include the header, I am starting to fear that this PR might be a big one to merge if it requires changing the entire content format :(
Edit for anyone is visiting this in the future. I have stopped working on it. This PoC works for histograms, but it needs a lot of work in the library to support two metric formats.
We are not evolving the text format anymore, we want to promote OpenMetrics. If you plan for adding full openmetrics support for Rust, do not hesitate to reach for help. There are a few differences between both formats, and the library ideally would handle both (via negotiation in the headers).
This is a quick draft go at adding exemplar support from #393.
Is this something you would be willing to accept if it was cleaned up? Note that it doesn't compile yet.
Have added them to counters atm, but looks like there are a lot of layers between the Counter and the encoding bits so haven't gotten that far. If this is welcome, then any pointers on how to best propagate the data to the encoders would be appreciated.