vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Prometheus stats "stuck" on last value seen for transforms using aggregations (vector_utilization) #20216

Open johnhtodd opened 6 months ago

johnhtodd commented 6 months ago


Problem

It appears that using aggregations creates conditions where the vector_utilization Prometheus metric becomes "stuck" on the last value seen during a burst of events, even though there are no longer any events in the event queue. The last value shown also seems to be unexpectedly high, causing significant monitoring inaccuracies. All transforms downstream of the aggregation seem to show the same "stuck" behavior.

Configuration

Our full configs are not easily shared, as parts of them are still sensitive.

Here is the general flow of the pipeline:

<source> -> remap -> log_to_metric -> aggregation -> metric_to_log -> remap -> kafka

However, we see this behavior on all paths that run log_to_metric into a timed aggregate, even simpler paths with no remap at the end. We note that only components downstream of the aggregation transform show these "stuck" values.

Main configuration:

data_dir: "/var/lib/vector"
expire_metrics_secs: 21900

api:
  enabled: true
  address: "127.0.0.1:8686"

sources:
  vector_metrics:
    type: "internal_metrics"

sinks:
  vector-prometheus:
    type: "prometheus_exporter"
    inputs: ["vector_metrics"]
    address: "[::]:9598"

Version

vector 0.37.0 (x86_64-unknown-linux-gnu 4279bf0 2024-03-26 04:01:58.800851552)

Debug Output

No response

Example Data

No response

Additional Context

We perform intermittent metric_to_log executions, with many minutes (or sometimes hours) in between flushes of the aggregated metrics that are then transformed into log outputs. Some of these logs are then sent through other transforms before finally being sent to a sink.

We track the activity of all components with the "vector_utilization" Prometheus metric.

We see that during activity (once every X hours) the vector_utilization values for the various metric_to_log transforms and their downstream transforms show what we would expect as "normal" data. But just as the final data is pushed out to the sink, the utilization number for each of these metric_to_log and remap components rises dramatically (to ~0.67, though that varies) and then stays there until the next cycle. It is as if there is a spike in utilization on the very last event processed through the queue, and that last value is "remembered" until the next execution. These stuck values all seem to share one thing: they come from components downstream of aggregations, with long delays between the bursts of events that are sent downstream.

This really throws off our understanding of what is happening with Vector, as it creates several "ghost" components with exceptionally high utilization that is not actually occurring. Or is it? The utilization number stays exactly the same until the next execution, which tells me these are almost certainly artifacts and not real. However, when we try to do things like add up all the vector_utilization figures on a system to understand overall behavior, the artifacts make that total impossible to interpret.

I enclose a Grafana graph of two lines: the top purple line shows a metric_to_log that is a consumer of an aggregation, and the bottom red line is a remap transform that is a consumer of the metric_to_log. From 15:02:00 to 15:02:30 there was a "flush" of aggregation messages through the pipeline, and we see the dip in the vector_utilization numbers for both components. Then at 15:02:45 the queue was drained, and the last event causes the vector_utilization value for each of the two components to jump back up to around 0.69; those numbers then stay exactly the same until the next flush of aggregation events some hours later.
[Screenshot (2024-03-30): Grafana graph of the two vector_utilization series described above]

References

No response

jszwedko commented 6 months ago

Thanks for filing this @johnhtodd! We believe this is because publishing of the utilization metric is driven by events flowing through the pipeline, so, for example, it does not publish 0 when no events are flowing. I think the fix would involve publishing 0 when no events are flowing.

johnhtodd commented 6 months ago

Perhaps a "last observed event" timer that gets checked, and if the timer is >1 second (or some other interval) the value is set to zero. I don't know how the value is calculated, so that may be too primitive, but it seems like a quick fix. It also implies that there is some housekeeping loop that will even be able to check the timer; if the values are set only when an event passes through the component, then this might be more complex to solve.

jszwedko commented 6 months ago

> Perhaps a "last observed event" timer that gets checked, and if the timer is >1 second (or some other interval) the value is set to zero. I don't know how the value is calculated, so that may be too primitive, but it seems like a quick fix. It also implies that there is some housekeeping loop that will even be able to check the timer; if the values are set only when an event passes through the component, then this might be more complex to solve.

Agreed, I think the fix could look something like that. If anyone is interested in taking a shot at this, the fix would likely be somewhere in https://github.com/vectordotdev/vector/blob/7b85728c474abc2ff691624ac253ff1777d450b7/src/utilization.rs
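
As a rough illustration of the shape of the change, here is a minimal standalone sketch of that idea (not Vector's actual code: UtilizationGauge, IDLE_TIMEOUT, and the threaded housekeeping loop are hypothetical stand-ins for whatever the real change in src/utilization.rs would use):

```rust
// Standalone sketch only; not Vector's code.
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// How long we wait without seeing an event before forcing the gauge to zero
/// (a hypothetical value; the real interval is up for discussion).
const IDLE_TIMEOUT: Duration = Duration::from_secs(1);

struct UtilizationState {
    /// Last published utilization value (0.0..=1.0).
    value: f64,
    /// When the component last processed an event.
    last_event: Instant,
}

struct UtilizationGauge {
    state: Arc<Mutex<UtilizationState>>,
}

impl UtilizationGauge {
    fn new() -> Self {
        Self {
            state: Arc::new(Mutex::new(UtilizationState {
                value: 0.0,
                last_event: Instant::now(),
            })),
        }
    }

    /// Event path: record the measured utilization and remember that we just
    /// saw an event. In this sketch "publishing" is just a println!.
    fn record(&self, utilization: f64) {
        let mut state = self.state.lock().unwrap();
        state.value = utilization;
        state.last_event = Instant::now();
        println!("utilization = {utilization:.2}");
    }

    /// Housekeeping loop, independent of event flow, so the gauge can be
    /// driven back to zero even when nothing is passing through the component.
    fn spawn_housekeeping(&self) -> thread::JoinHandle<()> {
        let state = Arc::clone(&self.state);
        thread::spawn(move || loop {
            thread::sleep(Duration::from_millis(250));
            let mut state = state.lock().unwrap();
            if state.last_event.elapsed() > IDLE_TIMEOUT && state.value != 0.0 {
                state.value = 0.0;
                println!("idle for more than {IDLE_TIMEOUT:?}, publishing utilization = 0.00");
            }
        })
    }
}

fn main() {
    let gauge = UtilizationGauge::new();
    let _housekeeper = gauge.spawn_housekeeping();

    // Simulate a short burst of events followed by a long idle gap,
    // like an aggregate flush every few hours.
    for _ in 0..5 {
        gauge.record(0.69);
        thread::sleep(Duration::from_millis(100));
    }
    thread::sleep(Duration::from_secs(3)); // the gauge is zeroed during this gap
}
```

The important part is that the zeroing is driven by a timer rather than by the event path, which is what is missing today.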

esensar commented 6 months ago

> Perhaps a "last observed event" timer that gets checked, and if the timer is >1 second (or some other interval) the value is set to zero. I don't know how the value is calculated, so that may be too primitive, but it seems like a quick fix. It also implies that there is some housekeeping loop that will even be able to check the timer; if the values are set only when an event passes through the component, then this might be more complex to solve.

I am not 100% sure, but I think the latter is true: I don't think anything happens unless an event is passing through, so this might be a bit more complicated than just making a quick fix in utilization.rs. The Utilization component itself is pretty simple and seems to be working fine; the issue is that unless events are passing through it, it is not doing anything, so it is impossible for it to report 0.

johnhtodd commented 6 months ago

This may be a bad idea, so no problem if it is dismissed: how about writing a timestamp to each metric at every Prometheus scrape? If the metric is updated by an event, that timestamp is erased. If Prometheus checks back on the next scrape and the timestamp still exists, then the metric is zeroed out or flagged as "stale", because there have been no "natural" updates. This would mean that a utilization metric would only get "stuck" for one Prometheus scrape interval.
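
To make the bookkeeping concrete, here is a toy sketch of what I have in mind (made-up names, nothing like the real exporter code; a "seen by last scrape" flag stands in for the timestamp I described):

```rust
// Standalone sketch of the scrape-driven staleness idea above.
// ScrapedGauge and its methods are hypothetical and bear no relation to the
// actual prometheus_exporter sink internals.

struct ScrapedGauge {
    value: f64,
    /// Set at every scrape, cleared whenever an event updates the gauge.
    seen_by_scrape: bool,
}

impl ScrapedGauge {
    fn new() -> Self {
        Self { value: 0.0, seen_by_scrape: false }
    }

    /// Event path: a "natural" update erases the scrape marker.
    fn update(&mut self, value: f64) {
        self.value = value;
        self.seen_by_scrape = false;
    }

    /// Scrape path: if the marker from the previous scrape is still set, no
    /// event has touched the gauge in a full scrape interval, so zero it
    /// (or mark it stale). Then re-arm the marker for the next scrape.
    fn scrape(&mut self) -> f64 {
        if self.seen_by_scrape {
            self.value = 0.0;
        }
        self.seen_by_scrape = true;
        self.value
    }
}

fn main() {
    let mut gauge = ScrapedGauge::new();

    gauge.update(0.69);
    println!("scrape 1: {}", gauge.scrape()); // 0.69 - fresh value
    println!("scrape 2: {}", gauge.scrape()); // 0    - no event since scrape 1
    gauge.update(0.42);
    println!("scrape 3: {}", gauge.scrape()); // 0.42 - fresh again
}
```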