vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Using prometheus_remote_write sink with distribution results in memory leak #15295

Open derekhuizhang opened 1 year ago

derekhuizhang commented 1 year ago

Problem

The distribution metric type is not supported by Prometheus. Publishing distribution metrics to a statsd source that routes to a prometheus_remote_write sink results in memory steadily increasing over time and the rate of statsd metrics received dropping off a cliff, causing large amounts of metrics to be dropped.

I haven't tested with other sources so can't say if statsd is the only source that causes this behavior.

(Two screenshots from 2022-11-18: memory usage and statsd metrics received over time.)

Steps to reproduce:

  1. Run config below in docker/kube vector container
  2. Run a statsd firehose that publishes distribution metrics to port 8125, for example https://github.com/derekhuizhang/statsd-firehose (example datagrams are shown after this list)
  3. Check memory usage over time
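
For reference, statsd datagrams along these lines are the kind that exercise the distribution path (names and values are made up; "d" is the DogStatsD distribution type, and as far as I know "h"/"ms" timings are also ingested as distribution metrics by the statsd source):

    test.latency:320|d
    test.latency:18|h|#env:dev
    test.request_time:42|ms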

If you remove the abort statement from the remap config below, memory usage will increase over time. With the abort statement, memory usage stays low.

Configuration

    api:
      address: 0.0.0.0:8686
      enabled: true
      playground: false
    data_dir: /var/lib/vector
    sinks:
      prometheus:
        type: prometheus_remote_write
        inputs:
          - remap
        endpoint: <redacted>
        healthcheck:
          enabled: false
        tls: <redacted>
    sources:
      statsd:
        address: 0.0.0.0:8125
        mode: udp
        type: statsd
    transforms:
      remap:
        type: remap
        inputs:
          - statsd
        source: |-
          if .type == "distribution" {
            abort
          }
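
For completeness, a filter transform should be an equivalent way to drop the distributions before they reach the sink, using the same .type check as in the remap above (untested sketch; the transform name is just illustrative, and the sink's inputs would point at it instead of remap):

    transforms:
      drop_distributions:
        type: filter
        inputs:
          - statsd
        condition: '.type != "distribution"'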

Version

0.24.0

Debug Output

No response

Example Data

No response

Additional Context

I specifically ran this on Kubernetes with the stateless-aggregator Helm chart, but it should have the same effect on Docker.
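
Roughly how it is deployed, in case that matters (a sketch from memory of the Helm values; the full config from the Configuration section above goes under customConfig):

    role: Stateless-Aggregator
    customConfig:
      sources:
        statsd:
          type: statsd
          address: 0.0.0.0:8125
          mode: udp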

References

No response

bruceg commented 1 year ago

As a sanity check, could I get you to run the same test scenario, but with a blackhole sink instead of prometheus_remote_write? I have a couple of ideas here but would like to confirm we're chasing the right scenario.
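
Something along these lines for the sink section, keeping the rest of the config the same (a sketch; the sink name is arbitrary):

    sinks:
      blackhole:
        type: blackhole
        inputs:
          - remap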

derekhuizhang commented 1 year ago

I've tested this before with a blackhole sink. Blackhole leads to no dropped metrics (statsd received stays consistently high) and no memory leak (memory stays low over days).

bruceg commented 1 year ago

Okay, thanks for that. We'll take a look.

derekhuizhang commented 1 year ago

More context:

tasinco commented 6 months ago

I have been using Vector in the same pattern as described above.

I have tried both statsd and datadog_agent sources, and both end up with increased memory usage. The datadog_agent source does seem to have a slightly lower memory impact.

We are running Vector v0.36.1.

Observations: restarting the upstream source (i.e. whatever is sending the statsd/datadog_agent messages) appears to make the memory increase faster. We do have expire_metrics_secs enabled at a fairly low value (30), with no improvement. The vector_utilization metric for the prometheus_remote_write sink appears to grow continuously as we send more data.
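
For reference, this is roughly how we have the expiry set, as a top-level (global) option in the Vector config (sketch; 30 is the value mentioned above):

    # Global option; metrics not updated within this many seconds should be
    # dropped from Vector's internal metric state.
    expire_metrics_secs: 30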