vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Using prometheus_remote_write sink with distribution results in memory leak #15295

Open derekhuizhang opened 1 year ago

derekhuizhang commented 1 year ago

Problem

The distribution metric type is not supported by Prometheus. Publishing distribution metrics to a statsd source that routes to a prometheus_remote_write sink results in memory steadily increasing over time and the rate of statsd metrics received dropping off a cliff, causing large amounts of metrics to be dropped.

I haven't tested with other sources so can't say if statsd is the only source that causes this behavior.

(Two screenshots from 2022-11-18: memory usage and statsd metrics received over time.)

Steps to reproduce:

  1. Run config below in docker/kube vector container
  2. Run a statsd firehose that publishes distribution metrics to port 8125, for example https://github.com/derekhuizhang/statsd-firehose (example datagrams are shown after this list)
  3. Check memory usage over time
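
For reference, statsd datagrams along these lines are the kind that exercise the distribution path (names and values are made up; "d" is the DogStatsD distribution type, and as far as I know "h"/"ms" timings are also ingested as distribution metrics by the statsd source):

    test.latency:320|d
    test.latency:18|h|#env:dev
    test.request_time:42|ms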

If you remove the abort statement from the remap config below, memory usage will increase over time. With the abort statement, memory usage stays low.

Configuration

    api:
      address: 0.0.0.0:8686
      enabled: true
      playground: false
    data_dir: /var/lib/vector
    sinks:
      prometheus:
        type: prometheus_remote_write
        inputs:
          - remap
        endpoint: <redacted>
        healthcheck:
          enabled: false
        tls: <redacted>
    sources:
      statsd:
        address: 0.0.0.0:8125
        mode: udp
        type: statsd
    transforms:
      remap:
        type: remap
        inputs:
          - statsd
        source: |-
          if .type == "distribution" {
            abort
          }
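
For completeness, a filter transform should be an equivalent way to drop the distributions before they reach the sink, using the same .type check as in the remap above (untested sketch; the transform name is just illustrative, and the sink's inputs would point at it instead of remap):

    transforms:
      drop_distributions:
        type: filter
        inputs:
          - statsd
        condition: '.type != "distribution"'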

Version

0.24.0

Debug Output

No response

Example Data

No response

Additional Context

I specifically ran this on Kubernetes with the stateless-aggregator Helm chart, but it should have the same effect on Docker.
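
Roughly how it is deployed, in case that matters (a sketch from memory of the Helm values; the full config from the Configuration section above goes under customConfig):

    role: Stateless-Aggregator
    customConfig:
      sources:
        statsd:
          type: statsd
          address: 0.0.0.0:8125
          mode: udp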

References

No response

bruceg commented 1 year ago

As a sanity check, could I get you to run the same test scenario, but with a blackhole sink instead of prometheus_remote_write? I have a couple of ideas here but would like to confirm we're chasing the right scenario.
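
Something along these lines for the sink section, keeping the rest of the config the same (a sketch; the sink name is arbitrary):

    sinks:
      blackhole:
        type: blackhole
        inputs:
          - remap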

derekhuizhang commented 1 year ago

I've tested this before with a blackhole sink. Blackhole leads to no dropped metrics (statsd received stays consistently high) and no memory leak (memory stays low over days).

bruceg commented 1 year ago

Okay, thanks for that. We'll take a look.

derekhuizhang commented 1 year ago

More context:

tasinco commented 6 months ago

I have been using Vector in the same pattern as described above.

I have tried both statsd and datadog_agent sources, and both end up with increased memory usage. The datadog_agent source does seem to have a slightly lower memory impact.

We are running Vector v0.36.1.

Observations: restarting the upstream source (i.e. whatever is sending the statsd/datadog_agent messages) appears to make the memory increase faster. We do have expire_metrics_secs enabled at a fairly low value (30), with no improvement. The vector_utilization metric for the prometheus_remote_write sink appears to grow continuously as we send more data.
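
For reference, this is roughly how we have the expiry set, as a top-level (global) option in the Vector config (sketch; 30 is the value mentioned above):

    # Global option; metrics not updated within this many seconds should be
    # dropped from Vector's internal metric state.
    expire_metrics_secs: 30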