vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Vector-agent hits OOM every hour #21655

Open st-omarkhalid opened 6 days ago

st-omarkhalid commented 6 days ago


Problem

The Vector agent in our deployment shows constant memory growth until the pod hits OOM, and this happens continuously. I looked at a number of other open issues on the same problem, but it's not clear how to resolve it. In production we have many more pipelines than shown below.

The metric vector_component_allocated_bytes shows that the remap_* components account for most of the allocated memory, and their allocation is constantly growing.
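For reference, a PromQL query along these lines shows that breakdown (a sketch, assuming the internal metrics are exported to Prometheus through the prometheus_exporter sink defined below; the label names are Vector's defaults):

# Top 10 components by currently allocated bytes
topk(10, sum by (component_id, component_type) (vector_component_allocated_bytes))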

Configuration

data_dir: /var/lib/vector
expire_metrics_secs: 600
api:
  enabled: true
  address: 0.0.0.0:8686
  playground: true
sources:
  host_metric_source:
    type: host_metrics
    scrape_interval_secs: 60
  internal_metric_source:
    type: internal_metrics
    scrape_interval_secs: 60
  ip-node1:
    type: prometheus_scrape
    endpoints:
    - http://<ip>:9100/metrics
    scrape_interval_secs: 60
sinks:
  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
  prometheus_exporter:
    type: prometheus_exporter
    inputs:
    - internal_metric_source
    - host_metric_source
    address: 0.0.0.0:9598
  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true
    headers:
      auth-version: "1"
transforms:
  remap_node_ip-node1:
    type: remap
    inputs:
    - ip-node1
    source: |-
      .tags.nodename = "ip-node1"
      .tags.service_name = "nodeExporter"
  tiering:
    type: route
    inputs:
    - remap_*
    route:
      tier1: .name == .name
    reroute_unmatched: false

Version

0.35.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 5 days ago

Hey! Diagnosing memory issues in Vector can be tricky. A few questions that may help: do you have any memory limits applied to the pod, and roughly how much memory is it using when it hits OOM? Also, does vector_adaptive_concurrency_limit correlate with the memory growth?

st-omarkhalid commented 5 days ago

Thanks @jszwedko for the response. We don't have any limits applied, but the pod typically hits OOM around 29 GB. Yes, vector_adaptive_concurrency_limit_count does correlate with the memory growth. (screenshot attached)

jszwedko commented 5 days ago

Interesting, thanks for sharing that graph. It seems likely to me that the issue is that the concurrency limit is never finding a max. You could try to configure a max via request.adaptive_concurrency.max_concurrency_limit (e.g. https://vector.dev/docs/reference/configuration/sinks/vector/#request.adaptive_concurrency.max_concurrency_limit).

st-omarkhalid commented 5 days ago

That did not help. Did I get it right?


  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      adaptive_concurrency:
        max_concurrency_limit: 500
jszwedko commented 5 days ago

That did not help. Did I get it right?

  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      adaptive_concurrency:
        max_concurrency_limit: 500

That looks right. Did you observe adaptive_concurrency_limit not exceeding this limit?

st-omarkhalid commented 5 days ago

It went past it, and the memory growth showed the same behavior as before. (screenshot attached)

@jszwedko Wait - looking at the breakdown per sink type, it looks like most of the concurrency_limit values are coming from another sink that does not have the setting. Let me update the other sink and report back.

st-omarkhalid commented 5 days ago

@jszwedko I set the Datadog sink to a limit of 100, but it's at 1K after 50 minutes and memory has grown linearly as well.

jszwedko commented 5 days ago

@jszwedko I set the Datadog sink to a limit of 100, but it's at 1K after 50 minutes and memory has grown linearly as well.

Hmm, can you share the config you are trying for the Datadog Metrics sink?

st-omarkhalid commented 5 days ago

Here


  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    default_api_key: <key>
    site: datadoghq.com
    request:
      adaptive_concurrency:
        max_concurrency_limit: 100
jszwedko commented 5 days ago

That looks right. Are you confident that that is the sink exceeding the limit? Are all of the others respecting it?

st-omarkhalid commented 5 days ago

Here: the vector sink gets to 55-65 and the datadog_metrics sink gets to 1k before the pod hits OOM. (screenshot attached)

jszwedko commented 5 days ago

Gotcha, thanks! That does look like it is exceeding the max. I'm having trouble reproducing this behavior locally though 😢

I'm running this config:

sources:
  source0:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
  source1:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
sinks:
  sink0:
    inputs:
    - source0
    type: datadog_metrics
    batch:
      max_events: 1
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5
  sink1:
    inputs:
    - source1
    type: datadog_metrics
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5

For sink0 I'm limiting batch.max_events to 1 to cause it to make more requests than it otherwise would. I see the max for sink0 bubbling around 6. If I remove the limit it goes up to 50. I tried with both Vector v0.35.0 and v0.42.0.

st-omarkhalid commented 5 days ago

For a moment I thought the limits had improved, but that's still not the case. There were a couple of evicted pods which had lower limits, but otherwise the limits are way over 100 for the Datadog sink. (screenshot attached) For context, we have over 300 sources (each with its own transform) in the config.

jszwedko commented 5 days ago

Is that graph cumulative across multiple nodes / sinks? E.g. do you have multiple datadog_metrics sinks in the same config?

st-omarkhalid commented 5 days ago

That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup. The sawtooth pattern above is because the pod is getting OOM'd and replaced by a new one. The config has 3 sinks as pointed out in the issue description.

@jszwedko How long did you run your test? Was the limit stable at 6 for a long time?

st-omarkhalid commented 4 days ago

I looked at another instance of the vector-agent in a different cluster (where, again, the limit is not applied). The limit reached 15K without hitting OOM; the number of sources in that cluster is a lot smaller, but it still has 3 sinks.

On the problematic vector-agent instance I see large memory allocations across the remap components (vector_component_allocated_bytes). (screenshot attached)

jszwedko commented 4 days ago

That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup.

I'm not sure you want to plot _count; that would be the number of data points in the histogram. Could you plot the max instead?

I only ran my test for maybe 10 minutes. I can try another run.

st-omarkhalid commented 4 days ago

@jszwedko Do you mean max(vector_adaptive_concurrency_limit_count)? Otherwise I don't see a max metric for it. (screenshot attached)

jszwedko commented 4 days ago

@jszwedko Do you mean max(vector_adaptive_concurrency_limit_count)? Otherwise I don't see a max metric for it. (screenshot attached)

Hmm. What system are you sending these metrics into? vector_adaptive_concurrency_limit is a histogram so you should be able to do things like taking the max, min, and average.
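For example, something along these lines should give the per-component max (a sketch, assuming the metrics are scraped into Prometheus with Vector's default bucket labels):

# Approximate max of the adaptive concurrency limit per component over the last 5 minutes
histogram_quantile(1, sum by (component_id, le) (rate(vector_adaptive_concurrency_limit_bucket[5m])))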

st-omarkhalid commented 4 days ago

The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see? (screenshot attached)

jszwedko commented 4 days ago

The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see?

Nice, yeah, I see histogram_quantile(1, ...) gives you the max value. There we can see it does seem to be staying below or around the limit of 100.

My next hypothesis is that there might be backpressure causing requests to queue up in the source waiting to flush data downstream. Could you share a graph of vector_component_utilization and buffer_send_duration_seconds per component?
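For example (a rough sketch, assuming these land in Prometheus; the metric names follow the ones referenced above and may differ slightly between Vector versions):

# Per-component utilization gauge (values pinned near 1 suggest a saturated component)
max by (component_id) (vector_component_utilization)
# p99 of buffer send duration per component over the last 10 minutes
histogram_quantile(0.99, sum by (component_id, le) (rate(vector_buffer_send_duration_seconds_bucket[10m])))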

st-omarkhalid commented 4 days ago

Utilization has an interesting pattern: it spikes in the remap component and then settles down. (screenshot attached)

st-omarkhalid commented 4 days ago

sum(histogram_quantile(0.99, increase(vector_buffer_send_duration_seconds_bucket[10m]))) by (component_type) (screenshot attached)

Is there a more explicit metric for the queue size? Maybe https://vector.dev/docs/reference/configuration/sources/internal_metrics/#buffer_events (screenshot attached)
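A sketch of how that could be plotted, assuming the same Prometheus setup (buffer_events is exported as a per-component gauge):

# Events currently sitting in each component's buffer
sum by (component_id) (vector_buffer_events)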

st-omarkhalid commented 3 days ago

The buffer size is pretty stable as well. (screenshot attached)

I also checked the ingress/egress difference and that is consistent as well. (screenshot attached)

This does look like a memory leak issue.

jszwedko commented 3 days ago

Yeah, it could be, though your config is fairly simple and it would be surprising if a leak there weren't affecting a large number of users. I'll try to think about this some more, but if you are able to grab a memory profile using valgrind, that could help narrow down where the memory is being used.

st-omarkhalid commented 1 day ago

Vector doesn't even run for me under valgrind; it fails with a segfault.


==20819== Memcheck, a memory error detector
==20819== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==20819== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==20819== Command: vector/target/release/vector
==20819==
==20819== Invalid read of size 4
==20819==    at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819==    by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819==    by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819==    by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819==    by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819==    by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==20819==
==20819==
==20819== Process terminating with default action of signal 11 (SIGSEGV)
==20819==  Access not within mapped region at address 0x64
==20819==    at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819==    by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819==    by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819==    by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819==    by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819==    by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819==  If you believe this happened as a result of a stack
==20819==  overflow in your program's main thread (unlikely but
==20819==  possible), you can try to increase the size of the
==20819==  main thread stack using the --main-stacksize= flag.
==20819==  The main thread stack size used in this run was 108392448.
==20819==
==20819== HEAP SUMMARY:
==20819==     in use at exit: 172,057 bytes in 471 blocks
==20819==   total heap usage: 478 allocs, 7 frees, 174,129 bytes allocated
==20819==
==20819== LEAK SUMMARY:
==20819==    definitely lost: 0 bytes in 0 blocks
==20819==    indirectly lost: 0 bytes in 0 blocks
==20819==      possibly lost: 0 bytes in 0 blocks
==20819==    still reachable: 172,057 bytes in 471 blocks
==20819==         suppressed: 0 bytes in 0 blocks
==20819== Reachable blocks (those to which a pointer was found) are not shown.
==20819== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==20819==
==20819== For lists of detected and suppressed errors, rerun with: -s
==20819== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault