Vector-agent hits OOM every hour

st-omarkhalid commented 3 weeks ago

A note for the community

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector-agent in our deployment shows constant memory growth till the pod hits OOM. This is happening continuously. I looked at a number of other issues open already on the same problem but it's not clear how to resolve it. In prod we have a lot more pipelines than shown below though.

The metric vector_component_allocated_bytes shows remap-* components have the most memory allocation and constantly growing.

Configuration

data_dir: /var/lib/vector
expire_metrics_secs: 600
api:
  enabled: true
  address: 0.0.0.0:8686
  playground: true
sources:
  host_metric_source:
    type: host_metrics
    scrape_interval_secs: 60
  internal_metric_source:
    type: internal_metrics
    scrape_interval_secs: 60
  ip-node1:
    type: prometheus_scrape
    endpoints:
    - http://<ip>:9100/metrics
    scrape_interval_secs: 60
sinks:
  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
  prometheus_exporter:
    type: prometheus_exporter
    inputs:
    - internal_metric_source
    - host_metric_source
    address: 0.0.0.0:9598
  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true
    headers:
      auth-version: "1"
transforms:
  remap_node_ip-node1:
    type: remap
    inputs:
    - ip-node1
    source: |-
      .tags.nodename = "ip-node1"
      .tags.service_name = "nodeExporter"
  tiering:
    type: route
    inputs:
    - remap_*
    route:
      tier1: .name == .name
    reroute_unmatched: false

Version

0.35.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 3 weeks ago

Hey! Diagnosing memory issues in Vector can be tricky. A few questions that may help:

What memory limits do you have applied?
If you increase them, does the usage level out, or do you see it grow consistently regardless of the limit (indicating some sort of leak)?
Is it correlated with the adaptive_concurrency_limit metric? I'm wondering if an increasing number of concurrent requests are causing pressure

st-omarkhalid commented 3 weeks ago

Thanks @jszwedko for the response. We don't have any limits applied but the pod typically hit OOM around 29GB. Yes vector_adaptive_concurrency_limit_count does correlate with the memory growth. Screenshot 2024-10-30 at 9 12 11 AM

jszwedko commented 3 weeks ago

Interesting, thanks for sharing that graph. It seems likely to me that the issue is that the concurrency limit is never finding a max. You could try to configure a max via request.adaptive_concurrency.max_concurrency_limit (e.g. https://vector.dev/docs/reference/configuration/sinks/vector/#request.adaptive_concurrency.max_concurrency_limit).

st-omarkhalid commented 3 weeks ago

That did not help. Did I get it right?


  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      adaptive_concurrency:
        max_concurrency_limit: 500

jszwedko commented 3 weeks ago

That did not help. Did I get it right?

  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      adaptive_concurrency:
        max_concurrency_limit: 500

That looks right. Did you observe adaptive_concurrency_limit not exceeding this limit?

st-omarkhalid commented 3 weeks ago

It went past it and the memory growth had the same behavior as before. Screenshot 2024-10-30 at 12 58 13 PM

@jszwedko Wait - looking at the breakdown per sink type it looks like most of the concurrency_limit are coming from another sink which does not have the setting. Let me update the other sink and report back.

st-omarkhalid commented 3 weeks ago

@jszwedko I set the data-dog sink at a limit of 100 but it's at 1K after 50 mins and memory has grown linearly as well.

jszwedko commented 3 weeks ago

@jszwedko I set the data-dog sink at a limit of 100 but it's at 1K after 50 mins and memory has grown linearly as well.

Hmm, can you share the config you are trying for the Datadog Logs sink?

st-omarkhalid commented 3 weeks ago

Here


  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    default_api_key: <key>
    site: datadoghq.com
    request:
      adaptive_concurrency:
        max_concurrency_limit: 100

jszwedko commented 3 weeks ago

That looks right. Are you confident that that is the sink that is exceeding the limit? All of the others are respecting it?

st-omarkhalid commented 3 weeks ago

Here - vector gets to 55-65 and dd-metrics gets to 1k before the pod hit OOM. Screenshot 2024-10-30 at 3 10 35 PM

jszwedko commented 3 weeks ago

Gotcha, thanks! That does look like it is exceeding the max. I'm having trouble reproducing this behavior locally though 😢

I'm running this config:

sources:
  source0:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
  source1:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
sinks:
  sink0:
    inputs:
    - source0
    type: datadog_metrics
    batch:
      max_events: 1
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5
  sink1:
    inputs:
    - source1
    type: datadog_metrics
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5

For source0 I'm limiting batch.max_events to 1 to cause it to make more requests than it otherwise would. I see the max for sink0 bubbling around 6. If I remove the limit it goes up to 50. I tried with both Vector v0.35.0 and v0.42.0.

st-omarkhalid commented 3 weeks ago

For a moment I thought the limits had improved but that's still not the case. There were a couple of evicted pods which had lower limits but otherwise the limits are way over 100 for dd sink. Screenshot 2024-10-30 at 5 44 08 PM For context we have over 300 sources (each with its own transform) in the config.

jszwedko commented 3 weeks ago

Is that graph cumulative across multiple nodes / sinks? E.g. do you have multiple datadog_metrics sinks in the same config?

st-omarkhalid commented 3 weeks ago

That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup. The sawtooth pattern above is because the pod is getting OOM'd and replaced by a new one. The config has 3 sinks as pointed out in the issue description.

@jszwedko How long did you run your test? Was the limit stable at 6 for long time?

st-omarkhalid commented 3 weeks ago

I looked at another instance of the vector-agent in a different cluster (where again the limit is not applied). The limit reached to 15K without hitting OOM - the number of sources in that cluster is a lot less but it still has 3 sinks.

On the problematic vector-agent instance I see large memory allocation across remap components (vector_component_allocated_bytes) Screenshot 2024-10-31 at 6 00 38 PM

jszwedko commented 3 weeks ago

That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup.

I'm not sure you want to plot _count, this would be the count of the number of data points in the histogram. Could you plot the max instead?

I only ran my test for maybe 10 minutes. I can try another run.

st-omarkhalid commented 3 weeks ago

@jszwedko You mean max(vector_adaptive_concurrency_limit_count) - otherwise I don't see max metric for it. Screenshot 2024-10-31 at 9 50 24 AM

jszwedko commented 3 weeks ago

@jszwedko You mean max(vector_adaptive_concurrency_limit_count) - otherwise I don't see max metric for it.

Hmm. What system are you sending these metrics into? vector_adaptive_concurrency_limit is a histogram so you should be able to do things like taking the max, min, and average.

st-omarkhalid commented 3 weeks ago

The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see? Screenshot 2024-10-31 at 10 40 14 AM

jszwedko commented 3 weeks ago

The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see?

Nice, yeah, I see histogram_quantile(1, ...) gives you the max value. There we can see it does seem to be staying below or around the limit of 100.

My next hypothesis is that there might be backpressure causing requests to queue up in the source waiting to flush data downstream. Could you share a graph of vector_component_utilization and buffer_send_duration_seconds per component?

st-omarkhalid commented 3 weeks ago

Utilization has an interesting pattern - it spikes in the remap component and then settles down. Screenshot 2024-10-31 at 11 55 45 AM

st-omarkhalid commented 3 weeks ago

Screenshot 2024-10-31 at 12 28 02 PM sum(histogram_quantile(0.99, increase(vector_buffer_send_duration_seconds_bucket[10m]))) by (component_type)

Is there a more explicit metric on the queue size? Maybe https://vector.dev/docs/reference/configuration/sources/internal_metrics/#buffer_events Screenshot 2024-10-31 at 12 36 24 PM

st-omarkhalid commented 3 weeks ago

The buffer size is pretty stable as well. Screenshot 2024-11-01 at 9 53 19 AM

I also checked the ingress/egress difference and that consistent as well Screenshot 2024-11-01 at 9 50 25 AM

This does appear like a mem leak issue.

jszwedko commented 3 weeks ago

Yeah, it could be, though your config is fairly simple and it'd be surprising that if you had a leak that it wasn't affecting a large number of users. I'll try to think about this some more, but if you are able to grab a memory profile using valgrind that could be helpful to narrow down where the memory is being used.

st-omarkhalid commented 3 weeks ago

vector doesn't even run for me with valgrind - it fails with sig-fault.


==20819== Memcheck, a memory error detector
==20819== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==20819== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==20819== Command: vector/target/release/vector
==20819==
==20819== Invalid read of size 4
==20819==    at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819==    by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819==    by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819==    by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819==    by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819==    by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==20819==
==20819==
==20819== Process terminating with default action of signal 11 (SIGSEGV)
==20819==  Access not within mapped region at address 0x64
==20819==    at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819==    by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819==    by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819==    by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819==    by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819==    by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819==  If you believe this happened as a result of a stack
==20819==  overflow in your program's main thread (unlikely but
==20819==  possible), you can try to increase the size of the
==20819==  main thread stack using the --main-stacksize= flag.
==20819==  The main thread stack size used in this run was 108392448.
==20819==
==20819== HEAP SUMMARY:
==20819==     in use at exit: 172,057 bytes in 471 blocks
==20819==   total heap usage: 478 allocs, 7 frees, 174,129 bytes allocated
==20819==
==20819== LEAK SUMMARY:
==20819==    definitely lost: 0 bytes in 0 blocks
==20819==    indirectly lost: 0 bytes in 0 blocks
==20819==      possibly lost: 0 bytes in 0 blocks
==20819==    still reachable: 172,057 bytes in 471 blocks
==20819==         suppressed: 0 bytes in 0 blocks
==20819== Reachable blocks (those to which a pointer was found) are not shown.
==20819== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==20819==
==20819== For lists of detected and suppressed errors, rerun with: -s
==20819== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault

st-omarkhalid commented 2 weeks ago

@jszwedko So I trimmed down on the number of pipelines and found that this issue surfaces when I enable pipelines with high cardinality metrics and histograms. Is there a setting in vector that would force flush the metrics or ratelimit them?

Batches are flushed when 1 of 2 conditions are met:

The batch age meets or exceeds the configured timeout_secs.
The batch size meets or exceeds the configured max_bytes or max_events.

We have it set for one sink but not the other.

  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: <url>
    tls:
      alpn_protocols:
      - h2
      enabled: true
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true

What would be the default behaviour for dd sink above in terms of flushing the buffers? Also, is there a way to block/rate-limit specific metrics?

st-omarkhalid commented 2 weeks ago

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

pront commented 2 weeks ago

Regarding throttling, we have https://vector.dev/docs/reference/configuration/transforms/throttle/ but that doesn't work with metrics yet.

Also, I am not aware of anything that allows you to force-flush but you can set batch.max_events to something smaller like in https://github.com/vectordotdev/vector/issues/21655#issuecomment-2448391056.

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

There's no hard limit for this. It all depends how fast you produce/consume and your buffer strategy. If you are dropping events when the buffer is full, it shouldn't cause OOM.

st-omarkhalid commented 1 week ago

I got a heap profile and this is how a diff looks


Showing nodes accounting for 13.73GB, 94.84% of 14.47GB total
Dropped 190 nodes (cum <= 0.07GB)
      flat  flat%   sum%        cum   cum%
   13.73GB 94.84% 94.84%    13.73GB 94.84%  prof_backtrace_impl
         0     0% 94.84%     0.27GB  1.90%  <(FnA,FnB,FnC) as nom::sequence::Tuple<Input,(A,B,C),Error>>::parse
         0     0% 94.84%     0.22GB  1.52%  <F as nom::internal::Parser<I,O,E>>::parse
         0     0% 94.84%    13.11GB 90.58%  <T as futures_util::fns::FnMut1<A>>::call_mut
         0     0% 94.84%     0.33GB  2.30%  <alloc::collections::btree::map::BTreeMap<K,V,A> as core::clone::Clone>::clone::clone_subtree
         0     0% 94.84%     0.37GB  2.55%  <alloc::string::String as core::clone::Clone>::clone
         0     0% 94.84%     0.35GB  2.44%  <alloc::vec::Vec<T,A> as core::clone::Clone>::clone
         0     0% 94.84%    13.17GB 91.00%  <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::futures_unordered::FuturesUnordered<Fut> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::Flatten<St> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.10GB 90.52%  <futures_util::stream::stream::filter_map::FilterMap<St,Fut,F> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::flatten_unordered::FlattenUnorderedWithFlowController<St,Fc> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::flatten_unordered::PollStreamFut<St> as core::future::future::Future>::poll
         0     0% 94.84%    13.73GB 94.85%  <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
         0     0% 94.84%     0.18GB  1.27%  <tower::limit::rate::service::RateLimit<S> as tower_service::Service<Request>>::call
         0     0% 94.84%     0.18GB  1.27%  <tower::retry::Retry<P,S> as tower_service::Service<Request>>::call
         0     0% 94.84%    13.73GB 94.83%  <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
         0     0% 94.84%     0.18GB  1.27%  <vector::sinks::util::retries::FibonacciRetryPolicy<L> as tower::retry::policy::Policy<Req,Res,alloc::boxed::Box<dyn core::error::Error+core::marker::Send+core::marker::Sync>>>::clone_request
         0     0% 94.84%     0.19GB  1.30%  <vector::sinks::vector::sink::VectorSink<S> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}
         0     0% 94.84%    13.11GB 90.57%  <vector::sources::prometheus::scrape::PrometheusScrapeContext as vector::sources::util::http_client::HttpClientContext>::on_response
         0     0% 94.84%     0.19GB  1.30%  <vector::transforms::remap::Remap<Runner> as vector_core::transform::SyncTransform>::transform
         0     0% 94.84%     2.87GB 19.81%  <vector_core::event::metadata::EventMetadata as core::default::Default>::default
         0     0% 94.84%     1.60GB 11.03%  <vector_core::event::metadata::Inner as core::default::Default>::default
         0     0% 94.84%     0.16GB  1.11%  <vector_core::event::metric::Metric as core::clone::Clone>::clone
         0     0% 94.84%     7.04GB 48.67%  <vector_core::event::metric::tags::MetricTags as core::convert::From<alloc::collections::btree::map::BTreeMap<alloc::string::String,alloc::string::String>>>::from
         0     0% 94.84%     0.18GB  1.24%  <vector_core::event::vrl_target::VrlTarget as vrl::compiler::target::Target>::target_insert
         0     0% 94.84%     0.17GB  1.19%  <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}
         0     0% 94.84%     0.18GB  1.24%  <vrl::compiler::expression::Expr as vrl::compiler::expression::Expression>::resolve
         0     0% 94.84%     0.18GB  1.24%  <vrl::compiler::expression::assignment::Assignment as vrl::compiler::expression::Expression>::resolve
         0     0% 94.84%    13.73GB 94.84%  [libc.so.6]
         0     0% 94.84%    11.05GB 76.35%  _rjem_je_malloc_default
         0     0% 94.84%    13.73GB 94.84%  _rjem_je_prof_backtrace
         0     0% 94.84%    13.73GB 94.84%  _rjem_je_prof_tctx_create
         0     0% 94.84%     0.13GB  0.89%  alloc::collections::btree::map::BTreeMap<K,V,A>::insert
         0     0% 94.84%     7.17GB 49.55%  alloc::collections::btree::map::entry::VacantEntry<K,V,A>::insert
         0     0% 94.84%     0.13GB  0.89%  alloc::collections::btree::node::Handle<alloc::collections::btree::node::NodeRef<alloc::collections::btree::node::marker::Mut,K,V,alloc::collections::btree::node::marker::Leaf>,alloc::collections::btree::node::marker::Edge>::insert_recursing
         0     0% 94.84%     2.67GB 18.41%  alloc::raw_vec::RawVec<T,A>::grow_one
         0     0% 94.84%     0.24GB  1.65%  alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
         0     0% 94.84%     2.90GB 20.06%  alloc::raw_vec::finish_grow
         0     0% 94.84%    13.73GB 94.85%  core::ops::function::FnOnce::call_once{{vtable.shim}}
         0     0% 94.84%     2.68GB 18.49%  do_rallocx
         0     0% 94.84%    13.13GB 90.73%  futures_util::stream::stream::StreamExt::poll_next_unpin
         0     0% 94.84%     0.27GB  1.90%  prometheus_parser::line::Line::parse
         0     0% 94.84%     0.27GB  1.90%  prometheus_parser::parse_text
         0     0% 94.84%    13.73GB 94.84%  std::sys::backtrace::_rust_begin_short_backtrace
         0     0% 94.84%    13.73GB 94.85%  std::sys::pal::unix::thread::Thread::new::thread_start
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::blocking::pool::Inner::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::context::runtime::enter_runtime
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::Context::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::Context::run_task
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::task::core::Core<T,S>::poll
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::task::harness::Harness<T,S>::poll
         0     0% 94.84%    13.18GB 91.06%  vector::source_sender::SourceSender::send_event_stream::{{closure}}
         0     0% 94.84%    12.83GB 88.67%  vector::sources::prometheus::parser::reparse_groups
         0     0% 94.84%    13.18GB 91.06%  vector::sources::util::http_client::call::{{closure}}
         0     0% 94.84%     0.17GB  1.19%  vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}
         0     0% 94.84%    13.18GB 91.04%  vector::topology::builder::Builder::build_sources::{{closure}}::{{closure}}

jszwedko commented 1 week ago

@jszwedko So I trimmed down on the number of pipelines and found that this issue surfaces when I enable pipelines with high cardinality metrics and histograms. Is there a setting in vector that would force flush the metrics or ratelimit them?

Can you share the slimmed down configuration that still reproduces the issue? That would help narrow the search space.

jszwedko commented 1 week ago

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

If the total cardinality was an issue, I'd expect Vector to hit that wall right away rather than the steady increase you seem to be seeing 🤔

jszwedko commented 1 week ago

@st-omarkhalid do you know if the metrics are "churning"? That is: say we have 4M metrics at time t1, at time t2 could we have a different set of metrics even if the cardinality is still a total of 4M?

st-omarkhalid commented 1 week ago

@jszwedko Looks like there a correlation with receive errors. I noticed some timeouts in the logs so adjusted the scrape-timeout which not only resolved the errors but also the memory growth. Screenshot 2024-11-15 at 7 44 12 AM This definitely looks like a bug in the agent. The requests could fail for many reasons but that should not impact the agent.

st-omarkhalid commented 1 week ago

Another related issue. With the following vector-sink the metrics looked spotty in our dashboards. I noticed that the disk buffer would get filled up within 30mins so maybe due to drop_newest it would simply drop the new events.

vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: <url>
    tls:
      alpn_protocols:
      - h2
      enabled: true
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true

I tried the following config instead

buffer:
      type: disk
      when_full: block
      max_size: 5368709120

thinking it would slow down the ingestion once the buffer was filled up. Unfortunately this again lead to the memory growth.

Screenshot 2024-11-15 at 1 52 00 PM Not sure how to debug this further.

jszwedko commented 1 week ago

@jszwedko Looks like there a correlation with receive errors. I noticed some timeouts in the logs so adjusted the scrape-timeout which not only resolved the errors but also the memory growth.

Hmm, that is interesting. It sounds like some requests may have been stacking up in the prometheus_scape and not being reclaimed when they timed out, but that would be surprising.

thinking it would slow down the ingestion once the buffer was filled up. Unfortunately this again lead to the memory growth.

I think what is happening here is that the sink cannot send the data fast enough and so it is piling up first in the buffers and then in the scraping source, which I believe will continue scraping on the interval even if it can't flush the data downstream (this is probably suboptimal; I think the source should stop scraping the previous request hasn't been flushed downstream yet). To fix this, I think you'll need to adjust the sink to meet the throughput requirements of your source. Some ideas to do this:

Set request.concurrency to a high static number
If that isn't enough and you are bottlenecked on another resource, increase that resource (e.g. CPU, memory, or network bandwidth)

st-omarkhalid commented 1 week ago

@jszwedko You're right the egress volume's not keeping up with the ingress causes the agent to hit OOM. However for block it seems to be a bug as it would cause data-loss for events which have not been flushed to disk. Right?

I think there are two bugs here:

prometheus_scape causes OOM when there are requests timeouts.
Again, the prometheus_scape would hit OOM with block buffer when ingress is greater than egress volume.

How could we track these two issues? Otherwise this thread could be closed.

pront commented 1 week ago

Hi @st-omarkhalid, OOM issues have been historically quite tricky to debug since we don't have identical environments to reproduce them and there are so many variables involved. I really appreciate the details you included in this issue. It would be beneficial to isolate concerns. If you don't mind you can create a new bug issue for each bug and summarize the situation there. For example, I am not sure what config we are talking about for bug (2).

st-omarkhalid commented 6 days ago

@pront I will open two separate issues for these bugs. One follow up though - when I set request.concurrency how does the vector-agent guarantee order of the events. Say one batch is send to the vector-server but fails and is retried but another batch with newer events is processed first by the vector-server. How will these two batches be ordered in the end?

jszwedko commented 6 days ago

@jszwedko You're right the egress volume's not keeping up with the ingress causes the agent to hit OOM. However for block it seems to be a bug as it would cause data-loss for events which have not been flushed to disk. Right?

I think there are two bugs here:

prometheus_scape causes OOM when there are requests timeouts.

Agreed. I think we need to verify that the prometheus_scrape is dropping requests that time out correctly. It seems like, based on your report, that they may be leaking.

Again, the prometheus_scape would hit OOM with block buffer when ingress is greater than egress volume.

How could we track these two issues? Otherwise this thread could be closed.

Agreed. The prometheus_scrape source should not scrape again if a previous scrape is still pending to avoid stacking scrapes.

@pront I will open two separate issues for these bugs. One follow up though - when I set request.concurrency how does the vector-agent guarantee order of the events. Say one batch is send to the vector-server but fails and is retried but another batch with newer events is processed first by the vector-server. How will these two batches be ordered in the end?

Batches are sent concurrently so it is up to the downstream system to order events by timestamp. You are already sending concurrent requests, it is just using the default of "adaptive concurrency" where it tries to to algorithmically determine the optimal concurrency (which is reported by the vector_adaptive_concurrency_limit histogram we were looking at previously). I'm suggesting setting it to a static value in case the algorithm isn't finding the optimal concurrency and you could be pushing it higher. https://vector.dev/docs/about/under-the-hood/networking/arc/ describes the adaptive concurrency mechanism that is used by default.

vectordotdev / vector