st-omarkhalid opened this issue 6 days ago
Hey! Diagnosing memory issues in Vector can be tricky. A few questions that may help: do you have any limits applied? And have you looked at the adaptive_concurrency_limit metric? I'm wondering if an increasing number of concurrent requests are causing pressure.

Thanks @jszwedko for the response. We don't have any limits applied but the pod typically hits OOM around 29GB. Yes, vector_adaptive_concurrency_limit_count does correlate with the memory growth.
Interesting, thanks for sharing that graph. It seems likely to me that the issue is that the concurrency limit is never finding a max. You could try to configure a max via request.adaptive_concurrency.max_concurrency_limit (e.g. https://vector.dev/docs/reference/configuration/sinks/vector/#request.adaptive_concurrency.max_concurrency_limit).
That did not help. Did I get it right?
vector_sink:
  type: vector
  inputs:
    - tiering.tier1
  address: https://<host>:8903
  request:
    adaptive_concurrency:
      max_concurrency_limit: 500
That looks right. Did you observe adaptive_concurrency_limit not exceeding this limit?
It went past it and the memory growth had the same behavior as before.
@jszwedko Wait - looking at the breakdown per sink type, it looks like most of the concurrency_limit values are coming from another sink which does not have the setting. Let me update the other sink and report back.
@jszwedko I set the Datadog sink to a limit of 100, but it's at 1K after 50 minutes and memory has grown linearly as well.
Hmm, can you share the config you are trying for the Datadog Logs sink?
Here
datadog_sink:
  type: datadog_metrics
  inputs:
    - remap_pod_*
  default_api_key: <key>
  site: datadoghq.com
  request:
    adaptive_concurrency:
      max_concurrency_limit: 100
That looks right. Are you confident that that is the sink that is exceeding the limit? All of the others are respecting it?
Here - vector gets to 55-65 and dd-metrics gets to 1k before the pod hits OOM.
Gotcha, thanks! That does look like it is exceeding the max. I'm having trouble reproducing this behavior locally though 😢
I'm running this config:
sources:
  source0:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
  source1:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
sinks:
  sink0:
    inputs:
      - source0
    type: datadog_metrics
    batch:
      max_events: 1
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5
  sink1:
    inputs:
      - source1
    type: datadog_metrics
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5
For source0 I'm limiting batch.max_events to 1 to cause it to make more requests than it otherwise would. I see the max for sink0 bubbling around 6. If I remove the limit it goes up to 50. I tried with both Vector v0.35.0 and v0.42.0.
For a moment I thought the limits had improved but that's still not the case. There were a couple of evicted pods which had lower limits, but otherwise the limits are way over 100 for the Datadog sink.
For context, we have over 300 sources (each with its own transform) in the config.
Is that graph cumulative across multiple nodes / sinks? E.g. do you have multiple datadog_metrics sinks in the same config?
That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup. The sawtooth pattern above is because the pod is getting OOM'd and replaced by a new one. The config has 3 sinks as pointed out in the issue description.
@jszwedko How long did you run your test? Was the limit stable at 6 for a long time?
I looked at another instance of the vector-agent in a different cluster (where again the limit is not applied). The limit reached 15K without hitting OOM - the number of sources in that cluster is a lot lower, but it still has 3 sinks.
On the problematic vector-agent instance I see large memory allocation across the remap components (vector_component_allocated_bytes).
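A per-component view of that allocation metric could be sketched with a query along these lines; this assumes the metric is scraped into Prometheus and carries a component_id label, and is illustrative rather than taken from the thread:
# top components by currently allocated bytes
topk(10, sum by (component_id) (vector_component_allocated_bytes))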
I'm not sure you want to plot _count; that would be the count of the number of data points in the histogram. Could you plot the max instead?
I only ran my test for maybe 10 minutes. I can try another run.
@jszwedko You mean max(vector_adaptive_concurrency_limit_count)? Otherwise I don't see a max metric for it.
Hmm. What system are you sending these metrics into? vector_adaptive_concurrency_limit is a histogram so you should be able to do things like taking the max, min, and average.
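For reference, with a Prometheus-style backend those aggregates can be approximated from the histogram series with queries roughly like these (standard _bucket/_sum/_count naming assumed; treat them as a sketch):
# approximate max over the last hour
histogram_quantile(1, sum by (le) (increase(vector_adaptive_concurrency_limit_bucket[1h])))
# average limit over the last hour
increase(vector_adaptive_concurrency_limit_sum[1h]) / increase(vector_adaptive_concurrency_limit_count[1h])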
The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see?
Nice, yeah, I see histogram_quantile(1, ...) gives you the max value. There we can see it does seem to be staying below or around the limit of 100.
My next hypothesis is that there might be backpressure causing requests to queue up in the source waiting to flush data downstream. Could you share a graph of vector_component_utilization and buffer_send_duration_seconds per component?
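Queries along these lines could produce those graphs, assuming Prometheus and the metric names used in this thread (exact names may differ between Vector versions):
# utilization per component (gauge between 0 and 1)
max by (component_id) (vector_component_utilization)
# p99 buffer send duration per component
histogram_quantile(0.99, sum by (le, component_id) (rate(vector_buffer_send_duration_seconds_bucket[5m])))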
Utilization has an interesting pattern - it spikes in the remap component and then settles down.
sum(histogram_quantile(0.99, increase(vector_buffer_send_duration_seconds_bucket[10m]))) by (component_type)
Is there a more explicit metric on the queue size? Maybe https://vector.dev/docs/reference/configuration/sources/internal_metrics/#buffer_events
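A per-component plot of that gauge could look something like the following, again assuming the vector_ metric prefix and component_id label used elsewhere in this thread:
# events currently sitting in each component's buffer
max by (component_id) (vector_buffer_events)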
The buffer size is pretty stable as well.
I also checked the ingress/egress difference and that's consistent as well.
This does appear to be a memory leak issue.
Yeah, it could be, though your config is fairly simple and it'd be surprising if a leak there weren't affecting a large number of users. I'll try to think about this some more, but if you are able to grab a memory profile using valgrind, that could be helpful to narrow down where the memory is being used.
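For what it's worth, a heap-profiling run under valgrind's massif tool might look roughly like this; the binary and config paths are illustrative, not taken from this setup:
# record heap snapshots while Vector runs (example paths)
valgrind --tool=massif --massif-out-file=vector.massif \
  vector --config /etc/vector/vector.yaml
# summarize the recorded snapshots afterwards
ms_print vector.massif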
vector doesn't even run for me under valgrind; it fails with a segfault.
==20819== Memcheck, a memory error detector
==20819== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==20819== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==20819== Command: vector/target/release/vector
==20819==
==20819== Invalid read of size 4
==20819== at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819== by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819== by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819== by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819== by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819== by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819== by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819== by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819== by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819== by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819== by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==20819==
==20819==
==20819== Process terminating with default action of signal 11 (SIGSEGV)
==20819== Access not within mapped region at address 0x64
==20819== at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819== by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819== by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819== by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819== by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819== by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819== by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819== by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819== by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819== by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819== by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819== If you believe this happened as a result of a stack
==20819== overflow in your program's main thread (unlikely but
==20819== possible), you can try to increase the size of the
==20819== main thread stack using the --main-stacksize= flag.
==20819== The main thread stack size used in this run was 108392448.
==20819==
==20819== HEAP SUMMARY:
==20819== in use at exit: 172,057 bytes in 471 blocks
==20819== total heap usage: 478 allocs, 7 frees, 174,129 bytes allocated
==20819==
==20819== LEAK SUMMARY:
==20819== definitely lost: 0 bytes in 0 blocks
==20819== indirectly lost: 0 bytes in 0 blocks
==20819== possibly lost: 0 bytes in 0 blocks
==20819== still reachable: 172,057 bytes in 471 blocks
==20819== suppressed: 0 bytes in 0 blocks
==20819== Reachable blocks (those to which a pointer was found) are not shown.
==20819== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==20819==
==20819== For lists of detected and suppressed errors, rerun with: -s
==20819== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
A note for the community
Problem
Vector-agent in our deployment shows constant memory growth until the pod hits OOM. This is happening continuously. I looked at a number of other open issues about the same problem, but it's not clear how to resolve it. In prod we have a lot more pipelines than shown below, though.
The metric vector_component_allocated_bytes shows that the remap-* components have the most memory allocated, and it is constantly growing.
Configuration
Version
0.35.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response