Open m0untains opened 1 year ago
Seeing the same issue in 0.27.0 using aws_kinesis_firehose; I suspect it's the same root cause.
We have vector agents on EC2 instances that idle at a few MB of memory usage, but when a large amount of data comes in (~380 MB of audit logs in our test case), Vector immediately starts gobbling up all system memory and eventually our instance fully freezes. I've also tried a number of methods for throttling the throughput (request rate limit, disk buffer, adaptive concurrency with aggressive falloff), but either Vector becomes severely throttled, or the rate of memory growth slows yet still climbs to the point of eating all system memory.
I've confirmed that this is an issue with the Kinesis sink specifically, as we are able to sink to an S3 bucket or a local file at full throughput with a minimal increase in memory usage. It's only when sinking to a Kinesis stream that this happens. This also isn't an issue of Kinesis rate limits: we can trigger it when writing to a Kinesis Firehose with an increased rate limit, and I've confirmed that the rate limit isn't hit while this happens.
I think @jszwedko is on the right track regarding concurrency (in the comment from the linked issue above).
This change seems to work around the issue to some extent:
diff --git a/src/sinks/aws_kinesis/sink.rs b/src/sinks/aws_kinesis/sink.rs
index 0341c0e82..5ccdd10b4 100644
--- a/src/sinks/aws_kinesis/sink.rs
+++ b/src/sinks/aws_kinesis/sink.rs
@@ -40,7 +40,7 @@ where
R: Record + Send + Sync + Unpin + Clone + 'static,
{
async fn run_inner(self: Box<Self>, input: BoxStream<'_, Event>) -> Result<(), ()> {
- let request_builder_concurrency_limit = NonZeroUsize::new(50);
+ let request_builder_concurrency_limit = NonZeroUsize::new(1);
let partition_key_field = self.partition_key_field.clone();
I tested a build with this change, and Vector was able to successfully churn through a large k8s log file with reasonable memory usage (~1Gi).
I can even set request_builder_concurrency_limit to 5 without memory exploding (it uses a bit more, but still < 1.5Gi).
That being said, I am not sure what the implications are of this.
One other peculiar item we've noticed:
We use the default setting of none for request_concurrency in the kinesis sink; however, while processing large log files, Vector has quite a few simultaneous connections open to AWS (34.223.45.0/25, us-west-2). This is a capture from an ephemeral debug container attached to the vector pod:
root@vector-wvczl:/# netstat -tnp | grep vector | grep ESTABLISHED | grep '34.223.45' | wc -l
92
root@vector-wvczl:/#
I think the documentation might actually be wrong there and the default is, in fact, adaptive. You could try setting request.concurrency = 1 to limit it.
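A minimal sketch of what that looks like in the sink config (the stream name, region, and input name below are placeholders, not taken from this issue):

[sinks.kinesis_out]
type = "aws_kinesis_streams"
inputs = ["kubernetes_logs"]    # placeholder input
stream_name = "my-log-stream"   # placeholder stream name
region = "us-west-2"
encoding.codec = "json"
request.concurrency = 1         # fixed cap instead of the adaptive default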
Yes, you are correct. Setting request.concurrency = 1 yields:
root@vector-xptj6:/# netstat -tnp | grep vector | grep ESTABLISHED | grep '34.223.45' | wc -l
1
root@vector-xptj6:/#
So the default seems to be adaptive. That being said, unfortunately even setting concurrency to 1 didn't solve the OOM issue. Memory growth is definitely slower; however, it steadily rises until it reaches the limit. So far the only effective workaround I've found is the change above, but I am unclear about its implications.
I tested the same and can confirm memory grows linearly with events processed, but is never released.
We've resorted to bypassing Kinesis for the time being and sinking directly to S3. This fortunately works well enough for our use case. Max memory usage with the S3 sink is ~2.5 MB.
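For reference, a minimal sketch of the kind of aws_s3 sink we swapped in (bucket, region, and input name are placeholders):

[sinks.audit_logs_s3]
type = "aws_s3"
inputs = ["audit_logs"]          # placeholder input
bucket = "my-audit-log-bucket"   # placeholder bucket
region = "us-west-2"
compression = "gzip"
encoding.codec = "json"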
I will note that 0.33.0 is better in this regard (it seems the default request builder concurrency moved down from 50 to 8).
Also, there is a knob that has allowed us to work around this issue to an extent: dialing VECTOR_EXPERIMENTAL_REQUEST_BUILDER_CONCURRENCY further down to 4.
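As a rough sketch, that amounts to starting Vector with the variable set (the config path is a placeholder); in a Kubernetes deployment it would go into the container's env block instead:

export VECTOR_EXPERIMENTAL_REQUEST_BUILDER_CONCURRENCY=4   # dialed down from the new default of 8
vector --config /etc/vector/vector.yaml                    # placeholder config path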
A note for the community
Problem
We ship JSON-encoded Kubernetes container logs from EKS worker nodes to a Kinesis stream. We have noticed that in the presence of large container log files, the aws_kinesis_streams sink increases memory usage until it gets OOMKilled.
Here is how to reproduce this:
First, launch a pod that generates a couple of hundred megs of console output. You could do this by using an Ubuntu container launched with this command:
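Something along these lines works (the pod name and log line are placeholders; the line count matches the head -n 10000000 referenced below):

kubectl run log-generator --image=ubuntu --restart=Never -- \
  bash -c 'yes synthetic-audit-log-line-for-kinesis-sink-testing | head -n 10000000'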
The log file size can be verified by logging onto a worker node:
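For example (assuming the default containerd/kubelet log paths and the placeholder pod name above):

sudo ls -lh /var/log/containers/log-generator*        # symlinks into /var/log/pods
sudo du -sh /var/log/pods/default_log-generator_*/    # actual log files on disk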
In this case, the file in question is ~383M. The relevant part of the vector configuration is given below. Running this with the --allocation-tracing option, we see the following from vector top:
In this example, we've set an 8Gi limit for the pod, which it exhausts in a matter of seconds. We've also experimented with setting request.rate_limit_num. In this case, the memory growth is a bit slower, but it still OOMs in a matter of minutes.
Interestingly, if you dial down the amount of pod output, e.g. by changing head -n 10000000 to head -n 100000 in the command above, we can see memory rise temporarily to ~2GiB and then drop back down to ~23MiB after about a minute.
Is there anything obviously wrong we are overlooking here?
Configuration
Version
vector 0.32.0 (x86_64-unknown-linux-gnu 1b403e1 2023-08-15 14:56:36.089460954)
Debug Output
No response
Example Data
No response
Additional Context
Vector is running as a daemonset in EKS, using an IAM role for the service account.
References
No response