vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Uncontrolled memory growth in aws_kinesis_streams #18397

Open m0untains opened 1 year ago

m0untains commented 1 year ago


Problem

We ship JSON-encoded Kubernetes container logs from EKS worker nodes to a Kinesis stream. We have noticed that in the presence of large container log files, the aws_kinesis_streams sink keeps increasing its memory usage until the pod gets OOMKilled.

Here is how to reproduce this:

First, launch a pod that generates a couple of hundred megabytes of console output, for example an Ubuntu container started with this command:

    command:
    - bash
    - -c
    - >
      yes "Some text" | head -n 10000000 ; sleep infinity

The log file size can be verified by logging onto a worker node:

root@worker001:/var/log/pods/vector_logs-pod-001_31f836e9-f35e-4a67-a291-87dd0507bd83/debug-node# ls -lh
total 391M
-rw-r----- 1 root root    0 Aug 26 16:41 0.log
-rw-r--r-- 1 root root 7.6M Aug 26 16:41 0.log.20230826-164135.gz
-rw-r----- 1 root root 383M Aug 26 16:41 0.log.20230826-164145
root@worker001:/var/log/pods/vector_logs-pod-001_31f836e9-f35e-4a67-a291-87dd0507bd83/debug-node#

In this case, the file in question is ~383M. The relevant part of the Vector configuration is given below. Running Vector with the --allocation-tracing option, we see the following from vector top:

[Screenshot: vector top output ("Vector OOM")]

In this example, we've set an 8Gi memory limit for the pod, which it exhausts in a matter of seconds. We've also experimented with setting request.rate_limit_num; with that in place, memory growth is a bit slower, but the pod still OOMs within a few minutes.

Interestingly, if we dial down the amount of pod output, e.g. by changing head -n 10000000 to head -n 100000 in the command above, memory rises temporarily to ~2GiB and then drops back down to ~23MiB after about a minute.

Is there anything obviously wrong we are overlooking here?

Configuration

data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    expire_metrics_secs: 86400

    sources:
      kubernetes_logs:
        type: kubernetes_logs
        use_apiserver_cache: true
        ignore_older_secs: 43200

    sinks:
      aws_kinesis_container:
        type: aws_kinesis_streams
        inputs: [kubernetes_logs]
        stream_name: ${EKS_CLUSTER_NAME}_container_logs
        region: ${AWS_REGION}
        encoding:
          codec: json
        request:
          rate_limit_num: 100

Version

vector 0.32.0 (x86_64-unknown-linux-gnu 1b403e1 2023-08-15 14:56:36.089460954)

Debug Output

No response

Example Data

No response

Additional Context

Vector is running as a daemonset in EKS, using an IAM role for the service account.
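
For reference, the IAM role is attached via an IRSA annotation on the service account, roughly like this (the service account name, account ID, and role name are placeholders):

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: vector                # illustrative; matches the daemonset's serviceAccountName
      namespace: vector
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/vector-kinesis-writer  # placeholder ARN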

References

No response

ViggyNash commented 1 year ago

Seeing the same behavior in 0.27.0 using aws_kinesis_firehose; I suspect it's the same underlying issue.

We have Vector agents on EC2 instances that idle at a few MB of memory usage, but when a large amount of data comes in (~380MB of audit logs in our test case), Vector immediately starts gobbling up all system memory and eventually the instance fully freezes. I've also tried a number of methods for throttling the throughput (request rate limit, disk buffer, adaptive concurrency with aggressive falloff; see the config sketch at the end of this comment), but either Vector becomes severely throttled, or memory growth merely slows and still climbs to the point of eating all system memory.

I've confirmed that this is an issue with the Kinesis sinks specifically: we are able to sink to an S3 bucket or a local file at full throughput with only a minimal increase in memory usage; it's only when sinking to a Kinesis stream that this happens. It also isn't a matter of Kinesis rate limits, since we can trigger it when writing to a Kinesis Firehose with an increased rate limit, and I've confirmed that the limit isn't hit while this happens.
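
For reference, the throttling knobs I experimented with on the firehose sink look roughly like this (the sink/source names and values are illustrative, and the adaptive-concurrency option names are taken from the docs for the version we run):

    sinks:
      audit_firehose:
        type: aws_kinesis_firehose
        inputs: [audit_logs]               # illustrative source name
        stream_name: audit-logs            # illustrative
        region: us-west-2
        encoding:
          codec: json
        buffer:
          type: disk
          max_size: 1073741824             # 1 GiB on-disk buffer
        request:
          rate_limit_num: 100              # request rate limit
          concurrency: adaptive
          adaptive_concurrency:
            decrease_ratio: 0.5            # back off more aggressively than the default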

m0untains commented 1 year ago

I think @jszwedko is on the right track regarding concurrency (in the comment from the linked issue above).

This change seems to work around the issue to some extent:

diff --git a/src/sinks/aws_kinesis/sink.rs b/src/sinks/aws_kinesis/sink.rs
index 0341c0e82..5ccdd10b4 100644
--- a/src/sinks/aws_kinesis/sink.rs
+++ b/src/sinks/aws_kinesis/sink.rs
@@ -40,7 +40,7 @@ where
     R: Record + Send + Sync + Unpin + Clone + 'static,
 {
     async fn run_inner(self: Box<Self>, input: BoxStream<'_, Event>) -> Result<(), ()> {
-        let request_builder_concurrency_limit = NonZeroUsize::new(50);
+        let request_builder_concurrency_limit = NonZeroUsize::new(1);

         let partition_key_field = self.partition_key_field.clone();

I tested a build with this change, and Vector was able to successfully churn through a large k8s log file with reasonable memory usage (~1GiB).

I can even set request_builder_concurrency_limit to 5 without memory exploding (it uses a bit more, but still < 1.5GiB).

That being said, I am not sure what the implications of this change are.

m0untains commented 1 year ago

One other peculiar item we've noticed:

We use the default setting of none for request.concurrency in the Kinesis sink; however, during the processing of large log files, Vector has quite a few simultaneous connections open to AWS (34.223.45.0/25, us-west-2). This is a capture from an ephemeral debug container attached to the Vector pod:

root@vector-wvczl:/# netstat -tnp | grep vector | grep ESTABLISHED | grep '34.223.45' | wc -l
92
root@vector-wvczl:/#

jszwedko commented 1 year ago

One other peculiar item we've noticed:

We use the default setting of none for request.concurrency in the Kinesis sink; however, during the processing of large log files, Vector has quite a few simultaneous connections open to AWS (34.223.45.0/25, us-west-2). This is a capture from an ephemeral debug container attached to the Vector pod:

root@vector-wvczl:/# netstat -tnp | grep vector | grep ESTABLISHED | grep '34.223.45' | wc -l
92
root@vector-wvczl:/#

I think the documentation might actually be wrong there and the default is, in fact, adaptive. You could try setting request.concurrency = 1 to limit it.
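
Applied to the configuration from the issue description, that would look like:

    sinks:
      aws_kinesis_container:
        type: aws_kinesis_streams
        inputs: [kubernetes_logs]
        stream_name: ${EKS_CLUSTER_NAME}_container_logs
        region: ${AWS_REGION}
        encoding:
          codec: json
        request:
          rate_limit_num: 100
          concurrency: 1   # cap the sink at a single in-flight request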

m0untains commented 1 year ago

Yes, you are correct. Setting request.concurrency = 1 yields:

root@vector-xptj6:/# netstat -tnp | grep vector | grep ESTABLISHED | grep '34.223.45' | wc -l
1
root@vector-xptj6:/#

So the default does seem to be adaptive. That being said, unfortunately even setting concurrency to 1 didn't solve the OOM issue. Memory growth is definitely slower, but it still rises steadily until it reaches the limit. So far the only effective workaround I've found is the change above, though I am unclear on its implications.

ViggyNash commented 1 year ago

Yes, you are correct. Setting request.concurrency = 1 yields:

root@vector-xptj6:/# netstat -tnp | grep vector | grep ESTABLISHED | grep '34.223.45' | wc -l
1
root@vector-xptj6:/#

So the default does seem to be adaptive. That being said, unfortunately even setting concurrency to 1 didn't solve the OOM issue. Memory growth is definitely slower, but it still rises steadily until it reaches the limit. So far the only effective workaround I've found is the change above, though I am unclear on its implications.

I tested the same and can confirm memory grows linearly with events processed, but is never released.

We've resorted to bypassing Kinesis for the time being and sinking directly to S3. This fortunately works well enough for our use case. Max memory usage with the S3 sink is ~2.5MB.
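
Roughly, the S3 sink we swapped in looks like this (bucket, key prefix, and source name are placeholders):

    sinks:
      audit_s3:
        type: aws_s3
        inputs: [audit_logs]              # illustrative source name
        bucket: example-audit-logs        # placeholder bucket
        key_prefix: vector/%Y/%m/%d/      # placeholder prefix
        region: us-west-2
        compression: gzip
        encoding:
          codec: json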

m0untains commented 1 year ago

I will note that 0.33.0 is better in this regard (it seems the default request builder concurrency was lowered from 50 to 8).

Also, there is this knob that has allowed us to work around the issue to an extent (by dialing VECTOR_EXPERIMENTAL_REQUEST_BUILDER_CONCURRENCY further down to 4).
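
For anyone else hitting this, we pass that knob through the daemonset's container env, roughly like this (excerpt only; the rest of the container spec is omitted):

    # Excerpt from the Vector DaemonSet container spec
    env:
      - name: VECTOR_EXPERIMENTAL_REQUEST_BUILDER_CONCURRENCY
        value: "4"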