vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector sink - Allow setting max duration/max messages for a TCP connection #10728

Open alexandv opened 2 years ago

alexandv commented 2 years ago

Current Vector Version

0.19

Use-cases

When a Vector aggregator is deployed inside Kubernetes, it is often useful to scale it horizontally by running multiple replicas, usually exposed through a Kubernetes Service. If we use the vector sink to send logs to the aggregator through that Service, the TCP connection stays persistent (with both the v1 and v2 protocols), so each sender stays pinned to the pod it first connected to, which makes load balancing between the pods difficult. It would be useful to have a setting that limits the number of messages sent on a single TCP connection, so that the sink periodically reconnects and can land on a different target pod.
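To illustrate the problem described above, here is a minimal simulation sketch. It is not Vector code: the function name `simulate` and its parameters are hypothetical, and the Service is assumed to pick a replica in round-robin order at connect time purely to keep the output deterministic (a real Service may pick differently).

```python
from itertools import count


def simulate(num_clients, messages_per_client, replicas_before, replicas_after):
    """Count messages per replica when every client keeps one persistent connection."""
    # Assumption: a replica is chosen only when a connection is opened.
    # Round robin is used here only to make the result deterministic.
    next_pick = count()
    connections = {
        client: next(next_pick) % replicas_before for client in range(num_clients)
    }

    # The aggregator is scaled up after the connections were opened.
    delivered = {replica: 0 for replica in range(replicas_after)}

    # Nothing forces a reconnect, so every message follows the original connection.
    for replica in connections.values():
        delivered[replica] += messages_per_client
    return delivered


result = simulate(
    num_clients=4, messages_per_client=100, replicas_before=2, replicas_after=4
)
print(result)  # {0: 200, 1: 200, 2: 0, 3: 0}
```

The two replicas added by the scale-up receive no traffic, because no sender ever opens a new connection.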

Attempted Solutions

I tried both the Vector v1 and v2 protocols. The only workaround for now is to put a proxy such as Envoy in front of the aggregator and have it force a disconnection after a few seconds or a few messages.

Proposal

Introduce a maximum connection duration and/or a maximum number of messages, after which the sink reconnects.
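A rough sketch of what such a policy could look like, written in Python for illustration only. The class `ReconnectPolicy` and the option names `max_messages` and `max_duration_secs` are hypothetical and are not existing Vector settings; the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class ReconnectPolicy:
    """Decides when a sender should drop its connection and open a new one."""

    def __init__(self, max_messages=None, max_duration_secs=None, clock=time.monotonic):
        # Hypothetical option names; None disables the corresponding limit.
        self.max_messages = max_messages
        self.max_duration_secs = max_duration_secs
        self._clock = clock
        self.reset()

    def reset(self):
        """Call right after a new connection has been established."""
        self._sent = 0
        self._opened_at = self._clock()

    def record_send(self, n=1):
        """Call after n messages were sent on the current connection."""
        self._sent += n

    def should_reconnect(self):
        """True once either the message limit or the age limit is reached."""
        if self.max_messages is not None and self._sent >= self.max_messages:
            return True
        if self.max_duration_secs is not None:
            age = self._clock() - self._opened_at
            if age >= self.max_duration_secs:
                return True
        return False


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
policy = ReconnectPolicy(max_messages=3, max_duration_secs=10, clock=lambda: fake_now[0])

policy.record_send(2)
below_limit = policy.should_reconnect()   # two of three messages sent, age 0

policy.record_send()
hit_message_limit = policy.should_reconnect()

policy.reset()                            # pretend we reconnected
fake_now[0] = 10.0                        # ten seconds pass with no traffic
hit_age_limit = policy.should_reconnect()
```

A sender would check `should_reconnect()` between batches, close the connection when it returns true, and call `reset()` after reconnecting, which gives the Service a chance to route the new connection to a different pod.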

References

It is mentioned in this issue (https://github.com/vectordotdev/vector/issues/2070#issuecomment-841094902) that the gRPC connection is not long-lived, but after checking with tcpdump I can see that the TCP connection stays open.

Fluentd has keepalive_timeout for the forward plugin, which reestablishes the connection after the given amount of time: https://docs.fluentd.org/output/forward#keepalive_timeout

Fluent Bit has net.keepalive_max_recycle, which reestablishes the TCP connection after a given number of uses: https://docs.fluentbit.io/manual/administration/networking

joscha-alisch commented 1 year ago

We have the same issue (in our case with the socket sink, not http). Essentially, autoscaling in Kubernetes is useless right now, as the existing connections don't get dropped and the newly scaled-up instances don't get used at all. A setting like the one proposed here would be very useful for us!

nabokihms commented 1 year ago

@joscha-alisch, thanks for working on this one. We are stuck with the same problem in Kubernetes.

As you mentioned in the attached PR, the same approach should be applied to the vector v2 and socket sinks. I'm curious whether you plan to go ahead and add those fixes, because we are currently blocked from integrating Vector further.