influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Graylog output plugin: add connection TTL option #10367

Open alespour opened 2 years ago

alespour commented 2 years ago

Feature Request

Add a connection TTL option to avoid overloading a single node when using multiple load-balanced Graylog (GL) servers. Requested per a comment in PR #9644.

Generic problem description: https://www.ateam-oracle.com/post/long-lived-tcp-connections-and-load-balancers

Proposal:

Add connection TTL option.

Current behavior:

Connections are long-lived (keep-alive is set by default).

Desired behavior:

When the TTL option is set, connection(s) should be closed and re-established after the specified duration has elapsed.

Use case:

Proper load distribution among multiple load-balanced GL servers.

powersj commented 2 years ago

Next steps: look at adding a TTL option for the HTTPClient used with the graylog input plugin.

powersj commented 2 years ago

@tamirhad,

I'm curious whether you are actually seeing one node get all the traffic from Telegraf. You said that your configuration has a load balancer in place. Is that the IP address you use in your Telegraf config? Do you see Telegraf hitting different nodes, or sticking to the first one it touches?

I briefly looked at this, and there doesn't seem to be an obvious TTL setting for the HTTP client.

Thanks!

tamirhad commented 2 years ago

@powersj Actually, I don't remember the exact setup (whether I was using an LB or not), but what I do remember is that Telegraf kept sending traffic to one particular node when working with TCP (since the connection never closed).

powersj commented 2 years ago

That detail would be good to know: specifically where you saw this, with what config, and with or without a load balancer. I'm not sure off-hand what else Telegraf can do in this case, other than setting some sort of TTL option and then re-initializing the client when that TTL is hit.

tamirhad commented 2 years ago

In my setup, a TTL was the only way the source (Telegraf) could do something about it. I think it's the way to go.

powersj commented 2 years ago

I'm not sure I follow

tamirhad commented 2 years ago

Yeah, I'll be clearer. Besides a TTL, I don't think we have a way to influence this behavior from the Telegraf side. If I remember correctly, we were using an AWS NLB, which, in contrast to an AWS ALB, does not give the ability to set a TTL. Edit: what info would be useful? I might be able to get some configs and the topology.

powersj commented 2 years ago

I am trying to take a step back and understand the use case here. My understanding was that the graylog input plugin collects Graylog server metrics. As such, you would want to include in your Telegraf config a list of all the Graylog servers you want to monitor.

Did you have a different use case? If so,

1) what data are you collecting?
2) what was your graylog server layout?
3) did you have a load balancer?
4) what was your telegraf config? were you pointed at the load balancer or directly at a specific server?
5) what logs can you share that demonstrate telegraf was pointed at a single server the entire time?

Thanks in advance!

tamirhad commented 2 years ago

Actually I was talking about graylog output plug-in and not input:

https://github.com/influxdata/telegraf/tree/master/plugins/outputs/graylog

I'll try to provide the info I have asap.

powersj commented 2 years ago

:facepalm: thanks for clarifying :) even the issue title says output but my brain went input

powersj commented 2 years ago

@tamirhad - going through issues to size and prioritize. I was wondering if you ever got some info from your graylog server?

Thanks!

tamirhad commented 2 years ago

  1. We are collecting metrics from the statsd input plugin, but it doesn't matter; it could be any data coming from any input plugin.
  2. The topology is as follows: telegraf --> NLB --> graylog containers
  3. The connection from telegraf to the NLB uses TCP via the graylog output plugin (pointing to the NLB address).
  4. I can clearly see that the connection from telegraf stays open indefinitely (checked using the netcat command) until the graylog server sends a FIN to telegraf; then the connection terminates.

Edit, with a question: what would happen if we specified the same endpoint multiple times in the list of endpoints? Would we create multiple connections? If so, it could be an "ugly" solution for distributing the traffic (the NLB will route the connections to different graylog backends).