Add support for an optional connection lifetime

perzizzle commented 4 years ago

Feature Request

Proposal:

Add an optional, configurable maximum connection lifetime.

Current behavior:

We currently have an F5 virtual IP address in front of our InfluxDB nodes. All of our Telegraf agents connect to this IP address, and the F5 forwards the traffic on to InfluxDB. Because these connections are persistent and are only recreated when the Telegraf agent restarts, a restarted InfluxDB node no longer receives any writes. For example, with 3 nodes (X, Y, Z), a rolling restart leaves the writes unbalanced: after restarting X, all writes go to Y and Z; after restarting Y, all writes go to X and Z; after restarting Z, all writes go to X and Y. In the end state, all writes are on X and Y. If we could set a maximum connection lifetime, this would gradually correct itself as Telegraf created new connections to the F5.
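To make the end state concrete, here is a minimal Go sketch of the rolling restart described above. It is a simulation only, under stated assumptions: the balancer assigns each persistent connection once, a connection moves only when its node goes down, and displaced connections are spread round-robin over the remaining nodes. The function name `simulateRollingRestart` and the client count are made up for illustration.

```go
package main

import "fmt"

// simulateRollingRestart models a balancer that balances per TCP connection:
// a client is reassigned only when the node it is connected to goes down.
// It returns the number of connections per node after every node has been
// restarted once, in order.
func simulateRollingRestart(nodes []string, clients int) map[string]int {
	// Start with the connections spread round-robin across all nodes.
	assigned := make([]string, clients)
	for i := range assigned {
		assigned[i] = nodes[i%len(nodes)]
	}

	// Restart each node in turn.
	for _, down := range nodes {
		var up []string
		for _, n := range nodes {
			if n != down {
				up = append(up, n)
			}
		}
		if len(up) == 0 {
			continue // a single-node setup has nowhere to move connections
		}
		// Only the connections on the restarted node move; the node comes
		// back empty because no existing connection is ever re-balanced.
		next := 0
		for i, n := range assigned {
			if n == down {
				assigned[i] = up[next%len(up)]
				next++
			}
		}
	}

	counts := make(map[string]int)
	for _, n := range nodes {
		counts[n] = 0
	}
	for _, n := range assigned {
		counts[n]++
	}
	return counts
}

func main() {
	counts := simulateRollingRestart([]string{"X", "Y", "Z"}, 9)
	fmt.Println(counts)
}
```

In this model the last node restarted (Z) always ends with zero connections, which matches the X, Y end state above.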

Desired behavior:

An option in telegraf.conf to set a maximum HTTP connection lifetime.
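As a rough illustration of what such an option could do on the client side, here is a minimal Go sketch. It is not Telegraf code: `lifetimeTransport`, `newLifetimeTransport`, and `expired` are hypothetical names, and the approach (calling the standard library's `Transport.CloseIdleConnections` once the configured lifetime has elapsed) is only one possible approximation. It closes idle keep-alive connections, so a connection that is busy at that moment survives until a later check.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// expired reports whether connections of the given age have outlived limit.
// A limit of zero or less means "no limit", so the setting stays optional.
func expired(age, limit time.Duration) bool {
	return limit > 0 && age >= limit
}

// lifetimeTransport is a hypothetical wrapper (not part of Telegraf) that
// drops idle connections once maxLifetime has passed since the last reset.
type lifetimeTransport struct {
	base        *http.Transport
	maxLifetime time.Duration

	mu        sync.Mutex
	lastReset time.Time
}

func newLifetimeTransport(base *http.Transport, maxLifetime time.Duration) *lifetimeTransport {
	return &lifetimeTransport{base: base, maxLifetime: maxLifetime, lastReset: time.Now()}
}

// RoundTrip implements http.RoundTripper.
func (t *lifetimeTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.mu.Lock()
	if expired(time.Since(t.lastReset), t.maxLifetime) {
		// Closes idle keep-alive connections only; the next request dials a
		// new connection, which the load balancer can place on any node.
		t.base.CloseIdleConnections()
		t.lastReset = time.Now()
	}
	t.mu.Unlock()
	return t.base.RoundTrip(req)
}

func main() {
	client := &http.Client{
		Transport: newLifetimeTransport(&http.Transport{}, 10*time.Minute),
	}
	_ = client // use client.Post(...) etc. as usual

	fmt.Println(expired(11*time.Minute, 10*time.Minute))
}
```

The reset is tracked per transport rather than per connection, which keeps the sketch short at the cost of precision.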

Use case:

Currently, whenever we need to restart InfluxDB, we run automation that restarts every Telegraf agent in our environment. That means restarting thousands of agents, which is less than ideal. We are working on putting Kafka between our Telegraf agents and the load balancer, which would decrease the number of agents we need to restart, but it is still more of a hack than a solution.

danielnelson commented 4 years ago

I don't think this is a feature we should provide on the Telegraf side; it should be handled by the load balancer instead. I don't know much about F5, but it seems you might be able to modify the "session persistence" settings. You will want to set the balancer to load balance per HTTP request.

https://techdocs.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/ltm-profiles-reference-13-0-0/4.html

perzizzle commented 4 years ago

My understanding is that those persistence settings are for when you want to always route to the same underlying service because it's stateful (e.g. an authentication cookie).

I believe Telegraf uses a persistent HTTP connection, so I can't load balance per HTTP request. That is why Influx's cloud offering uses DNS-based load balancing and sets GODEBUG=http2client=0.

danielnelson commented 4 years ago

I've never used F5, and I'm not 100% sure this applies to the product you have, but for clarity this is the piece of information I'm directly referring to:

By default, the BIG-IP system performs load balancing for each TCP connection, rather than for each HTTP request. After the initial TCP connection is load balanced, the system sends all HTTP requests seen on the same connection to the same pool member. You can change this behavior if you want the system can make a new load balancing decision according to changing persistence information in HTTP requests. You do this by configuring a OneConnect™ profile and assigning the profile to a virtual server. A OneConnect profile causes the system to detach server-side connections so that the system can perform load balancing for each request within the TCP connection and send the HTTP requests to different destination servers if necessary.

According to this, the default is to send to the same server, which would be nice for stateful connections but is unneeded when sending to InfluxDB. Switching to per-HTTP-request balancing sounds like what you want. There will still be a single connection from Telegraf to the load balancer, but the balancer will be able to send each request to a different destination server.

> sets GODEBUG=http2client=0

This disables HTTP/2 because of #5905, which can cause problems recovering when the destination server changes address, but it doesn't change connection persistence.
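For readers less familiar with Go's net/http, the distinction can be sketched with two separate transport settings. This is an illustration using the standard library only, not Telegraf's actual configuration; the helper names `newHTTP1OnlyTransport` and `newNoKeepAliveTransport` are made up. Setting `TLSNextProto` to a non-nil, empty map is the documented in-code way to turn off HTTP/2 for a transport, while `DisableKeepAlives` is the field that controls connection reuse.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// newHTTP1OnlyTransport disables HTTP/2 for this transport by setting
// TLSNextProto to a non-nil, empty map. Keep-alives stay enabled, so
// connections are still persistent and reused across requests.
func newHTTP1OnlyTransport() *http.Transport {
	return &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
}

// newNoKeepAliveTransport turns off connection reuse entirely: every
// request uses its own connection. This is a different knob from HTTP/2.
func newNoKeepAliveTransport() *http.Transport {
	return &http.Transport{DisableKeepAlives: true}
}

func main() {
	h1 := newHTTP1OnlyTransport()
	nk := newNoKeepAliveTransport()

	fmt.Println("HTTP/1 only, keep-alives disabled:", h1.DisableKeepAlives)
	fmt.Println("no keep-alive, keep-alives disabled:", nk.DisableKeepAlives)
}
```

Disabling keep-alives would give the balancer a fresh connection per request, but at the cost of a new TCP (and TLS) handshake for every write.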

perzizzle commented 4 years ago

I'll reach out to our networking folks, but I don't believe our F5 LTMs support load balancing per HTTP request. Thanks.

perzizzle commented 4 years ago

I spoke with our networking folks, and they said the OneConnect profiles haven't worked well in the past. We are going to pursue using HAProxy for load balancing to see if we can decouple the incoming connections to the LB from the outgoing connections to the database.
