Closed dmalevski closed 1 year ago
After some debugging, I found that Telegraf uses the TCP keepalive timers from the Linux kernel. I lowered those values, so after a firewall reload Telegraf now establishes a new session within a few minutes. However, the problem I see now is that on the Juniper side there are two ESTABLISHED TCP sessions, and the device keeps trying to send metrics over the first one because it has not received a TCP FIN or RST. I need a way to close the initial session so that the new one is used. The only way I can do this at the moment is by restarting the Telegraf agent. Suggestions are welcome.
Do you have logs from Telegraf when it fails to connect during a firewall reload? I'd like to see where it is failing, and then we could look at restarting the client. Outside of that, I'm not sure what else we can do.
Could this just be the same problem as issue 11286 (sorry, the link isn't working)? Telegraf needs to re-authenticate after the firewall comes back online following a reload, which it currently doesn't do because authentication happens in the Start() method. Restarting Telegraf triggers authentication again, which fixes the problem.
@josephbrosef It looks like the same problem; however, I don't see what I can do at the moment to fix the issue. Do you want me to test anything?
@dmalevski and @josephbrosef can you please test the binary in #13709 once CI has finished the tests successfully? Let me know if this fixes your issue!
Relevant telegraf.conf
Logs from Telegraf
System info
OS: Rocky 8.6, Telegraf_version: 1.23.4-1
Docker
No response
Steps to reproduce
I am using the Telegraf jti_openconfig_telemetry plugin to monitor Juniper devices.
When there is an outage of the network firewall that sits between my Telegraf agent and the Juniper device, the TCP session is still seen as established on the Juniper side. The metrics sent by the Juniper are then blocked, because the firewall sees them as part of a TCP session that no longer exists in its state table. Restarting the Telegraf agent establishes a new TCP session, and metrics are received again after that.
Expected behavior
Having a TCP keepalive option would solve this. If the Telegraf agent sees that it cannot connect to the Juniper, it should try to establish a new session after x seconds.
Actual behavior
The TCP session is not torn down, so the Juniper keeps sending metrics, which are not received by the Telegraf agent.
Additional info
No response