influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Streaming telemetry ``jti_openconfig_telemetry`` plugin does not support TCP keepalive #12017

Closed dmalevski closed 1 year ago

dmalevski commented 2 years ago

Relevant telegraf.conf

# # Subscribe and receive OpenConfig Telemetry data using JTI
 [[inputs.jti_openconfig_telemetry]]
#   ## List of device addresses to collect telemetry from
    servers = ["x.x.x.x:32767"]
#   ## Authentication details. Username and password are must if device expects
#   ## authentication. Client ID must be unique when connecting from multiple instances
#   ## of telegraf to the same device
    username = 'telegraf'
    password = 'password'
    client_id = "telegraf"
#
#   ## Frequency to get data
    sample_frequency = "60000ms"
    retry_delay = "5000ms"
#   ## Sensors to subscribe for
#   ## A identifier for each sensor can be provided in path by separating with space
#   ## Else sensor path will be used as identifier
#   ## When identifier is used, we can provide a list of space separated sensors.
#   ## A single subscription will be created with all these sensors and data will
#   ## be saved to measurement with this identifier name
#  sensors = [
#   "/interfaces/interface/state/counters/in-octets"
#]
sensors = [
  "interface /interfaces",
  "mem /junos/system/linecard/cpu/memory",
  "npu /junos/system/linecard/npu/utilization/",
  "bgp_session_state /network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/session-state",
  "bgp_prefixes /network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/afi-safis/afi-safi/state/prefixes/",
  "number_routes /bgp-rib/afi-safis/afi-safi/ipv4-unicast/loc-rib/num-routes"
 ]

  ## Optional TLS Config
#   enable_tls = true
#   tls_ca = "/etc/telegraf/ca.crt"
#   tls_cert = "/etc/telegraf/telegraf.crt"
#   tls_key = "/etc/telegraf/telegraf.key"
  # Use TLS but skip chain & host verification
#   insecure_skip_verify = false

Logs from Telegraf

[root@juniper-exporter] (juniper-exporter) # sudo -u telegraf telegraf -config telegraf.conf --debug
2022-10-14T08:59:07Z I! Starting Telegraf 1.23.4
2022-10-14T08:59:07Z I! Loaded inputs: jti_openconfig_telemetry (8x)
2022-10-14T08:59:07Z I! Loaded aggregators: 
2022-10-14T08:59:07Z I! Loaded processors: 
2022-10-14T08:59:07Z I! Loaded outputs: prometheus_client
2022-10-14T08:59:07Z I! Tags enabled: host=juniper-exporter
2022-10-14T08:59:07Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"juniper-exporter", Flush Interval:10s
2022-10-14T08:59:07Z D! [agent] Initializing plugins
2022-10-14T08:59:07Z D! [agent] Connecting outputs
2022-10-14T08:59:07Z D! [agent] Attempting connection to [outputs.prometheus_client]
====================================
[root@juniper-exportertelegraf] (juniper-exporter) # systemctl status telegraf
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-10-14 09:53:02 CEST; 1h 7min ago
     Docs: https://github.com/influxdata/telegraf
  Process: 3354683 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
 Main PID: 3356425 (telegraf)
    Tasks: 10 (limit: 100832)
   Memory: 86.6M
   CGroup: /system.slice/telegraf.service
           └─3356425 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Oct 14 09:53:02 juniper-exporter telegraf[3356425]: 2022-10-14T07:53:02Z I! Loaded processors:
Oct 14 09:53:02 juniper-exporter telegraf[3356425]: 2022-10-14T07:53:02Z I! Loaded outputs: prometheus_client
Oct 14 09:53:02 juniper-exporte telegraf[3356425]: 2022-10-14T07:53:02Z I! Tags enabled: host=juniper-exporter
Oct 14 09:53:02 juniper-exporter telegraf[3356425]: 2022-10-14T07:53:02Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"juniper-exporter", F>
Oct 14 09:53:02 juniper-exporter telegraf[3356425]: 2022-10-14T07:53:02Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
Oct 14 09:53:02 juniper-exporter systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Oct 14 09:53:22 juniper-exporter telegraf[3356425]: 2022-10-14T07:53:22Z E! [inputs.jti_openconfig_telemetry] Could not initiate login check for x.x.x.x:32767: rpc error: code = Unava>
Oct 14 09:53:42 juniper-exporter telegraf[3356425]: 2022-10-14T07:53:42Z E! [inputs.jti_openconfig_telemetry] Could not initiate login check for x.x.x.x:32767: rpc error: code = Unava

System info

OS: Rocky 8.6, Telegraf_version: 1.23.4-1

Docker

No response

Steps to reproduce

I am using the Telegraf jti_openconfig_telemetry plugin to monitor Juniper devices.

When the network firewall that sits between my Telegraf agent and the Juniper device has an outage, the TCP session is still seen as established on the Juniper side, so the metrics sent by the Juniper device are blocked (they are seen as part of a TCP session which does not exist). Restarting the Telegraf agent establishes a new TCP session, and metrics are received again after that.

Expected behavior

Having a TCP keepalive option would solve this. If the Telegraf agent sees that it cannot connect to the Juniper device, it should try to establish a new session after x seconds.

Actual behavior

The TCP session is not torn down, so the Juniper device keeps sending metrics, which are not received by the Telegraf agent.

Additional info

No response

dmalevski commented 2 years ago

After some debugging, I saw that Telegraf is using the TCP keepalive timers from the Linux kernel. I lowered those values, so after a firewall reload Telegraf now establishes a new session within a few minutes. However, the problem I see now is that on the Juniper side there are two ESTABLISHED TCP sessions, and the device keeps trying to send metrics via the first one, as it hasn't received a TCP FIN/RST. I need a way to close the initial session so that the new one will be used. The only way I can do this at the moment is by restarting the Telegraf agent. Suggestions are welcome.

powersj commented 2 years ago

Do you have logs from Telegraf when it fails to connect during a firewall reload? I'd like to see where it is failing, and then we could look at restarting the client. Outside of that, I'm not sure what else we can do.

josephbrosef commented 2 years ago

Could this just be the same problem as issue 11286 (sorry, the link isn't working)? Telegraf needs to re-authenticate once the firewall comes back online after a reload, which it currently doesn't do because authentication happens in the Start() method. Restarting Telegraf triggers authentication again, which fixes it.

dmalevski commented 2 years ago

@josephbrosef It looks like the same problem; however, I don't see what I can do at the moment to fix the issue. Do you want me to test anything?

srebhan commented 1 year ago

@dmalevski and @josephbrosef, can you please test the binary in #13709 once CI has finished the tests successfully? Let me know if this fixes your issue!