influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.51k stars 5.55k forks source link

jti telemetry doesn't come back after losing connection, manual sighup is needed to let it come back #11286

Closed protonmarco closed 1 year ago

protonmarco commented 2 years ago

Relevant telegraf.conf

[[inputs.jti_openconfig_telemetry]]
  ## List of device addresses to collect telemetry from
  servers = ["10.11.12.13:32767"]
  ## Authentication details. Username and password are must if device expects
  ## authentication. Client ID must be unique when connecting from multiple instances
  ## of telegraf to the same device
  username = "user"
  password = "************"
  client_id = "telegraf"
  ## Frequency to get data
  sample_frequency = "5000ms"
  ## Sensors to subscribe for
  ## A identifier for each sensor can be provided in path by separating with space
  ## Else sensor path will be used as identifier
  ## When identifier is used, we can provide a list of space separated sensors.
  ## A single subscription will be created with all these sensors and data will
  ## be saved to measurement with this identifier name
  sensors = [
   "/interfaces/",
   "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/session-state",
   "collection /components /lldp"
  ]
  ## We allow specifying sensor group level reporting rate. To do this, specify the
  ## reporting rate in Duration at the beginning of sensor paths / collection
  ## name. For entries without reporting rate, we use configured sample frequency
  # sensors = [
  # "1000ms customReporting /interfaces /lldp",
  # "2000ms collection /components",
  # "/interfaces"
  #]
  ## Optional TLS Config
  tls_server_name = "10.11.12.13"
  enable_tls = true
  tls_ca = "/etc/telegraf/ssl/ca-chain.pem"
  tls_cert = "/etc/telegraf/ssl/cert.pem"
  tls_key = "/etc/telegraf/ssl/key.pem"
  ## Use TLS but skip chain & host verification
  insecure_skip_verify = false
  ## Delay between retry attempts of failed RPC calls or streams. Defaults to 1000ms.
  ## Failed streams/calls will not be retried if 0 is provided
  retry_delay = "1000ms"
  ## To treat all string values as tags, set this to true
  str_as_tags = false

Logs from Telegraf

>telemetry was working, then QFX was rebooted
.
Jun 13 09:58:24 my_vm telegraf[20333]: 2022-06-13T07:58:24Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:24 my_vm telegraf[20333]: 2022-06-13T07:58:24Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:24 my_vm telegraf[20333]: 2022-06-13T07:58:24Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:25 my_vm telegraf[20333]: 2022-06-13T07:58:25Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:25 my_vm telegraf[20333]: 2022-06-13T07:58:25Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:25 my_vm telegraf[20333]: 2022-06-13T07:58:25Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:26 my_vm telegraf[20333]: 2022-06-13T07:58:26Z D! [outputs.prometheus_client] Wrote batch of 949 metrics in 95.502957ms
Jun 13 09:58:26 my_vm telegraf[20333]: 2022-06-13T07:58:26Z D! [outputs.prometheus_client] Buffer fullness: 9 / 10000 metrics
Jun 13 09:58:26 my_vm telegraf[20333]: 2022-06-13T07:58:26Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:26 my_vm telegraf[20333]: 2022-06-13T07:58:26Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:26 my_vm telegraf[20333]: 2022-06-13T07:58:26Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:27 my_vm telegraf[20333]: 2022-06-13T07:58:27Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:27 my_vm telegraf[20333]: 2022-06-13T07:58:27Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
Jun 13 09:58:27 my_vm telegraf[20333]: 2022-06-13T07:58:27Z D! [inputs.jti_openconfig_telemetry] Retrying 10.11.12.13 with timeout 1s
.
.
>QFX came back online
.
.

Jun 13 09:59:32 my_vm telegraf[20333]: 2022-06-13T07:59:32Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13:
Jun 13 09:59:32 my_vm telegraf[20333]: 2022-06-13T07:59:32Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: []
Jun 13 09:59:32 my_vm telegraf[20333]: 2022-06-13T07:59:32Z E! [inputs.jti_openconfig_telemetry] Error in plugin: failed to read from 10.11.12.13: rpc error: code = Unauthenticated desc = JGrpcServer: Session not authenticated/authorized
Jun 13 09:59:32 my_vm telegraf[20333]: 2022-06-13T07:59:32Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13:
Jun 13 09:59:32 my_vm telegraf[20333]: 2022-06-13T07:59:32Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: []
Jun 13 09:59:32 my_vm telegraf[20333]: 2022-06-13T07:59:32Z E! [inputs.jti_openconfig_telemetry] Error in plugin: failed to read from 10.11.12.13: rpc error: code = Unauthenticated desc = JGrpcServer: Session not authenticated/authorized
.
.
>sighup was issued for telegraf
.
.
Jun 13 10:02:32 my_vm telegraf[20333]: 2022-06-13T08:02:32Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13:
Jun 13 10:02:32 my_vm telegraf[20333]: 2022-06-13T08:02:32Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: []
Jun 13 10:02:32 my_vm telegraf[20333]: 2022-06-13T08:02:32Z E! [inputs.jti_openconfig_telemetry] Error in plugin: failed to read from 10.11.12.13: rpc error: code = Unauthenticated desc = JGrpcServer: Session not authenticated/authorized
Jun 13 10:03:02 my_vm telegraf[20333]: 2022-06-13T08:03:02Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13: system_id:"QFX-HOSTNAME"  component_id:65535  path:"sensor_1004:/network-instances/network-instance/protocols/protocol/bgp/neigh
Jun 13 10:03:02 my_vm telegraf[20333]: 2022-06-13T08:03:02Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: [{3 map[device:10.11.12.13 path:sensor_1004:/network-instances/network-instance/protocols/protocol/bgp/neighbors/nei
Jun 13 10:03:03 my_vm telegraf[20333]: 2022-06-13T08:03:03Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13: system_id:"QFX-HOSTNAME"  path:"sensor_1005_1_1:/junos/system/linecard/transceiver/:/components/:PFE"  sequence_number:1  timest
Jun 13 10:03:03 my_vm telegraf[20333]: 2022-06-13T08:03:03Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: [{3 map[device:10.11.12.13 path:sensor_1005_1_1:/junos/system/linecard/transceiver/:/components/:PFE system_id:T-CHS
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13: system_id:"QFX-HOSTNAME"  component_id:65535  path:"sensor_1005_2_1:/components/:/components/:chassisd"  sequence_number:2  time
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13: system_id:"QFX-HOSTNAME"  component_id:65535  path:"sensor_1007_2_1:/interfaces/:/interfaces/:chassisd"  timestamp:1655107383821
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: [{3 map[device:10.11.12.13 path:sensor_1007_2_1:/interfaces/:/interfaces/:chassisd system_id:QFX-HOSTNAME] map[_
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: [{3 map[device:10.11.12.13 path:sensor_1005_2_1:/components/:/components/:chassisd system_id:QFX-HOSTNAME] map[_
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13: system_id:"QFX-HOSTNAME"  component_id:65535  path:"sensor_1005_3_1:/components/:/components/:jkdsd"  sequence_number:1  timesta
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: [{3 map[device:10.11.12.13 path:sensor_1005_3_1:/components/:/components/:jkdsd system_id:QFX-HOSTNAME] map[__ju
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13: system_id:"QFX-HOSTNAME"  component_id:65535  path:"sensor_1006:/lldp/:/lldp/:l2cpd"  sequence_number:1  timestamp:1655107383847
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: [{3 map[device:10.11.12.13 path:sensor_1006:/lldp/:/lldp/:l2cpd system_id:QFX-HOSTNAME] map[/lldp/state/chassis-
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Received from 10.11.12.13: system_id:"QFX-HOSTNAME"  component_id:65535  path:"sensor_1005_2_1:/components/:/components/:chassisd"  sequence_number:3  time
Jun 13 10:03:04 my_vm telegraf[20333]: 2022-06-13T08:03:04Z D! [inputs.jti_openconfig_telemetry] Available collection for 10.11.12.13 is: [{3 map[device:10.11.12.13 path:sensor_1005_2_1:/components/:/components/:chassisd system_id:QFX-HOSTNAME] map[_
Jun 13 10:03:05 my_vm telegraf[20333]: 2022-06-13T08:03:05Z D! [outputs.prometheus_client] Wrote batch of 919 metrics in 111.271033ms
Jun 13 10:03:05 my_vm telegraf[20333]: 2022-06-13T08:03:05Z D! [outputs.prometheus_client] Buffer fullness: 44 / 10000 metrics

System info

telegraf 1.22.3, on CentOS7 + QFX10002-36Q on 20.4R3-S2.6

Docker

No response

Steps to reproduce

1.start telegraf > telegraf working 2.reboot/lose connection to Juniper device 3.once Juniper device is back, telemetry keeps not working (can't see any metrics regarding the device) 4.kill -1 *telegraf pid* reload telegraf configuration > telemetry is back

Expected behavior

Is expected that telemetry is re-established once the device is reachable again.

Actual behavior

Telemetry doesn't come back with device, the only way to have it back is restart telegraf or reloading its config by issuing a sighup.

Additional info

This same issue has been observed also here #8845 We're pursuing in parallel a case with Juniper to sort this problem and see where the problem lies.

powersj commented 2 years ago

Hi!

Thanks for collecting the logs and showing the scenario so clearly!

Session not authenticated/authorized

It looks like we only ever authenticate during the initial call to Start() this is only run when telegraf first starts up or after you run the SIGHUP, which effectively re-starts telegraf.

next steps: look into possibly creating a draft PR that reauthenticates when telegrafs receive a Unauthenticated message. Have the issue submitter test and see if that overcomes the failures without needing to run SIGHUB.

Thanks!

josephbrosef commented 1 year ago

In the Juniper Telemetry guide Interfaces Telemetry they reference "The external client passes username and password credentials as part of metadata in each RPC. The RPC is allowed if valid credentials are used. Otherwise an error message is returned."

We could remove the 'LoginCheck()' authentication and simply add the credentials to the context metadata, which should solve this problem entirely. Though im not sure if LoginCheck() would have to be kept around for legacy devices? or if creds in metadata is supported by all Junos versions capable of Telemetry.

dmalevski commented 1 year ago

@protonmarco I see something similar.

After some debugging, i saw that Telegraf is using the TCP keepalive timers from the Linux kernel. I lowered those numbers, so after a firewall reload / network outage, Telegraf is establishing new session within few minutes (or when the network connectivity to the Juniper is back). However the problem i see now is that on the Juniper side, there are two ESTABLISHED TCP session, and i think it is trying to send metrics via the first one, as that one hasnt been closed. I need a way to close the initial session, so the new one will be used. The only way i can do this at the moment is by restarting Telegraf agent. Suggestions are welcome

protonmarco commented 1 year ago

Happy to see developments on this thread, currently my personal solution was to switch to the gnmi plugin instead of this one because both of them were satisfying my requirements but the former doesn't suffer of the same problem. I tested it and the connection is recovered rapidly after the switch is back.

@josephbrosef maybe you can have a look at it to see if there are substantial differences with the authentication, I'm always available to test things if needed.

@dmalevski I remember having seen multiple ESTABLISHED TCP sessions while I was playing with changing certificates and reboots to test this issue, but I didn't track exactly what I did to trigger it, I remember I was able to fix it by restarting the process with restart mgd-api immediately, if it can help.

josephbrosef commented 1 year ago

Passing creds via context metadata fixes the problem. No restart to Telegraf is required and it picks back up automatically. Ill try submit the fix in the next few days.

>startup telegraf and connect to VSRX, data collection works OK
...
022-10-21T08:39:26+11:00 D! [agent] Initializing plugins
2022-10-21T08:39:26+11:00 D! [agent] Connecting outputs
2022-10-21T08:39:26+11:00 D! [agent] Attempting connection to [outputs.file]
2022-10-21T08:39:26+11:00 D! [agent] Successfully connected to outputs.file
2022-10-21T08:39:26+11:00 D! [agent] Starting service inputs
2022-10-21T08:39:26+11:00 D! [inputs.jti_openconfig_telemetry] Opened a new gRPC session to 192.168.32.131 on port 50051
2022-10-21T08:39:29+11:00 D! [inputs.jti_openconfig_telemetry] Received from 192.168.32.131: path:"sensor_1000_1_1:/j.............
>reboot the vSRX
...
2022-10-21T08:39:50+11:00 E! [inputs.jti_openconfig_telemetry] Error in plugin: failed to read from 192.168.32.131: rpc error: code = Unavailable desc = error reading from server: EOF
2022-10-21T08:39:51+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
2022-10-21T08:39:56+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
2022-10-21T08:40:01+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
2022-10-21T08:40:26+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
.................
.................
2022-10-21T08:41:53+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
2022-10-21T08:41:58+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
>vSRX now online (data collection starts again) 
..
2022-10-21T08:42:06+11:00 D! [inputs.jti_openconfig_telemetry] Received from 192.168.32.131: path:"sensor_1000_1_1:/j.............

These 'unauthenticated' messages (like in the original post)...

2022-10-21T09:09:39+11:00 D! [inputs.jti_openconfig_telemetry] Received from 192.168.32.131:
2022-10-21T09:09:39+11:00 D! [inputs.jti_openconfig_telemetry] Available collection for 192.168.32.131 is: []
2022-10-21T09:09:39+11:00 E! [inputs.jti_openconfig_telemetry] Error in plugin: failed to read from 192.168.32.131: rpc error: code = Unauthenticated desc = JGrpcServer: Session not authenticated/authorized

Are caused by trying to create a stream with no credentials, against a Juniper device that requires authentication. The weird part is, you can see it prints the 'Received from' and 'Available for collection' debugs, which are after the error checking, only to then throw an error. So the 1st iteration of the below for loop is successful (with no data/blank on the stream), only to then fail on the 2nd iteration with an unauthenticated error (then it breaks and goes back to recreate the stream). I have no idea why the 1st iteration doesn't fail? But I think we should put in a check here for the unauthenticated message and handle it properly. Otherwise it loops infinitely, generating 1000s of logs per second. Forgetting to put the creds in telegraf.conf (hopefully you figure this one out quick), using a domain account to login to the network device that gets locked/changed, or even if the network device loses access to its central auth, could all trigger this annoying infinite logging.

https://github.com/influxdata/telegraf/blob/4002abba2ed00a21f8961defc0f405f229bd9d4f/plugins/inputs/jti_openconfig_telemetry/jti_openconfig_telemetry.go#L260-L291

dmalevski commented 1 year ago

I see this rpc error: code = Unauthenticated desc = JGrpcServer: Session not authenticated/authorized error even with correct credentials. Restart of telegraf seems to fix it. I havent figured out yet why it is happening. At the moment i suspect when telegraf loses network connectivity for a brief amount of time, this happens.

@protonmarco Thanks for the suggestion. I would like to see if this can be fixed before moving everything to gnmi plugin