Closed protonmarco closed 1 year ago
Hi!
Thanks for collecting the logs and showing the scenario so clearly!
Session not authenticated/authorized
It looks like we only ever authenticate during the initial call to Start(), which runs only when Telegraf first starts up or after you send a SIGHUP, which effectively restarts Telegraf.
Next steps: look into creating a draft PR that re-authenticates when Telegraf receives an Unauthenticated
message, then have the issue submitter test whether that overcomes the failures without needing to send a SIGHUP.
Thanks!
In the Juniper telemetry guide (Interfaces Telemetry) they note: "The external client passes username and password credentials as part of metadata in each RPC. The RPC is allowed if valid credentials are used. Otherwise an error message is returned."
We could remove the LoginCheck() authentication and simply add the credentials to the context metadata, which should solve this problem entirely. Though I'm not sure whether LoginCheck() would have to be kept around for legacy devices, or whether credentials in metadata are supported by all Junos versions capable of telemetry.
@protonmarco I see something similar.
After some debugging, I saw that Telegraf is using the TCP keepalive timers from the Linux kernel. I lowered those numbers, so after a firewall reload / network outage Telegraf establishes a new session within a few minutes (or as soon as network connectivity to the Juniper is back). However, the problem I see now is that on the Juniper side there are two ESTABLISHED TCP sessions, and I think it is trying to send metrics via the first one, as that one hasn't been closed. I need a way to close the initial session so the new one will be used. The only way I can do this at the moment is by restarting the Telegraf agent. Suggestions are welcome.
Happy to see developments on this thread. My personal solution so far was to switch to the gnmi plugin instead of this one, because both of them satisfied my requirements but the former doesn't suffer from the same problem. I tested it and the connection is recovered rapidly after the switch is back.
@josephbrosef maybe you can have a look at it to see if there are substantial differences with the authentication, I'm always available to test things if needed.
@dmalevski I remember having seen multiple ESTABLISHED TCP sessions while I was playing with changing certificates and reboots to test this issue, but I didn't track exactly what I did to trigger it. I remember I was able to fix it by restarting the process with restart mgd-api immediately, if that helps.
Passing creds via context metadata fixes the problem. No restart of Telegraf is required and it picks back up automatically. I'll try to submit the fix in the next few days.
>startup telegraf and connect to VSRX, data collection works OK
...
2022-10-21T08:39:26+11:00 D! [agent] Initializing plugins
2022-10-21T08:39:26+11:00 D! [agent] Connecting outputs
2022-10-21T08:39:26+11:00 D! [agent] Attempting connection to [outputs.file]
2022-10-21T08:39:26+11:00 D! [agent] Successfully connected to outputs.file
2022-10-21T08:39:26+11:00 D! [agent] Starting service inputs
2022-10-21T08:39:26+11:00 D! [inputs.jti_openconfig_telemetry] Opened a new gRPC session to 192.168.32.131 on port 50051
2022-10-21T08:39:29+11:00 D! [inputs.jti_openconfig_telemetry] Received from 192.168.32.131: path:"sensor_1000_1_1:/j.............
>reboot the vSRX
...
2022-10-21T08:39:50+11:00 E! [inputs.jti_openconfig_telemetry] Error in plugin: failed to read from 192.168.32.131: rpc error: code = Unavailable desc = error reading from server: EOF
2022-10-21T08:39:51+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
2022-10-21T08:39:56+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
2022-10-21T08:40:01+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
2022-10-21T08:40:26+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
.................
.................
2022-10-21T08:41:53+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
2022-10-21T08:41:58+11:00 D! [inputs.jti_openconfig_telemetry] Retrying 192.168.32.131 with timeout 5s
>vSRX now online (data collection starts again)
..
2022-10-21T08:42:06+11:00 D! [inputs.jti_openconfig_telemetry] Received from 192.168.32.131: path:"sensor_1000_1_1:/j.............
These 'unauthenticated' messages (like in the original post)...
2022-10-21T09:09:39+11:00 D! [inputs.jti_openconfig_telemetry] Received from 192.168.32.131:
2022-10-21T09:09:39+11:00 D! [inputs.jti_openconfig_telemetry] Available collection for 192.168.32.131 is: []
2022-10-21T09:09:39+11:00 E! [inputs.jti_openconfig_telemetry] Error in plugin: failed to read from 192.168.32.131: rpc error: code = Unauthenticated desc = JGrpcServer: Session not authenticated/authorized
are caused by trying to create a stream with no credentials against a Juniper device that requires authentication. The weird part is that it prints the 'Received from' and 'Available collection' debug lines, which come after the error checking, only to then throw an error. So the first iteration of the below for loop succeeds (with no data/blank on the stream), only to fail on the second iteration with an unauthenticated error (it then breaks and goes back to recreate the stream). I have no idea why the first iteration doesn't fail, but I think we should put in a check here for the unauthenticated message and handle it properly. Otherwise it loops infinitely, generating thousands of log lines per second. Forgetting to put the creds in telegraf.conf (hopefully you figure that one out quickly), using a domain account to log in to the network device that gets locked or changed, or the network device losing access to its central auth could all trigger this annoying infinite logging.
I see this rpc error: code = Unauthenticated desc = JGrpcServer: Session not authenticated/authorized
error even with correct credentials. Restarting Telegraf seems to fix it. I haven't figured out yet why it is happening.
At the moment I suspect it happens when Telegraf loses network connectivity for a brief amount of time.
@protonmarco Thanks for the suggestion. I would like to see if this can be fixed before moving everything to the gnmi plugin.
Relevant telegraf.conf
Logs from Telegraf
System info
telegraf 1.22.3 on CentOS 7, against a QFX10002-36Q on 20.4R3-S2.6
Docker
No response
Steps to reproduce
1. Start Telegraf > Telegraf working
2. Reboot / lose connection to the Juniper device
3. Once the Juniper device is back, telemetry keeps not working (can't see any metrics regarding the device)
4. kill -1 *telegraf pid* to reload the Telegraf configuration > telemetry is back

Expected behavior
It is expected that telemetry is re-established once the device is reachable again.
Actual behavior
Telemetry doesn't come back with the device; the only way to get it back is to restart Telegraf or reload its config by issuing a SIGHUP.
Additional info
This same issue has also been observed in #8845. We're pursuing a case with Juniper in parallel to sort this out and see where the problem lies.