influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

gNMI and jti_openconfig_telemetry TLS issues #8181

Closed: jdratlif closed this 3 years ago

jdratlif commented 4 years ago

We've been using telegraf to collect streaming telemetry from Juniper routers. It's been working well for us, but we had to make a custom output plugin. With the release of 1.15 and the execd output plugin, we want to switch to that and stop compiling telegraf ourselves.

However, when I tried our existing config with telegraf 1.15.3, installed from the upstream telegraf repo for Red Hat 7, I couldn't connect. Telegraf just reports that it is retrying the connection, over and over. When I look at the Juniper logs, I see TLS errors:

Sep 23 18:54:32 chttp2_server.c:83: Handshaking failed: {"created":"@1600887272.560535833","description":"Cannot check peer: missing selected ALPN property.","file":"../../../../../../../../src/external/bsd/grpc/dist/src/core/lib/security/transport/security_connector.c","file_line":589}

After rebuilding the VMX instances and reissuing certificates, I tried compiling telegraf from source. I got a slightly different error message, but the same problem:

Sep 23 18:55:35 chttp2_server.c:83: Handshaking failed: {"created":"@1600887335.591207967","description":"Handshake failed","file":"../../../../../../../../src/external/bsd/grpc/dist/src/core/lib/security/transport/security_handshaker.c","file_line":276,"tsi_code":10,"tsi_error":"TSI_PROTOCOL_FAILURE"}

I decided to try compiling with an older version of golang. If I compile telegraf with golang 1.13, everything works. If I use golang 1.14 or 1.15, it does not.

I'm not sure if this is a golang issue, a telegraf issue, a Juniper issue, or something else. I asked about this on Discord, and someone there pointed out this item in the golang 1.14 release notes:

https://golang.org/doc/go1.14#minor_library_changes

The tls package no longer supports the legacy Next Protocol Negotiation (NPN) extension and now only supports ALPN. In previous releases it supported both. There are no API changes and applications should function identically as before. Most other clients and servers have already removed NPN support in favor of the standardized ALPN.

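That suggests it could be a Juniper issue: if the router's gRPC server only negotiates the protocol via NPN, a client built with golang 1.14 or later can no longer complete that negotiation. I am contacting Juniper as well, but I decided to file this here in case the problem isn't on their side.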

Relevant telegraf.conf:

[[inputs.jti_openconfig_telemetry]]

  # An array of strings containing the name of a host to collect from and the port to connect on (colon-delimited)
  servers = [
    "vmx1.grnoc.iu.edu:7443",
  ]

  # The greatest common ancestor path of all sensors listed in the [[outputs.tsds]] plugin
  sensors          = ["/interfaces/"]

  # The client ID Telegraf uses to connect to the servers
  # NOTE: There is a limit set on most devices for number of sessions per client_id
  client_id        = "grnoc-telegraf"

  # How often nodes should send data (in milliseconds like "60000ms")
  sample_frequency = "30000ms"

  # How often to wait before retrying a connection to a server (in milliseconds like "10000ms")
  retry_delay      = "15000ms"

  # Enabled whenever TLS is being used to authenticate the connection to the servers
  enable_tls       = true
  tls_ca           = "/etc/pki/tls/certs/telegraf/grnoc/vmx/ca.cert"
  tls_cert         = "/etc/pki/tls/certs/telegraf/grnoc/vmx/io3.bldc.grnoc.iu.edu.cert"
  tls_key          = "/etc/pki/tls/private/telegraf/grnoc/vmx/io3.bldc.grnoc.iu.edu.key"
  str_as_tags      = false

System info:

telegraf 1.15.3 CentOS 7

Steps to reproduce:

Try to collect gNMI telemetry on a Juniper router with upstream builds of telegraf configured with TLS. It will always fail to connect. The Juniper logs will say there is a TLS issue.

Expected behavior:

That the connection works and I get interface data streamed back to telegraf.

Actual behavior:

It fails to connect, and the Juniper logs show TLS errors.

Additional info:

danatinflux commented 3 years ago

This appears to be specific to Juniper gear. The OS release notes on their site appear to list fixes for ALPN issues.

Closing.

sjwang90 commented 3 years ago

Have we determined that this issue is due to Juniper? Going to close; feel free to re-open if it turns out to be a Telegraf problem rather than a Juniper one.