influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Please merge latest gopcua library #9551

Closed by jnangle 2 years ago

jnangle commented 3 years ago

Relevant telegraf.conf:

  ## Metric name
  # name = "opcua"
  #
  ## OPC UA Endpoint URL
  endpoint = "opc.tcp://10.79.116.248:4840"
  #
  ## Maximum time allowed to establish a connection to the endpoint.
  connect_timeout = "30s"
  #
  ## Maximum time allowed for a request over the established connection.
  request_timeout = "20s"
  #
  ## Security policy, one of "None", "Basic128Rsa15", "Basic256",
  ## "Basic256Sha256", or "auto"
  security_policy = "None"
  #
  ## Security mode, one of "None", "Sign", "SignAndEncrypt", or "auto"
  security_mode = "None"
  #
  ## Path to cert.pem. Required when security mode or policy isn't "None".
  ## If cert path is not supplied, self-signed cert and key will be generated.
  # certificate = "/etc/telegraf/cert.pem"
  #
  ## Path to private key.pem. Required when security mode or policy isn't "None".
  ## If key path is not supplied, self-signed cert and key will be generated.
  # private_key = "/etc/telegraf/key.pem"
  #
  ## Authentication Method, one of "Certificate", "UserName", or "Anonymous".  To
  ## authenticate using a specific ID, select 'Certificate' or 'UserName'
  auth_method = "Anonymous"
  #
  ## Username. Required for auth_method = "UserName"
  # username = ""
  #
  ## Password. Required for auth_method = "UserName"
  # password = ""
  #
   nodes = [
     {name="CabinetTemp_M1", namespace="2", identifier_type="i", identifier="1030"}
   ]
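
For reference, the fields of a `nodes` entry correspond to the parts of an OPC UA node ID in string notation: the entry above describes the node `ns=2;i=1030`, the same ID passed to the gopcua example further down. The Go sketch below shows that mapping. `NodeSettings` and `NodeIDString` are hypothetical names used for illustration and are not taken from Telegraf's code.

```go
package main

import "fmt"

// NodeSettings mirrors the four fields of one entry in the `nodes` list.
// Hypothetical type for illustration only, not Telegraf's API.
type NodeSettings struct {
	Name           string
	Namespace      string
	IdentifierType string // "i" numeric, "s" string, "g" GUID, "b" opaque
	Identifier     string
}

// NodeIDString joins the fields into OPC UA string notation,
// "ns=<namespace>;<type>=<identifier>".
func NodeIDString(n NodeSettings) string {
	return fmt.Sprintf("ns=%s;%s=%s", n.Namespace, n.IdentifierType, n.Identifier)
}

func main() {
	n := NodeSettings{
		Name:           "CabinetTemp_M1",
		Namespace:      "2",
		IdentifierType: "i",
		Identifier:     "1030",
	}
	fmt.Println(NodeIDString(n)) // prints "ns=2;i=1030"
}
```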

System info:

I'm currently using Telegraf 1.20.0-6f2553d6 for Windows (amd64), running on Windows 10. I am attempting to connect to a Red Lion DA50 running a Crimson v3.1 OPC UA server.

Telegraf will connect to a Kepware v6.10 and a ProSys OPC UA simulation server. It will not connect to the Red Lion. However, I am able to connect to the Red Lion using the gopcua library v0.1.13.

Steps to reproduce:

  1. If I use the configuration above, I get: Error in plugin: registerNodes failed: The operation timed out. StatusBadTimeout (0x800A0000) when attempting to connect to the Red Lion server using Telegraf.

  2. When I connect to the ProSys server, or the KepWare server, the data logs to my local InfluxDB installation using Telegraf.

  3. I can connect to the Red Lion server from KepWare and UAExpert, using the same connection url: opc.tcp://10.79.116.248:4840, and I am able to read data tags.
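
As a reading aid for the hexadecimal value in the error message: an OPC UA status code is a 32-bit number whose two most significant bits carry the severity (Good, Uncertain, or Bad). The Go sketch below decodes that part. The lookup table holds only the two codes quoted in this thread, and `Severity` and `Describe` are hypothetical helper names, not functions from gopcua or Telegraf.

```go
package main

import "fmt"

// Severity returns the class encoded in the two most significant bits
// of an OPC UA status code.
func Severity(code uint32) string {
	switch code >> 30 {
	case 0:
		return "Good"
	case 1:
		return "Uncertain"
	case 2:
		return "Bad"
	default:
		return "Reserved"
	}
}

// knownCodes lists only the codes that appear in this issue.
var knownCodes = map[uint32]string{
	0x800A0000: "StatusBadTimeout",
	0x800D0000: "StatusBadServerNotConnected",
}

// Describe formats a code with its name (if listed) and severity.
func Describe(code uint32) string {
	name, ok := knownCodes[code]
	if !ok {
		name = "unknown"
	}
	return fmt.Sprintf("0x%08X %s (%s)", code, name, Severity(code))
}

func main() {
	fmt.Println(Describe(0x800A0000)) // prints "0x800A0000 StatusBadTimeout (Bad)"
	fmt.Println(Describe(0x800D0000)) // prints "0x800D0000 StatusBadServerNotConnected (Bad)"
}
```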

Expected behavior:

This is the behavior when running gopcua v0.1.13 monitoring example. I include this to illustrate that the library is able to connect to the Red Lion server.

jnangle-31556s:opcua@v0.1.13 jnangle$ go run examples/monitor/monitor.go -endpoint opc.tcp://10.79.116.248:4840 -node 'ns=2;i=1030'
2021/07/27 13:30:01 *http://opcfoundation.org/UA/SecurityPolicy#NoneMessageSecurityModeNone
2021/07/27 13:30:01 [channel ] sub=4713 ts=2021-07-27T19:28:23Z node=ns=2;i=1030 value=44
2021/07/27 13:30:01 [callback] sub=4714 ts=2021-07-27T19:28:23Z node=ns=2;i=1030 value=44
2021/07/27 13:30:01 [channel ] sub=4713 ts=2021-07-27T19:28:23Z node=ns=2;i=1030 value=44.20000076293945
^Z
[5]+ Stopped go run examples/monitor/monitor.go -endpoint opc.tcp://10.79.116.248:4840 -node 'ns=2;i=1030'
jnangle-31556s:opcua@v0.1.13 jnangle$

Actual behavior:

D:\InfluxDB\telegraf-latest\telegraf-1.20.0>.\telegraf --config http://localhost:8086/api/v2/telegrafs/07d830630e2b0000 --test
2021-07-27T22:05:50Z I! Starting Telegraf
2021-07-27T22:05:50Z D! [agent] Initializing plugins
2021-07-27T22:05:50Z I! Failed to load certificate: open /etc/telegraf/cert.pem: The system cannot find the path specified.
2021-07-27T22:05:50Z D! [agent] Starting service inputs
2021-07-27T22:06:11Z E! [inputs.opcua] Error in plugin: registerNodes failed: The operation timed out. StatusBadTimeout (0x800A0000)
2021-07-27T22:06:11Z D! [agent] Stopping service inputs
2021-07-27T22:06:11Z D! [agent] Input channel closed
2021-07-27T22:06:11Z D! [agent] Stopped Successfully
2021-07-27T22:06:11Z E! [telegraf] Error running agent: input plugins recorded 1 errors

D:\InfluxDB\telegraf-latest\telegraf-1.20.0>

Additional info:

srebhan commented 3 years ago

@jnangle can you please give #9560 a try? You might want to use the artifacts built by CI (just click the small black triangle)...

jnangle commented 3 years ago

@srebhan, thank you for working on this so quickly. I downloaded the new code this afternoon and ran it with the same config I've been using. It still failed to connect to the Red Lion OPC server, but with a different error. Here is the output:

D:\InfluxDB\telegraf-1.20.0 2>telegraf --config http://localhost:8086/api/v2/telegrafs/07d830630e2b0000 --test
2021-07-30T19:41:28Z I! Starting Telegraf
2021-07-30T19:41:28Z D! [agent] Initializing plugins
2021-07-30T19:41:28Z I! Failed to load certificate: open /etc/telegraf/cert.pem: The system cannot find the path specified.
2021-07-30T19:41:28Z D! [agent] Starting service inputs
2021-07-30T19:41:28Z E! [inputs.opcua] Error in plugin: registerNodes failed: EOF
2021-07-30T19:41:28Z D! [agent] Stopping service inputs
2021-07-30T19:41:28Z D! [agent] Input channel closed
2021-07-30T19:41:28Z D! [agent] Stopped Successfully
2021-07-30T19:41:28Z E! [telegraf] Error running agent: input plugins recorded 1 errors

D:\InfluxDB\telegraf-1.20.0 2>

The new Telegraf still connects to the other OPC servers, though.

srebhan commented 3 years ago

@jnangle I added an attempt to close the connection entirely to PR #9560. It would be nice if you could give it a try once it is built.

Edit: I also folded #9524 into the above PR to ease testing.

jnangle commented 3 years ago

@srebhan,

I think I left this comment in the comments for PR #9524 (more coffee this morning, I guess).

The code errored out again at the 1-hr mark with the following messages:


2021-08-03T16:00:07Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2021-08-03T16:00:16Z E! [inputs.opcua] Error in plugin: RegisterNodes Read failed: The operation could not complete because the client is not connected to the server. StatusBadServerNotConnected (0x800D0000)
panic: close of closed channel
        panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x8 pc=0x32915ad]

goroutine 7715 [running]:
github.com/gopcua/opcua/uacp.(*Conn).Close(0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/gopcua/opcua@v0.2.0-rc2.0.20210409063412-baabb9b14fd2/uacp/conn.go:166 +0x2d
panic(0x5138100, 0x62858b0)
        /usr/local/go/src/runtime/panic.go:965 +0x1c7
github.com/gopcua/opcua.(*Client).Close(0xc000ec61c0, 0x0, 0x0)
        /go/pkg/mod/github.com/gopcua/opcua@v0.2.0-rc2.0.20210409063412-baabb9b14fd2/client.go:516 +0xe9
github.com/influxdata/telegraf/plugins/inputs/opcua.Connect(0xc00016c340, 0x0, 0x0)
        /go/src/github.com/influxdata/telegraf/plugins/inputs/opcua/opcua_client.go:409 +0x52e
github.com/influxdata/telegraf/plugins/inputs/opcua.(*OpcUA).Gather(0xc00016c340, 0x6414658, 0xc000a3a240, 0xdb5b55, 0xc001387020)
        /go/src/github.com/influxdata/telegraf/plugins/inputs/opcua/opcua_client.go:522 +0x5c5
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc00013c690, 0x6414658, 0xc000a3a240, 0xdf661d, 0x5cc0b80)
        /go/src/github.com/influxdata/telegraf/models/running_input.go:117 +0x74
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1(0xc000750de0, 0xc00013c690, 0x6414658, 0xc000a3a240)
        /go/src/github.com/influxdata/telegraf/agent/agent.go:469 +0x46
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce
        /go/src/github.com/influxdata/telegraf/agent/agent.go:468 +0xc5

C:\Users\scada_read\Desktop\Installers\InfluxDB\telegraf-1.20.0 3>

srebhan commented 3 years ago

I fear we have to dig deeper into that library. Will look into this after my time-off (i.e. in 2 to 3 weeks). Sorry, seems like there is no quick solution. :-(
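
The trace above shows two failures in a row: `close of closed channel`, then a nil pointer dereference inside `(*Conn).Close`, which was called with a nil receiver (`0x0`). A close method that is idempotent and tolerates a nil receiver avoids both. The Go sketch below illustrates that general pattern with a hypothetical `conn` type. It is not the gopcua implementation and not the fix that was eventually applied.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a connection that owns a channel.
type conn struct {
	done chan struct{}
	once sync.Once
}

func newConn() *conn {
	return &conn{done: make(chan struct{})}
}

// Close tolerates a nil receiver and repeated calls.
func (c *conn) Close() error {
	if c == nil {
		return nil // nothing to close, so no nil pointer dereference
	}
	// sync.Once ensures the channel is closed exactly once.
	c.once.Do(func() { close(c.done) })
	return nil
}

func main() {
	c := newConn()
	fmt.Println(c.Close(), c.Close()) // the second call does not panic

	var missing *conn
	fmt.Println(missing.Close()) // a nil receiver is a no-op
}
```

Whether a nil connection should be silently ignored or reported as an error is a design choice. The sketch ignores it so that cleanup paths can call `Close` unconditionally.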

srebhan commented 3 years ago

@jnangle can you please check if PR #9583 fixes your problem? Maybe in combination with PR #9524...

jnangle commented 3 years ago

Telegraf no longer errors out on disconnect. However, it still will not connect to the Red Lion OPC UA server:


2021-08-05T15:03:28Z I! Starting Telegraf
2021-08-05T15:03:28Z D! [agent] Initializing plugins
2021-08-05T15:03:28Z I! Failed to load certificate: open /etc/telegraf/cert.pem: The system cannot find the path specified.
2021-08-05T15:03:28Z D! [agent] Starting service inputs
2021-08-05T15:03:49Z E! [inputs.opcua] Error in plugin: registerNodes failed: The operation timed out. StatusBadTimeout (0x800A0000)
2021-08-05T15:03:49Z D! [agent] Stopping service inputs
2021-08-05T15:03:49Z D! [agent] Input channel closed
2021-08-05T15:03:49Z D! [agent] Stopped Successfully
2021-08-05T15:03:49Z E! [telegraf] Error running agent: input plugins recorded 1 errors

D:\InfluxDB\Telegraf PR9583\telegraf-1.20.0>

jnangle commented 3 years ago

@srebhan,

I hope you're rested after your break.

I wanted to check in and see if there were any updates to this request. I was able to connect Telegraf to a Beckhoff OPC UA server, but found that v1.19.3 doesn't support array variables from the server. As with the Red Lion connection problem, I tested the Beckhoff connection with gopcua v0.1.13 and found that I was able to read array variables. So it seems that incorporating the latest gopcua library would solve two problems for me, and likely for others, too.

Thank you,

srebhan commented 2 years ago

@jnangle well holidays are always too short. :-)

Let's try to keep those two things (connection problem, array variables) separated. Please open another issue for the array thing and feel free to mention/assign me.

Regarding the connection problem, it looks like the change fixes something but now breaks elsewhere. Is my understanding correct? You tried with the tool they provide and it works reliably (even after that 1 hour)?

jnangle commented 2 years ago

Hello @srebhan,

Thank you for the message - I agree that holidays are always too short. I will open a new ticket for the array variable support.

To clarify on the connection support:

Telegraf 1.20.0-4 has been running successfully with a Kepware server for a couple of weeks. This same version will also connect to my Beckhoff server. It will not connect to my Red Lion server.

I noticed something in the connection that might benefit your troubleshooting: the Red Lion publishes OPC-UA tags using a numeric tag ID and the namespace, e.g. "ns=2;i=1764". The Kepware and Beckhoff servers expose the tags using string identifiers, e.g. "ns=2;s=MAIN.fPhase_A_Voltage".
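
To make the difference between the two identifier styles concrete, here is a minimal Go sketch that splits a node ID string into namespace, identifier type, and identifier, and checks that an `i=` identifier really is numeric. `ParsedNodeID` and `ParseNodeID` are hypothetical names. The sketch uses only the standard library, requires the `ns=` prefix for simplicity, and does not reflect how Telegraf or gopcua parse node IDs internally.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParsedNodeID is a hypothetical holder for the parts of a node ID string.
type ParsedNodeID struct {
	Namespace  uint16
	Type       string // "i" numeric, "s" string, "g" GUID, "b" opaque
	Identifier string
}

// ParseNodeID splits strings of the form "ns=<n>;<type>=<identifier>".
// Assumption: the "ns=" part is always present.
func ParseNodeID(s string) (ParsedNodeID, error) {
	parts := strings.SplitN(s, ";", 2)
	if len(parts) != 2 || !strings.HasPrefix(parts[0], "ns=") {
		return ParsedNodeID{}, fmt.Errorf("node ID %q: want ns=<n>;<type>=<id>", s)
	}
	ns, err := strconv.ParseUint(strings.TrimPrefix(parts[0], "ns="), 10, 16)
	if err != nil {
		return ParsedNodeID{}, fmt.Errorf("node ID %q: bad namespace: %w", s, err)
	}
	kv := strings.SplitN(parts[1], "=", 2)
	if len(kv) != 2 || len(kv[0]) != 1 || !strings.Contains("isgb", kv[0]) {
		return ParsedNodeID{}, fmt.Errorf("node ID %q: unknown identifier type", s)
	}
	if kv[0] == "i" {
		// Numeric identifiers must fit in an unsigned 32-bit integer.
		if _, err := strconv.ParseUint(kv[1], 10, 32); err != nil {
			return ParsedNodeID{}, fmt.Errorf("node ID %q: numeric identifier: %w", s, err)
		}
	}
	return ParsedNodeID{Namespace: uint16(ns), Type: kv[0], Identifier: kv[1]}, nil
}

func main() {
	for _, s := range []string{"ns=2;i=1764", "ns=2;s=MAIN.fPhase_A_Voltage"} {
		id, err := ParseNodeID(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> namespace %d, type %s, identifier %s\n",
			s, id.Namespace, id.Type, id.Identifier)
	}
}
```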

I have downloaded and successfully tested another tool from Factry.io, called OPC-Datalogger. The tool connects to OPC-UA servers and logs data to an InfluxDB bucket. The Factry tool is able to connect to both Beckhoff and Red Lion servers and collect data. So, I'm wondering if there's something about the way that Telegraf attempts to connect to OPC tags with numeric identifiers that is causing it to fail.

Regarding the gopcua library, I have downloaded version 0.1.13 and, using the example code, I am able to connect to the Red Lion server and monitor tags using numeric tag IDs, as well as to the Beckhoff server using string IDs. Incidentally, I am also able to monitor array variables using this code.

I hope this helps, let me know if I can provide more information.

srebhan commented 2 years ago

OK, good, so we are troubleshooting the Red Lion part then... Let me see if I can find some time to sprinkle some debug messages over the code and see where it differs between that Red Lion server and the other two (thanks for the hint!). Can you help debug that problem? Are you available on Slack for faster cycles?

Unfortunately, that other data logger won't give us much insight, as it doesn't use the underlying library and thus probably won't trigger the issue. I guess it's some Go-, library-, or Telegraf-specific thing... Let's track that down!

jnangle commented 2 years ago

Hi @srebhan,

Yes, it's just the Red Lion numeric IDs that are not working. I am on Slack if that is an easier/faster way to get things tested. I am available to help debug, just let me know. My timezone is UTC-6:00, so I might be a little delayed relative to you.