locka99 / opcua

A client and server implementation of the OPC UA specification written in Rust
Mozilla Public License 2.0

Failing to detect lost connection and reconnect #189

Open thosaa opened 2 years ago

thosaa commented 2 years ago

Hi, I'm starting a client that connects to a Microsoft OPC PLC server (mcr.microsoft.com/iotedge/opc-plc).

The client connects, subscribes, and gets event notifications; however, I'm having an issue with detecting a lost connection and reconnecting. I'm using the current master branch, but I have seen the same behaviour in 0.8 and 0.9.

The client is started with quite aggressive retry and timeout settings:

let client = Box::new(ClientBuilder::new()
        .application_name("CDGs OPC UA Client")
        .application_uri("urn:Discovery")
        .product_uri("urn:OpcuaClient")
        .create_sample_keypair(true)
        .trust_server_certs(true)
        .ignore_clock_skew()
        .session_retry_interval(5000) // 5000 ms retry interval
        .session_timeout(5000)        // 5000 ms session timeout
        .session_retry_limit(-1)      // retry without limit
        .client().unwrap());

And I have set up a callback on connection state change:

session.set_connection_status_callback(ConnectionStatusCallback::new(move |connected| { 
      info!("**** Connection status is now: {} ****", connected); 
}));
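
If it's relevant, the session-closed callback could be wired up the same way. A sketch, assuming its closure receives the StatusCode the session was closed with:

session.set_session_closed_callback(SessionClosedCallback::new(move |status| {
      // Sketch only: log why the session was closed; the exact closure signature is an assumption.
      info!("**** Session closed with status {} ****", status);
}));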

When I stop the OPC server with docker stop {container_id}, the client stops getting events and starts sending keep-alive requests. This is what the log shows:

2022-04-07 13:38:35.783Z INFO  opcua_client::session::session - Session activity keep-alive request
2022-04-07 13:38:35.785Z DEBUG opcua_client::message_queue - Request 43 was processed by the server
2022-04-07 13:38:43.783Z INFO  opcua_client::session::session - Session activity keep-alive request
2022-04-07 13:38:43.785Z DEBUG opcua_client::message_queue - Request 44 was processed by the server
2022-04-07 13:38:43.785Z ERROR opcua_client::comms::tcp_transport - Write bytes task IO error Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
2022-04-07 13:38:51.783Z INFO  opcua_client::session::session - Session activity keep-alive request
2022-04-07 13:38:51.786Z DEBUG opcua_client::message_queue - Request 45 was processed by the server
2022-04-07 13:38:51.786Z ERROR opcua_client::comms::tcp_transport - Write bytes task IO error Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
2022-04-07 13:38:59.783Z INFO  opcua_client::session::session - Session activity keep-alive request
2022-04-07 13:38:59.786Z DEBUG opcua_client::message_queue - Request 46 was processed by the server
2022-04-07 13:38:59.787Z ERROR opcua_client::comms::tcp_transport - Write bytes task IO error Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
2022-04-07 13:39:07.783Z INFO  opcua_client::session::session - Session activity keep-alive request
2022-04-07 13:39:07.784Z DEBUG opcua_client::message_queue - Request 47 was processed by the server
2022-04-07 13:39:07.785Z ERROR opcua_client::comms::tcp_transport - Write bytes task IO error Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }

And the connection_status_callback is not called.

Have I missed a configuration?

There is a TODO in session/session_state.rs where it might be feasible to get a connection-lost error back: https://github.com/locka99/opcua/blob/aaf73a2c729981aa676957871f3fb3b8cfef717c/client/src/session/session_state.rs#L350

I would like to help with the implementation and a pull request, but I need some guidance on how this should be done, and on whether the client should already be able to reconnect automatically.

When I start the docker container again, the log just continues as above and no new connection is made. This might just be a symptom of not detecting the lost connection, however.

Regards Thomas

svanharmelen commented 2 years ago

Just FYI, we too sometimes see these broken pipe errors in our logs, but we haven't had a chance to take a closer look yet...

diohpix commented 2 years ago

I had the same problem. I connected to a Siemens 840D OPC UA server; after about 8 minutes the server dropped the connection, but the client crashed without retrying.

ctron commented 2 years ago

I am having the same issue. I see the following errors in the log:

[2022-04-21T16:18:08Z ERROR opcua::client::comms::tcp_transport] Write bytes task IO error Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }

I guess the client doesn't figure out that the connection is gone.

ctron commented 2 years ago

Taking a look at the code, I guess this error is just ignored:

https://github.com/locka99/opcua/blob/3b052d0f8a7c5fcb99f164ebb31c22d61c8a1d2d/lib/src/client/comms/tcp_transport.rs#L476-L500
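
Roughly, propagating the failure instead of only logging it could look like the standalone sketch below (not the crate's actual internals; the types and the channel are illustrative):

use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use tokio::{io::AsyncWriteExt, net::TcpStream, sync::mpsc};

// Illustrative write task: on an IO error, flip a shared "finished" flag so a
// supervising task can tear the connection down and fire the status callbacks,
// instead of just logging the error and carrying on.
async fn write_bytes_task(
    mut stream: TcpStream,
    mut rx: mpsc::Receiver<Vec<u8>>,
    finished: Arc<AtomicBool>,
) {
    while let Some(bytes) = rx.recv().await {
        if let Err(err) = stream.write_all(&bytes).await {
            eprintln!("Write bytes task IO error {:?}", err);
            finished.store(true, Ordering::SeqCst);
            break;
        }
    }
}

A supervising task could then poll finished (or await a channel) and mark the connection as dead so the session layer can start reconnecting.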

ctron commented 2 years ago

I just tried a fix by setting the writer state to finished in case of an error.

However, that results in one of the two Tokio runtimes being dropped in an async context, which panics.

I tried removing both Tokio runtimes and re-using an external one (e.g. #[tokio::main]), which works for my use case, but requires users to adopt the full tokio/async concept too.

I created a PR to showcase the changes.
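
To illustrate what that means for applications, here is a rough sketch of a caller owning a single external runtime (connect_and_subscribe and run_session_loop are placeholders for the existing blocking client code, not real crate APIs):

use std::{thread, time::Duration};

// Placeholders standing in for the existing blocking client setup and run loop.
struct SessionHandle;
fn connect_and_subscribe() -> SessionHandle { SessionHandle }
fn run_session_loop(_session: SessionHandle) {
    loop { thread::sleep(Duration::from_secs(1)); }
}

#[tokio::main]
async fn main() {
    let session = connect_and_subscribe();
    // The application owns the only runtime; the blocking session loop runs on a
    // dedicated blocking thread so it cannot stall the async executor.
    let join = tokio::task::spawn_blocking(move || run_session_loop(session));
    // ...the rest of the application stays async here...
    let _ = join.await;
}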

thosaa commented 2 years ago

Nice work @ctron! However, I don't think that a fix with breaking changes is the right solution, but I could be wrong. We don't use Tokio as the runtime currently, but if @locka99 could confirm that this will be the solution to the disconnect, I will of course make the switch to Tokio.

svanharmelen commented 2 years ago

Please see #197, which is trying to add a full async client. Maybe that also solves this problem, or at least the fix can be added there more easily without impacting the existing client 🤷🏻‍♂️

ctron commented 2 years ago

> Please see #197, which is trying to add a full async client. Maybe that also solves this problem, or at least the fix can be added there more easily without impacting the existing client 🤷🏻‍♂️

I fully agree that providing a truly async client should be the future goal. However, this goal has been dragging on for a while now, so I really appreciate the effort!

But I also think that there should be a fix for the current code base too, as this issue makes the client unusable from my point of view.

lovasoa commented 2 years ago

@thosaa: Could you test my pull request at https://github.com/locka99/opcua/pull/201 ?

opcua = { git = "https://github.com/lovasoa/opcua", branch = "fix-zombie-tasks" }

I fixed the propagation of IO errors, and the client should now properly reconnect. If it doesn't, then could you post the full verbose client logs (with RUST_OPCUA_LOG=trace)?

ctron commented 2 years ago

Hm, a lot goes wrong with this one. The on_session_closed handler gets called without a tokio runtime, which causes issues/a panic in the actual application using opcua. It also doesn't reconnect. However, that might be due to the panic.

I am not sure if this can be properly salvaged without using a truly async approach, or at least moving control of the tokio runtime outside of the opcua client.

What I would need to do now is to spin up yet another Tokio runtime and pass that through to all the parts of the application.

So I think this should be solved by:

lovasoa commented 2 years ago

I hadn't thought that people might rely on having a tokio context inside on_session_closed, as there doesn't seem to be any guarantee that it will be available anywhere. It's probably not a great idea to rely on it, but I understand there is a lot of blocking code in async contexts in this library anyway, and you may have no other choice.

I changed it back anyway, so that you'll have the context available in on_session_closed. And since the connection thread won't panic, it won't prevent it from reconnecting. Can you try it on your codebase and tell me if it works correctly?
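
For example, with the context guaranteed, a handler body could hand work off to the ambient runtime. A sketch, where notify_backend is a made-up async function:

use tokio::runtime::Handle;

// Made-up async work to run when the session closes.
async fn notify_backend() { /* e.g. raise an alert somewhere */ }

fn on_session_closed_body() {
    // Relies on the ambient tokio context; this is exactly the call that would
    // panic if the callback were invoked outside a runtime.
    Handle::current().spawn(notify_backend());
}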

> I am not sure if this can be properly salvaged without using a truly async approach, or at least moving control of the tokio runtime outside of the opcua client.

This will have to be done, but the codebase is quite hard to work with, with multiple runtimes running and a lot of manual synchronization between tasks. So I started by simplifying the logic and fixing existing bugs.

milgner commented 2 years ago

I wonder whether this is still an issue with the current master. I saw that the version of write_bytes_task that @ctron referenced has been changed in the meantime, so maybe there's a chance that the connection loss is now detected correctly?