thosaa opened 2 years ago
Just FYI, we too sometimes see these broken pipe errors in our logs, but haven't had a chance to take a closer look yet...
I had the same problem. I connected to a Siemens 840D OPC-UA server; after about 8 minutes the server dropped the connection, but the client crashed without retrying.
I am having the same issue. I see the following errors in the log:
[2022-04-21T16:18:08Z ERROR opcua::client::comms::tcp_transport] Write bytes task IO error Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
I guess the client doesn't figure out that the connection is gone.
Taking a look at the code, I guess this error is just ignored:
I just tried a fix by setting the writer state to finished in case of an error.
However, that results in one of the two Tokio runtimes being dropped in an async context, which panics.
I tried removing both Tokio runtimes, re-using an external one (e.g. #[tokio::main]) instead, which works for my use case, but requires users to adopt the full Tokio/async model too.
I created a PR, to showcase the changes.
Nice work @ctron! I don't think a fix with breaking changes is the right solution, however, but I could be wrong. We don't use Tokio as the runtime currently, but if @locka99 could confirm that this will be the solution to the disconnect, I will of course make the switch to Tokio.
Please see #197 which is trying to add a full async client. Maybe that also solves this problem, or at least the fix can be added there more easily without impacting the existing client 🤷🏻‍♂️
I fully agree that providing a truly async client should be the future goal. However, that goal has been dragging on for a while now, so I really appreciate the effort!
But I also think that there should be a fix for the current code base too, as this issue makes the client unusable from my point of view.
@thosaa : Could you test my pull request at https://github.com/locka99/opcua/pull/201 ?
opcua = { git = "https://github.com/lovasoa/opcua", branch = "fix-zombie-tasks" }
I fixed the propagation of IO errors, and the client should now properly reconnect. If it doesn't, could you post the full verbose client logs (with RUST_OPCUA_LOG=trace)?
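For reference, one way to capture those verbose logs, assuming the client is run via cargo and logging goes to stderr:

```shell
# Run the client with full opcua trace logging and save it to a file
RUST_OPCUA_LOG=trace cargo run 2> client-trace.log
```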
Hm, a lot goes wrong with this one. The on_session_closed handler gets called without a Tokio runtime, which causes issues/panics in the actual application using opcua. It also doesn't reconnect; however, that might be due to the panic.
I am not sure if this can be properly salvaged without using a truly async approach, or at least moving control of the Tokio runtime outside of the opcua client.
What I would need to do now is to spin up yet another Tokio runtime and pass it through to all the parts of the application.
So I think this should be solved by:
I hadn't thought that people might rely on having a tokio context inside on_session_closed, as there doesn't seem to be any guarantee that it will be available anywhere. It's probably not a great idea to rely on it, but I understand there is a lot of blocking code in async contexts in this library anyway, and you may have no other choice.
I changed it back anyway, so that you'll have the context available in on_session_closed. And since the connection thread won't panic, it won't prevent the client from reconnecting. Can you try it on your codebase and tell me if it works correctly?
I am not sure if this can be properly salvaged, without using a truly async approach. Or at least moving the control of the tokio runtime outside of the opcua client.
This will have to be done, but the codebase is quite hard to work with, with multiple runtimes running and a lot of manual synchronization between tasks. So I started with simplifying the logic and fixing existing bugs.
I wonder whether this is still an issue with the current master? I saw that the version of write_bytes_task that @ctron referenced has been changed in the meantime, so maybe there's a chance that the connection loss is now detected correctly?
Hi, I'm starting a client that connects to a Microsoft OPC PLC server (mcr.microsoft.com/iotedge/opc-plc).
The client connects, subscribes, and gets event notifications; however, I'm having an issue with detecting a lost connection and reconnecting. I'm using the current master branch, but have seen the same behaviour in 0.8 and 0.9.
The client is started with quite aggressive retry and time-out settings:
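(The original snippet isn't shown in this thread; a minimal sketch of such a configuration, with builder method names assumed from the opcua 0.9 client API and illustrative values rather than the reporter's actual settings, might look like this:)

```rust
use opcua::client::prelude::*;

fn build_client() -> Client {
    // Method names assumed from the opcua ClientBuilder API.
    ClientBuilder::new()
        .application_name("test-client")
        .application_uri("urn:test-client")
        .session_retry_limit(-1)      // retry forever
        .session_retry_interval(1000) // retry every second
        .session_timeout(5000)        // aggressive session timeout (ms)
        .trust_server_certs(true)
        .create_sample_keypair(true)
        .client()
        .expect("invalid client configuration")
}
```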
And I have set up a callback on connection state change:
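(Again, the actual snippet isn't shown; a sketch of wiring this up, assuming the ConnectionStatusCallback type and setter from the opcua client callbacks module, with unverified signatures:)

```rust
use opcua::client::prelude::*;

// Sketch only: exact types and signatures are assumed, not taken
// from the reporter's code.
fn register_callback(session: &mut Session) {
    session.set_connection_status_callback(ConnectionStatusCallback::new(|connected| {
        println!("connection status changed, connected = {}", connected);
    }));
}
```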
When I close the OPC server with
docker stop {container_id}
the client stops getting events and starts keep-alive attempts. This is what the log states:
And the connection_status_callback is not called.
Have I missed a configuration?
There is a TODO in session/session_state.rs where it might be feasible to get a connection-lost error back: https://github.com/locka99/opcua/blob/aaf73a2c729981aa676957871f3fb3b8cfef717c/client/src/session/session_state.rs#L350
I would like to help with an implementation and a pull request, but I need some guidance on how this should be done, and on whether the client should already be able to reconnect automatically.
When I start the docker container again, the log just continues as above and no new connection is made. This might just be a symptom of not detecting the lost connection, however.
Regards Thomas