locka99 / opcua

A client and server implementation of the OPC UA specification written in Rust
Mozilla Public License 2.0

Possible performance issue / best practices #139

Closed milgner closed 2 years ago

milgner commented 2 years ago

After familiarizing myself enough with the library to get our PoC application working, I decided to look into why, when using this library as a client, it seems to be a lot slower than my previous experience with, for example, gopcua, even though the code is written in Rust.

Right now I am looking at a TCP dump (tcp_performance.csv) and there are noticeable delays between receiving the CreateSessionResponse and sending out the ActivateSessionRequest in Session::connect_and_activate, whereas the time between CreateSessionRequest and CreateSessionResponse, i.e. the time it takes for the server to respond, is barely noticeable. This suggests to me that the problem lies with the client code and not with the server.

Looking at the whole dump, it is noticeable that the delay between a response coming in and the next request going out is always more or less 0.05 seconds. But I chose Session::connect_and_activate as an example because nothing else is happening between these two calls and there is a noticeable delay nonetheless.

Are there any best practices to follow to get rid of this? Maybe I have to tweak some Tokio runtime parameters or some such to make it faster? Is anyone else experiencing the same problem by any chance?

milgner commented 2 years ago

Just to try the low-hanging fruit, I tried upgrading Tokio to the latest version, 1.12. Unfortunately that didn't change anything; I will continue to investigate.

milgner commented 2 years ago

Found the issue. It seems my guess about Tokio having something to do with it wasn't half bad. As I was integrating the library into a Ruby application, there wasn't any #[tokio::main] attribute on my initialization function. After adding that, everything now runs as fast as one would expect :rocket:
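
For reference, a minimal sketch of what that looks like (the function body is hypothetical; the point is just that a Tokio runtime has to be running for the client's async tasks to make progress):

// Hypothetical entry point: the attribute wraps the function in a Tokio
// runtime so the async tasks spawned by the client have an executor.
#[tokio::main]
async fn main() {
    // ... build the OPC UA client and connect to the server here ...
}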

sysarcher commented 2 years ago

@milgner thanks for the comments here... I was actually following your issue very closely (out of curiosity) even though I'm not qualified to help out here (no experience with this lib at all, but interest nevertheless)....

Thanks for narrating the thought process throughout

milgner commented 2 years ago

Unfortunately, that was a short-lived success. No idea why it worked fine that one time (a full recursive browse through the server), but just a few minutes later everything was back to the initial behaviour. Any ideas welcome. To ensure that the Ruby integration isn't somehow to blame, I have also created a benchmark test that browses the server recursively. Will dig deeper behind the scenes now...

schroeder- commented 2 years ago

Can you check if multi_threaded_executor or single_threaded_executor changes anything?

in config:

performance:
  single_threaded_executor: false

or in code:

ClientBuilder::new()
    ....
    .multi_threaded_executor()

milgner commented 2 years ago

Can you check if multi_threaded_executor or single_threaded_executor changes anything?

ClientBuilder::new() .... .multi_threaded_executor()

Thank you for the suggestion! Gave it a try but unfortunately it doesn't change anything in terms of performance. :worried:

schroeder- commented 2 years ago

I think I found the root cause. In client/src/session/session_state.rs there is a sync polling period of 50 ms that is hit on every request you make while waiting for the response: const SYNC_POLLING_PERIOD: u64 = 50; (see SessionState::wait_for_sync_response). As a workaround you can lower that value.
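
As an illustration of the mechanism (placeholder types, not the actual library code), the synchronous side effectively checks a shared slot for the response and sleeps one polling period between checks, so every request can waste up to 50 ms even if the response is already sitting on the queue:

use std::{
    sync::{Arc, Mutex},
    thread,
    time::{Duration, Instant},
};

const SYNC_POLLING_PERIOD: u64 = 50; // milliseconds

// Placeholder types standing in for the library's message queue and response.
type Response = String;
type SharedQueue = Arc<Mutex<Option<Response>>>;

// Rough stand-in for the wait loop: poll the shared slot until the response
// shows up or the timeout expires.
fn wait_for_sync_response(queue: &SharedQueue, timeout: Duration) -> Option<Response> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(response) = queue.lock().unwrap().take() {
            return Some(response);
        }
        if Instant::now() >= deadline {
            return None;
        }
        // Every iteration pays up to one full polling period of idle time,
        // even if the response has already arrived on the socket.
        thread::sleep(Duration::from_millis(SYNC_POLLING_PERIOD));
    }
}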

milgner commented 2 years ago

I think I found the root cause. In client/src/session/session_state.rs there is a sync polling period of 50 ms that is hit on every request you make while waiting for the response: const SYNC_POLLING_PERIOD: u64 = 50; (see SessionState::wait_for_sync_response). As a workaround you can lower that value.

Success! Just as you wrote that, I had arrived at the same place in the code. Lowering it from 50 to 5 sped up the browse from 24 seconds to 3 seconds. Further reducing it to 1 results in even more speed-up, making everything go through in 1 second.

I saw in the README that there are already plans to get rid of the sync and polling code. Is there anything concrete yet? Anything I can do to assist? Being a noob in Rust and not having read through the whole code, I'm hesitant to make suggestions, but if there are existing ideas, I'd be willing to take a shot at implementing something.

milgner commented 2 years ago

Just had a look at the functionality in the std::sync module. What do you think about adding a Condvar to the MessageQueue that checks whether data is available and gets notified as it comes in? Not sure whether it will work without looking into how Tokio and threads work exactly, but I think I'll give it a try.

Basically I'm thinking about replacing the polling mechanism with https://doc.rust-lang.org/std/sync/struct.Condvar.html#method.wait_timeout
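
A minimal sketch of that Condvar idea with placeholder types (the real MessageQueue is more involved, and as the next comment explains, blocking like this inside a Tokio worker thread turned out to be the catch):

use std::sync::{Condvar, Mutex};
use std::time::Duration;

type Response = String;

// Placeholder queue: a slot for the response plus a Condvar signalled on arrival.
struct SyncQueue {
    slot: Mutex<Option<Response>>,
    arrived: Condvar,
}

impl SyncQueue {
    // Called by the side that receives responses from the server.
    fn push(&self, response: Response) {
        *self.slot.lock().unwrap() = Some(response);
        self.arrived.notify_one();
    }

    // Called by the synchronous caller: block until a response is pushed or
    // the timeout elapses, instead of sleeping in a polling loop.
    fn wait(&self, timeout: Duration) -> Option<Response> {
        let guard = self.slot.lock().unwrap();
        let (mut guard, result) = self
            .arrived
            .wait_timeout_while(guard, timeout, |slot| slot.is_none())
            .unwrap();
        if result.timed_out() {
            None
        } else {
            guard.take()
        }
    }
}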

milgner commented 2 years ago

I took a few shots at implementing this, but now that I have realized how Tokio uses a green-thread mechanism for concurrency, I am not surprised that my initial approach did not work: blocking in the wait_for_message part, even if it's just a timeout, will ultimately prevent the writing part of the message queue from running.

It's possible that there's still a way that I don't see, though. Performance with the polling interval reduced from 50 to 1 is passable right now, but I think there is still a lot of performance to be gained by converting to a reactive async/await-based approach.

With the release of Rust 2021 just around the corner, what do you think about enhancing the API accordingly? If desired, there could be a backwards-compatible API that uses Future::poll very much like the current implementation, but it would also be possible to just return the real future to async-aware API clients.
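
One way such a dual API could look, purely as a sketch with hypothetical names: expose the response future directly to async-aware callers and keep a thin blocking wrapper for the existing sync API.

use std::time::Duration;
use tokio::sync::oneshot;

type Response = String;

// Hypothetical async-first call: the caller simply awaits the response channel.
async fn send_request_async(rx: oneshot::Receiver<Response>) -> Option<Response> {
    rx.await.ok()
}

// Backwards-compatible sync wrapper: block the calling (non-Tokio) thread on
// the same future, with a timeout, using a handle to the existing runtime.
fn send_request_sync(
    handle: &tokio::runtime::Handle,
    rx: oneshot::Receiver<Response>,
    timeout: Duration,
) -> Option<Response> {
    handle.block_on(async {
        tokio::time::timeout(timeout, send_request_async(rx))
            .await
            .ok()
            .flatten()
    })
}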

locka99 commented 2 years ago

The client-side code probably has room for improvement in its implementation. It's synchronous on the surface but async underneath, and the two halves aren't married as optimally as they could be. In part this is because the code somewhat predates tokio, but also because marrying a sync API to an async tokio runtime is a bit clunky.

The main points of potential latency:

  1. The add_request() call onto the message queue: an unbounded tokio sender takes the request and sends it to a tokio task for receiving.
  2. The tokio task TcpTransport.spawn_writing_task, which recv's the request and writes it to an output buffer.
  3. The tokio task TcpTransport.write_bytes_task, where bytes are written from the buffer to the socket. This task awaits write_all to complete.
  4. The tokio task TcpTransport.spawn_reading_task, which reads the response and stores it on the queue.
  5. The client's polling loop, which waits on a fixed interval and reads the queue.

If the tokio executor threads were busy then perhaps they could be causing delays. Additionally, there is a lot of locking going on, which isn't optimal either.

But the obvious latency is that polling loop. Sleeping like that could result in up to 50 ms of wasted time per request. A fix for that would improve performance a lot, e.g. creating a fixed-size channel per request: the caller would recv_timeout() on it and the tokio side would send the Response msg directly through the channel to wake it up.
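
A minimal sketch of that per-request channel idea using a std mpsc channel (placeholder types, not the eventual implementation): the synchronous caller blocks on recv_timeout() and is woken the moment the tokio side pushes the response through.

use std::sync::mpsc;
use std::time::Duration;

type Response = String;

// One channel per request: the sync caller keeps the receiver, the tokio
// side gets the sender.
fn make_request_channel() -> (mpsc::SyncSender<Response>, mpsc::Receiver<Response>) {
    // Capacity 1 is enough, since each request gets exactly one response.
    mpsc::sync_channel(1)
}

// Sync side: wakes as soon as the response is sent, instead of on the next
// 50 ms polling tick; returns None if the request times out.
fn wait_for_response(rx: &mpsc::Receiver<Response>, timeout: Duration) -> Option<Response> {
    rx.recv_timeout(timeout).ok()
}

// Tokio side: try_send never blocks the async task, and with capacity 1 and
// a single response per request it cannot fail on a full buffer.
fn deliver_response(tx: &mpsc::SyncSender<Response>, response: Response) {
    let _ = tx.try_send(response);
}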

Outside of this, another thing to try is to fire up Wireshark and capture traffic from the golang impl and the Rust impl to see if there are any obvious differences in the messages they send, in chunking, or in other potential causes of delay. It may be that Rust is sending more messages or doing something with token renewal that is non-optimal.

locka99 commented 2 years ago

I created a branch remove-client-polling to demonstrate the idea. The sync code now creates a channel and hands the sender to the tokio side, which sends to it when the response is received. That means the polling loop collapses down to a simple recv_timeout().

locka99 commented 2 years ago

The branch works pretty well; let me know if it improves the metrics and then we can land it.

milgner commented 2 years ago

Awesome! I just gave it a try and can confirm that the performance is now below 1 second (around 0.8 s) for the whole browse process, which took 24 seconds with the original polling code! :tada:

locka99 commented 2 years ago

I've landed it there; integration tests & unit tests pass, so I'm hoping it has no downsides in normal use.