Really bad tcp performance on windows.

tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...

https://tokio.rs

Really bad tcp performance on windows. #6614

Closed dtzxporter closed 4 months ago

dtzxporter commented 4 months ago

Version: tokio 1.37. I tested 1.38 as well, no change.

Platform: Windows 11, 64-bit.

Description: TcpStream seems to perform poorly on Windows, with most of the time spent waiting on a Condvar.

I tried this code:

https://github.com/dtzxporter/hydra/tree/main/hydra-test-main
https://github.com/dtzxporter/hydra/tree/main/hydra-test-client

I expected to see this happen:

Performance near 100k rps.

Instead, this happened:

2k rps, with 99% of the time spent waiting on a Condvar.

Darksonn commented 4 months ago

I don't have the time to look into this right now, but a few notes:

The condition variable is how Tokio puts the thread to sleep when it has nothing to do.
Are you doing IO from a thread that is not a Tokio worker thread?
Are you observing a lot of movement between worker threads?
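
To illustrate the first note: a worker that parks on a condition variable while idle will show up in a profiler as "waiting on a Condvar", even though the real bottleneck is whatever delays the work from arriving. The sketch below is a std-only analogy, not Tokio's actual scheduler code; the `Queue` type and the `idle_time_before_first_job` function are hypothetical names made up for this example.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a worker's run queue (not Tokio's real data structure).
type Queue = Arc<(Mutex<VecDeque<u32>>, Condvar)>;

/// Starts a worker that sleeps on the condvar while its queue is empty and
/// returns how long it took until the worker received its first job.
fn idle_time_before_first_job(delay: Duration) -> Duration {
    let queue: Queue = Arc::new((Mutex::new(VecDeque::new()), Condvar::new()));
    let worker_queue = Arc::clone(&queue);
    let start = Instant::now();

    let worker = thread::spawn(move || {
        let (lock, cvar) = &*worker_queue;
        let mut jobs = lock.lock().unwrap();
        // Idle: nothing to run, so block here.
        // A profiler attributes this waiting time to the Condvar.
        while jobs.is_empty() {
            jobs = cvar.wait(jobs).unwrap();
        }
        jobs.pop_front();
        start.elapsed()
    });

    // Simulate slow I/O: the job only arrives after `delay`.
    thread::sleep(delay);
    let (lock, cvar) = &*queue;
    lock.lock().unwrap().push_back(1);
    cvar.notify_one();

    worker.join().unwrap()
}
```

In this model the worker does almost no work of its own; nearly all of the measured time is spent parked, which is consistent with a profile dominated by Condvar waits when the socket is the slow part.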

hawkw commented 4 months ago

Could this be Nagle's algorithm, perhaps?

dtzxporter commented 4 months ago

The flow is:

A framed TcpStream split into read/write pairs, owned by two tokio tasks (send/recv).
All tasks have flume channels to send/recv data on; the socket-owning task is the only thing with access to the socket.
There are 8 spawned tasks sending messages to the channel for the sender (outbound sends).
The recv task is forwarding the incoming messages to one of 8 channels for the spawned tasks.

FWIW: I don't think it's an issue with the 8 spawned tasks hammering the flume channel, because cutting out the socket yields 26 million msg/s between the 'tasks' themselves, ignoring the outbound sends.

Let me know if there is any other information that would help with diagnosing this issue! I'm available to debug / do whatever, I just don't know enough about the internals at the moment to get a baseline of where to start.

Simple diagram: [image: Untitled drawio]

dtzxporter commented 4 months ago

> Could this be Nagle's algorithm, perhaps?

Welp, that was it... setting nodelay made the issue go away!
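
For anyone landing here with the same symptom, a minimal sketch of the fix follows. It uses `std::net::TcpStream` so that it runs without extra dependencies; `tokio::net::TcpStream` exposes a `set_nodelay` method as well. The helper name `connect_nodelay` is made up for this example, and where exactly the linked test code sets the option is not shown in this thread.

```rust
use std::io;
use std::net::{TcpStream, ToSocketAddrs};

/// Connects to `addr` and sets TCP_NODELAY, which disables Nagle's algorithm
/// so that small writes are sent right away instead of being coalesced.
fn connect_nodelay<A: ToSocketAddrs>(addr: A) -> io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    stream.set_nodelay(true)?;
    Ok(stream)
}
```

The option is per socket, so in a client/server test like the one above it likely needs to be set on both the connecting side and on each accepted stream.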

hawkw commented 4 months ago

It's always Nagle! :)