Windows is implemented differently from Unix due to the IOCP model, so there's bound to be somewhat more overhead. Right now Windows uses a blanket 64KB buffer for all reads on sockets, and if you have 10000 sockets that's about 655 MB right there (close to the 700MB you're seeing).
The behavior on Windows should likely be smarter than just "always allocate 64KB buffers", ideally along with the ability to tune the buffer size per connection.
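For reference, a quick back-of-the-envelope check of that figure (plain arithmetic, not mio code; it assumes one 64 KiB buffer outstanding per connection):

fn main() {
    let per_conn: u64 = 64 * 1024;   // 65,536 bytes per in-flight read
    let total = per_conn * 10_000;   // 655,360,000 bytes across 10,000 sockets
    println!("{} bytes ~= {} MB", total, total / 1_000_000); // ~= 655 MB
}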
That sounds pretty much like the behaviour I'm experiencing. The 655MB + base memory + the bytes copied to my local buffers might add up to ~700MB (I don't remember the actual number, but let's assume it's around 650-700MB).
In my case I send a payload of around 14 bytes, so 64KB is far too much. Is there a chance to optimize this, or is the IOCP model so broken that it has to be like this?
How does libuv do this without using so much memory? Maybe we can look at their tricks and learn from them?
Oh, there's nothing inherently bad about IOCP here; we just have some heuristics about when to schedule I/O which probably need a large amount of tweaking, or the ability to be configured.
In libuv the user is fully responsible for memory allocation and for ensuring that buffers are kept alive until I/O operations complete, so that is not exactly comparable with what is in mio.
But this makes me wonder: why is overlapped I/O used in the Windows implementation in the first place? I understand that it is the preferred way to do things on Windows, but is that still true if you don't expose a completion-based API to the end user and need additional copies anyway? Wouldn't non-blocking I/O (without overlapped) be more suitable given the mio interface?
I think this is another sign that mio implements IOCP support at the wrong level of abstraction. I've tried to explain it here. Let me try to elaborate on that.
The biggest point here is that it's hard to find a good heuristic for buffer size without knowing the application domain.
But if Windows support were implemented at the rotor-stream level, the communication between the application and the I/O handling code would be in terms of the Expectation structure, which looks like this (simplified):
pub enum Expectation {
    Bytes(/* max_bytes: */ usize),
    Delimiter { delimiter: &'static [u8], max_bytes: usize },
    Sleep,
}
Let me explain a little bit how it is used in the HTTP protocol (simplified):

1. The HTTP code sends Bytes(1) to rotor-stream, which basically means "read into the smallest buffer" (*).
2. rotor-stream reads whatever is actually available, so we don't have to reallocate the buffer if we received only a few bytes, we just move the pointer.
3. After parsing the headers it finds Content-Length and returns a Bytes(content_length) expectation.
4. Once the whole request is read it returns a Sleep expectation (which would mean no buffer is needed for the read using IOCP as long as the request is being processed).

Most protocols could be broken down into these kinds of primitives (maybe the current set of expectations is incomplete or wrong, we just need to figure that out), improving the control over the size of the buffer. This also allows reusing a buffer for multiple operations on the same connection (a rough sketch of the flow follows below).
(*) We don't use Delimiter(b"\r\n\r\n") there because there are some clients that expect b"\n\n" to be a valid delimiter for HTTP headers, despite the spec.
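To make the HTTP flow above a bit more concrete, here is a rough, hypothetical sketch of a handler handing expectations back to the I/O layer. This is not rotor-stream's real API; the Expectation enum from above is repeated so the snippet stands alone, and the state and function names are made up:

#[allow(dead_code)]
pub enum Expectation {
    Bytes(usize),
    Delimiter { delimiter: &'static [u8], max_bytes: usize },
    Sleep,
}

enum HttpState {
    ReadingHeaders,
    ReadingBody { remaining: usize },
    Processing,
}

// The I/O layer asks the protocol what to do next; the answer tells it how
// large a read buffer (if any) needs to be outstanding on this connection.
fn next_expectation(state: &HttpState) -> Expectation {
    match state {
        // Headers: ask for "at least one byte"; the I/O layer may hand us more.
        HttpState::ReadingHeaders => Expectation::Bytes(1),
        // Body: Content-Length is known, so ask for exactly that much.
        HttpState::ReadingBody { remaining } => Expectation::Bytes(*remaining),
        // Request is being processed: no read buffer needs to be outstanding.
        HttpState::Processing => Expectation::Sleep,
    }
}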
An alternative like adding a tunable sock.set_iocp_buffer_size(n) is not very useful. Linux users will rarely get it right because they don't need it and can't actually test it. And Windows users will not get it right either, because changing the buffer size doesn't influence correctness, so forgetting to change the value in some state transition (like an inactive connection becoming active, or vice versa) will only be visible after a large amount of profiling.
( @alexcrichton , @carllerche looking forward to discussing these things in person at Rust Belt Rust :) )
@tailhook yeah I think the best solution here would be providing more hooks into the I/O layer to tune buffer size and have more control over when buffer reads/writes are scheduled. It actually shouldn't be too hard to do so as a Windows-specific extension trait I think, but the pieces are indeed tricky!
If we made it so the writer and reader knew how to chain buffers, we could just have one reusable slab for all connections.
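A minimal sketch of that idea, assuming buffers are only checked out while an operation is actually in flight and returned afterwards (the names here are made up, this is not mio's API):

// One shared pool: an in-flight read checks a buffer out, a completed read
// hands it back, so idle connections hold no buffer at all.
struct BufferPool {
    free: Vec<Vec<u8>>,
    buf_size: usize,
}

impl BufferPool {
    fn new(buf_size: usize) -> Self {
        BufferPool { free: Vec::new(), buf_size }
    }

    /// Take a buffer for an in-flight operation, reusing an old one if possible.
    fn checkout(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    /// Return a buffer once the operation has completed and its bytes were consumed.
    fn checkin(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        buf.resize(self.buf_size, 0);
        self.free.push(buf);
    }
}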
@sateffen At the end of the day, IOCP requires that you pass ownership of the data to the OS, so you will need to have memory allocated for each in-flight operation. The best we can do is to tune things to balance memory / speed.
@alexcrichton Is there a reason you picked 64kb? Could that be reduced a bit?
@carllerche I believe it was because long ago libstd picked 64KB for a buffer size because long ago libuv picked a 64KB buffer size because purportedly long ago Linux worked best with 64KB buffers.
In other words, we should be free to change at will, I see no reason to keep it so high.
libuv does a read in the following pattern: it first submits a zero-byte read (an overlapped WSARecv with a zero-length buffer), and once that completes, which means data is waiting in the socket's receive buffer, it allocates a buffer and does a synchronous non-blocking read to drain it.
We do the same in DotNetty (.NET port of netty). That way you can have lots (200K+) of connections with 0 buffers reserved for any of them. A somewhat interesting optimization is to allow tuning the size of the async buffer, which might be beneficial when people are fine sacrificing some memory if they can estimate the size of the expected messages.
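For illustration only, here is a hedged sketch of the first half of that pattern, assuming the winapi crate and a socket that is already associated with a completion port; this is not mio's or libuv's actual code:

#[cfg(windows)]
mod zero_byte_read {
    use std::{io, ptr};
    use winapi::shared::ws2def::WSABUF;
    use winapi::um::minwinbase::OVERLAPPED;
    use winapi::um::winsock2::{WSAGetLastError, WSARecv, SOCKET};

    const WSA_IO_PENDING: i32 = 997; // same value as ERROR_IO_PENDING

    /// Step 1: post an overlapped read with a zero-length buffer. No memory is
    /// handed to the kernel; the completion only signals "data is available".
    pub unsafe fn post_zero_byte_read(sock: SOCKET, overlapped: *mut OVERLAPPED) -> io::Result<()> {
        let mut buf = WSABUF { len: 0, buf: ptr::null_mut() };
        let mut flags: u32 = 0;
        let mut received: u32 = 0;
        let rc = WSARecv(sock, &mut buf, 1, &mut received, &mut flags, overlapped, None);
        if rc == 0 || WSAGetLastError() == WSA_IO_PENDING {
            Ok(())
        } else {
            Err(io::Error::last_os_error())
        }
    }

    // Step 2 (on completion): the socket's receive buffer has data, so allocate
    // (or check out) a buffer *now* and drain it with a normal non-blocking read.
}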
I am... amazed that works, but if it does it would be great 👍 Thanks for the tip @nayato I will look more into the two examples you pointed out.
@nayato Do you know if this trick works w/ writing buffers too?
Thanks for the info @nayato! I've posted a PR at https://github.com/carllerche/mio/pull/471
It looked like libuv didn't employ this trick for TCP writes, nor for UDP sends/recvs. I've left those with the previous strategy of allocating buffers for now.
No, you can't submit a zero-byte buffer to WSASend, unfortunately.
@alexcrichton, another optimization you might want to consider is to use SetFileCompletionNotificationModes to indicate that if an async operation completes synchronously, an I/O completion should not be triggered. There's one bug with that for UDP, though, which I know was fixed in Win 8, but I'm not sure if it was back-ported to Win 7. There's a comment on that in libuv: https://github.com/libuv/libuv/blob/v1.x/src/win/winsock.c#L271.
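A hedged sketch of what that might look like, again assuming the winapi crate exposes SetFileCompletionNotificationModes (this is not mio code):

#[cfg(windows)]
fn skip_completion_packet_on_success(stream: &std::net::TcpStream) -> std::io::Result<()> {
    use std::os::windows::io::AsRawSocket;
    use winapi::um::winbase::SetFileCompletionNotificationModes;
    use winapi::um::winnt::HANDLE;

    // Flag value from the Windows SDK: if an overlapped operation completes
    // synchronously, don't also queue a completion packet to the port.
    const FILE_SKIP_COMPLETION_PORT_ON_SUCCESS: u8 = 0x1;

    let handle = stream.as_raw_socket() as usize as HANDLE;
    let ok = unsafe { SetFileCompletionNotificationModes(handle, FILE_SKIP_COMPLETION_PORT_ON_SUCCESS) };
    if ok == 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

The caller then has to be prepared to handle results inline whenever an overlapped call returns success immediately, since no completion packet will arrive for it.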
@nayato indeed yeah! I saw a lot of comments related to that in libuv. I'll open a separate tracking issue for that.
Ok, I've published #476 as 0.6.1 of mio, so in theory the memory usage with lots of TCP reads in flight should be much less
Thanks all. I believe that we've done as much as we can to resolve this issue. I am going to close it now. If there are further ideas of how to improve the memory situation on windows, a new issue should be opened.
Hey,
as mentioned in #415, Windows consumes a huge amount of memory for a simple echo server with lots (5-10k) of connections sending and receiving data.
On this simple project, creating around 10k connections consumes ~700MB of RAM; on a Linux machine it takes around 7MB. After all connections are closed, the memory is freed again.
To reproduce:
Setup:
This will spin up 10k connections to localhost:8888 (the waiting rust echo server).
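The actual setup code isn't included here; as a rough stand-in, a client along these lines (plain std, blocking sockets; the payload and address are just placeholders matching the description above) exercises the same scenario:

use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Open ~10k connections to the waiting echo server.
    let mut conns = Vec::new();
    for _ in 0..10_000 {
        conns.push(TcpStream::connect("127.0.0.1:8888")?);
    }
    // Keep a ~14-byte payload bouncing on every connection.
    let payload = b"hello, echo!!!"; // 14 bytes
    let mut reply = [0u8; 14];
    loop {
        for conn in &mut conns {
            conn.write_all(payload)?;
            conn.read_exact(&mut reply)?;
        }
    }
}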
I'm experiencing this on a Windows 10 laptop, simply observed with the Task Manager. As a Linux reference I've used a private root server and a Cloud9 machine; neither of them showed this behaviour.
As soon as I've got some time I'll update my compiler and try to create some simpler code to reproduce this, but currently my time is somewhat limited.
If any questions left, just ask :)
Cheers