Windows is implemented differently from Unix due to the IOCP model, so there's bound to be somewhat more overhead. Right now Windows uses a blanket 64KB buffer for all reads on sockets, and if you have 10000 sockets that's about 655 MB right there (close to the 700MB you're seeing).
The behavior on Windows should likely be smarter than just "always allocate 64KB buffers", ideally along with the ability to tune the buffer size per connection.
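For reference, a quick back-of-the-envelope check of that figure (plain arithmetic, not mio code; it assumes one 64 KiB buffer outstanding per connection):

fn main() {
    let per_conn: u64 = 64 * 1024;   // 65,536 bytes per in-flight read
    let total = per_conn * 10_000;   // 655,360,000 bytes across 10,000 sockets
    println!("{} bytes ~= {} MB", total, total / 1_000_000); // ~= 655 MB
}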
That sounds pretty much like the behaviour I'm experiencing. The 655MB + base memory + the bytes copied to my local buffers might add up to ~700MB (I don't remember the actual number, but let's assume it's around 650-700MB).
In my case I send a payload of around 14 bytes, so 64KB is far too much. Is there a chance to optimize this, or is the IOCP model so broken that it has to be like this?
How does libuv do this without using so much memory? Maybe we can look at their tricks and learn from them?
Oh, there's nothing inherently bad about IOCP here; we just have some heuristics about when to schedule I/O which probably need a large amount of tweaking, or the ability to be configured.
In libuv the user is fully responsible for memory allocation and for ensuring that buffers are kept alive until I/O operations complete, so that is not exactly comparable with what is in mio.
But this makes me wonder: why is overlapped I/O used in the Windows implementation in the first place? I understand that it is the preferred way to do things on Windows, but is that still true if you don't expose a completion-based API to the end user and need additional copies anyway? Wouldn't non-blocking I/O (without overlapped) be more suitable given the mio interface?
I think this is another sign that mio implements IOCP support at the wrong level of abstraction. I've tried to explain it here. Let me try to elaborate on that.
The biggest point here is that it's hard to find a good heuristic for buffer size without knowing the application domain.
But if Windows support were implemented at the rotor-stream level, the communication between the application and the I/O handling code would be in terms of the Expectation structure, which looks like this (simplified):
pub enum Expectation {
    Bytes(/* max_bytes: */ usize),
    Delimiter { delimiter: &'static [u8], max_bytes: usize },
    Sleep,
}
Let me explain a little bit how it is used in the HTTP protocol (simplified):

1. The HTTP code sends Bytes(1) to rotor-stream, which basically means "read into the smallest buffer" (*).
2. rotor-stream reads whatever is actually available, so we don't have to reallocate the buffer if we received only a few bytes, we just move the pointer.
3. After parsing the headers it finds Content-Length and returns a Bytes(content_length) expectation.
4. Once the whole request is read it returns a Sleep expectation (which would mean no buffer is needed for the read using IOCP as long as the request is being processed).

Most protocols could be broken down into these kinds of primitives (maybe the current set of expectations is incomplete or wrong, we just need to figure that out), improving the control over the size of the buffer. This also allows reusing a buffer for multiple operations on the same connection (a rough sketch of the flow follows below).
(*) We don't use Delimiter(b"\r\n\r\n") there because there are some clients that expect b"\n\n" to be a valid delimiter for HTTP headers, despite the spec.
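To make the HTTP flow above a bit more concrete, here is a rough, hypothetical sketch of a handler handing expectations back to the I/O layer. This is not rotor-stream's real API; the Expectation enum from above is repeated so the snippet stands alone, and the state and function names are made up:

#[allow(dead_code)]
pub enum Expectation {
    Bytes(usize),
    Delimiter { delimiter: &'static [u8], max_bytes: usize },
    Sleep,
}

enum HttpState {
    ReadingHeaders,
    ReadingBody { remaining: usize },
    Processing,
}

// The I/O layer asks the protocol what to do next; the answer tells it how
// large a read buffer (if any) needs to be outstanding on this connection.
fn next_expectation(state: &HttpState) -> Expectation {
    match state {
        // Headers: ask for "at least one byte"; the I/O layer may hand us more.
        HttpState::ReadingHeaders => Expectation::Bytes(1),
        // Body: Content-Length is known, so ask for exactly that much.
        HttpState::ReadingBody { remaining } => Expectation::Bytes(*remaining),
        // Request is being processed: no read buffer needs to be outstanding.
        HttpState::Processing => Expectation::Sleep,
    }
}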
An alternative like adding a tunable sock.set_iocp_buffer_size(n) is not very useful. Linux users will rarely get it right because they don't need it and can't actually test it. And Windows users will not get it right either, because changing the buffer size doesn't influence correctness, so forgetting to change the value in some state transition (like an inactive connection becoming active, or vice versa) will only be visible after a large amount of profiling.
( @alexcrichton , @carllerche looking forward to discussing these things in person at Rust Belt Rust :) )
@tailhook yeah I think the best solution here would be providing more hooks into the I/O layer to tune buffer size and have more control over when buffer reads/writes are scheduled. It actually shouldn't be too hard to do so as a Windows-specific extension trait I think, but the pieces are indeed tricky!
If we made it so the writer and reader knew how to chain buffers, we could just have one reusable slab for all connections.
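A minimal sketch of that idea, assuming buffers are only checked out while an operation is actually in flight and returned afterwards (the names here are made up, this is not mio's API):

// One shared pool: an in-flight read checks a buffer out, a completed read
// hands it back, so idle connections hold no buffer at all.
struct BufferPool {
    free: Vec<Vec<u8>>,
    buf_size: usize,
}

impl BufferPool {
    fn new(buf_size: usize) -> Self {
        BufferPool { free: Vec::new(), buf_size }
    }

    /// Take a buffer for an in-flight operation, reusing an old one if possible.
    fn checkout(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    /// Return a buffer once the operation has completed and its bytes were consumed.
    fn checkin(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        buf.resize(self.buf_size, 0);
        self.free.push(buf);
    }
}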
@sateffen At the end of the day, IOCP requires that you pass ownership of the data to the OS, so you will need to have memory allocated for each in-flight operation. The best we can do is to tune things to balance memory / speed.
@alexcrichton Is there a reason you picked 64kb? Could that be reduced a bit?
@carllerche I believe it was because long ago libstd picked 64KB for a buffer size because long ago libuv picked a 64KB buffer size because purportedly long ago Linux worked best with 64KB buffers.
In other words, we should be free to change at will, I see no reason to keep it so high.
libuv does a read in the following pattern: it first submits a zero-byte read (an overlapped WSARecv with a zero-length buffer), and once that completes, which means data is waiting in the socket's receive buffer, it allocates a buffer and does a synchronous non-blocking read to drain it.
We do the same in DotNetty (.NET port of netty). That way you can have lots (200K+) of connections with 0 buffers reserved for any of them. A somewhat interesting optimization is to allow tuning the size of the async buffer, which might be beneficial when people are fine sacrificing some memory if they can estimate the size of the expected messages.
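For illustration only, here is a hedged sketch of the first half of that pattern, assuming the winapi crate and a socket that is already associated with a completion port; this is not mio's or libuv's actual code:

#[cfg(windows)]
mod zero_byte_read {
    use std::{io, ptr};
    use winapi::shared::ws2def::WSABUF;
    use winapi::um::minwinbase::OVERLAPPED;
    use winapi::um::winsock2::{WSAGetLastError, WSARecv, SOCKET};

    const WSA_IO_PENDING: i32 = 997; // same value as ERROR_IO_PENDING

    /// Step 1: post an overlapped read with a zero-length buffer. No memory is
    /// handed to the kernel; the completion only signals "data is available".
    pub unsafe fn post_zero_byte_read(sock: SOCKET, overlapped: *mut OVERLAPPED) -> io::Result<()> {
        let mut buf = WSABUF { len: 0, buf: ptr::null_mut() };
        let mut flags: u32 = 0;
        let mut received: u32 = 0;
        let rc = WSARecv(sock, &mut buf, 1, &mut received, &mut flags, overlapped, None);
        if rc == 0 || WSAGetLastError() == WSA_IO_PENDING {
            Ok(())
        } else {
            Err(io::Error::last_os_error())
        }
    }

    // Step 2 (on completion): the socket's receive buffer has data, so allocate
    // (or check out) a buffer *now* and drain it with a normal non-blocking read.
}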
I am... amazed that works, but if it does it would be great 👍 Thanks for the tip @nayato I will look more into the two examples you pointed out.
@nayato Do you know if this trick works w/ writing buffers too?
Thanks for the info @nayato! I've posted a PR at https://github.com/carllerche/mio/pull/471
It looked like libuv didn't employ this trick for TCP writes, nor for UDP sends/recvs. I've left those with the previous strategy of allocating buffers for now.
No, you can't submit a zero-byte buffer to WSASend, unfortunately.
@alexcrichton, another optimization you might want to consider is to use SetFileCompletionNotificationModes to indicate that if an async operation completes synchronously, an I/O completion should not be triggered. There's one bug with that for UDP, though, which I know was fixed in Win 8, but I'm not sure if it was back-ported to Win 7. There's a comment on that in libuv: https://github.com/libuv/libuv/blob/v1.x/src/win/winsock.c#L271.
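A hedged sketch of what that might look like, again assuming the winapi crate exposes SetFileCompletionNotificationModes (this is not mio code):

#[cfg(windows)]
fn skip_completion_packet_on_success(stream: &std::net::TcpStream) -> std::io::Result<()> {
    use std::os::windows::io::AsRawSocket;
    use winapi::um::winbase::SetFileCompletionNotificationModes;
    use winapi::um::winnt::HANDLE;

    // Flag value from the Windows SDK: if an overlapped operation completes
    // synchronously, don't also queue a completion packet to the port.
    const FILE_SKIP_COMPLETION_PORT_ON_SUCCESS: u8 = 0x1;

    let handle = stream.as_raw_socket() as usize as HANDLE;
    let ok = unsafe { SetFileCompletionNotificationModes(handle, FILE_SKIP_COMPLETION_PORT_ON_SUCCESS) };
    if ok == 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

The caller then has to be prepared to handle results inline whenever an overlapped call returns success immediately, since no completion packet will arrive for it.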
@nayato indeed yeah! I saw a lot of comments related to that in libuv. I'll open a separate tracking issue for that.
Ok, I've published #476 as 0.6.1 of mio, so in theory the memory usage with lots of TCP reads in flight should be much less
Thanks all. I believe that we've done as much as we can to resolve this issue. I am going to close it now. If there are further ideas of how to improve the memory situation on windows, a new issue should be opened.
Hey,
as mentioned in #415, Windows consumes a huge amount of memory for a simple echo server with lots (5-10k) of connections sending and receiving data.
On this simple project, creating around 10k connections consumes ~700MB of RAM; on a Linux machine it takes around 7MB. After all connections are closed, the memory is freed again.
To reproduce:
Setup:
This will spin up 10k connections to localhost:8888 (the waiting rust echo server).
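The actual setup code isn't included here; as a rough stand-in, a client along these lines (plain std, blocking sockets; the payload and address are just placeholders matching the description above) exercises the same scenario:

use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Open ~10k connections to the waiting echo server.
    let mut conns = Vec::new();
    for _ in 0..10_000 {
        conns.push(TcpStream::connect("127.0.0.1:8888")?);
    }
    // Keep a ~14-byte payload bouncing on every connection.
    let payload = b"hello, echo!!!"; // 14 bytes
    let mut reply = [0u8; 14];
    loop {
        for conn in &mut conns {
            conn.write_all(payload)?;
            conn.read_exact(&mut reply)?;
        }
    }
}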
I'm experiencing this on a Windows 10 laptop, simply observed with the Task Manager. As a Linux reference I've used a private root server and a Cloud9 machine; neither of them showed this behaviour.
As soon as I've got some time I'll update my compiler and try to create some simpler code to reproduce this, but currently my time is somewhat limited.
If any questions left, just ask :)
Cheers