vacp2p / nim-libp2p

libp2p implementation in Nim
https://vacp2p.github.io/nim-libp2p/docs/
MIT License

Add timeouts and message size limits to mplex #45

Closed · dryajov closed this 4 years ago

dryajov commented 4 years ago

Currently mplex doesn't support any sort of timeout functionality, so streams would hang indefinitely if the remote dies unexpectedly or is too slow.

Message size limiting is also missing.
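
For illustration, a minimal sketch of the kind of guard we're after, using only chronos primitives (`withTimeout`/`cancel`); `slowRemoteRead` and `guardedRead` are made-up names for this sketch, not actual mplex API:

```nim
import chronos

const readTimeout = 5.seconds

proc slowRemoteRead(): Future[seq[byte]] {.async.} =
  # stands in for a remote that died or is too slow to answer
  await sleepAsync(1.hours)
  return @[byte 1, 2, 3]

proc guardedRead(): Future[seq[byte]] {.async.} =
  let fut = slowRemoteRead()
  # withTimeout completes with `false` when `fut` did not finish in time
  if not await fut.withTimeout(readTimeout):
    fut.cancel()  # don't leave the pending read dangling
    raise newException(CatchableError, "mplex read timed out")
  return fut.read()

when isMainModule:
  try:
    discard waitFor guardedRead()
  except CatchableError as exc:
    echo "read failed: ", exc.msg
```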

sinkingsugar commented 4 years ago

I was looking at the timeout issue and noticed that here: https://github.com/status-im/nim-libp2p/blob/2232ca598e514ab79befac972c9ae0517b1196c3/libp2p/muxers/mplex/mplex.nim#L69

Internally it's actually using the timeout tied to connection here: https://github.com/status-im/nim-libp2p/blob/2232ca598e514ab79befac972c9ae0517b1196c3/libp2p/connection.nim#L58

Although I'm not sure if that's what the TODO means, since it mentions polling. The idea here is that we actually want to block, right? (The code seems to suggest so.) So just not awaiting (and therefore stopping the flow) should be enough here?

sinkingsugar commented 4 years ago

About the limit, we are talking about writing, and specifically the 1 MiB limit, is that correct? https://github.com/libp2p/specs/tree/master/mplex#writing-to-a-stream

dryajov commented 4 years ago

I think that TODO is a bit outdated, so I would not try to make sense of it now. There is a larger issue with timeouts in nim-chronos; there is some info here: https://github.com/status-im/nim-chronos/pull/65. Essentially, we need to implement a more robust solution.

As for the timeouts in connection, I'm actually removing those in https://github.com/status-im/nim-libp2p/pull/71; I think a better approach is to use timeouts sparingly, on an as-needed basis. This did end up uncovering a lot of issues with timers in chronos though, so it was a good test in the end.

Bottom line: I would add timeouts and size limits in mplex itself (which is what this issue is referring to). The 1 MiB size limit seems to be what we need.
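
Something along these lines, perhaps (a rough sketch only; `MaxMsgSize` and `writeMsgChecked` are illustrative names, and the real check would live in the mplex coder rather than on a raw chronos transport):

```nim
import chronos

const MaxMsgSize = 1 shl 20  # 1 MiB, per the mplex spec

proc writeMsgChecked(transp: StreamTransport, data: seq[byte]) {.async.} =
  # refuse to write frames larger than the spec's limit
  if data.len > MaxMsgSize:
    raise newException(CatchableError, "message exceeds mplex size limit")
  discard await transp.write(data)
```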

sinkingsugar commented 4 years ago

Hahah, you hit the right key there... I personally dislike async/await completely (I'm for stackful coroutines), and one of the reasons is exactly that it can spiral out of control in terms of resource usage when cancellation/timeouts are not properly handled. The dislike started when I was using it in C#/.NET, which has a nice and stable implementation...

Nevertheless async/await is there because this is (for now) a GCed language and there is not much that can be done about it :)

nim-chronos needs future cancellation; I thought the point of re-implementing the standard async library was exactly that cancellation was missing...

In conclusion:

sinkingsugar commented 4 years ago

Btw, I noticed that when running testmplex.nim with -d:chronicles_log_level=TRACE there are millions of swallowed exceptions like "Incomplete data received". Is this behavior expected?

arnetheduck commented 4 years ago

incidentally, chronos has future cancellation.. https://github.com/status-im/nim-chronos/pull/41
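
A quick sketch of what that gives you (exact behavior may vary a bit between chronos versions): a pending future can be cancelled, and whoever awaits it sees `CancelledError`:

```nim
import chronos

proc longOp(): Future[int] {.async.} =
  await sleepAsync(1.hours)
  return 42

when isMainModule:
  let fut = longOp()
  fut.cancel()          # request cancellation of the pending future
  try:
    discard waitFor fut
  except CancelledError:
    echo "operation was cancelled"
```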

arnetheduck commented 4 years ago

> Is this behavior expected?

no, this needs auditing, review and refactoring. much of the current nimbus/libp2p code is at the proof-of-concept/prototype stage.

sinkingsugar commented 4 years ago

Ah, nice on the cancellation; the readme.md should be updated 😄

sinkingsugar commented 4 years ago

Sadly, that's not what I meant, which is more something like https://docs.microsoft.com/en-us/dotnet/api/system.threading.cancellationtokensource?redirectedfrom=MSDN&view=netframework-4.8 and https://docs.microsoft.com/en-us/dotnet/api/system.threading.cancellationtoken?view=netframework-4.8, where you can specify a timeout, cancellation can properly cascade using the token, and each source implements its own timeout. But maybe something similar can be cooked up.
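
Something in that direction could probably be approximated on top of chronos, e.g. (a very rough sketch; `CancelToken`, `cancelAfter` and `work` are invented for illustration and are not chronos API):

```nim
import chronos

type CancelToken = ref object
  done: Future[void]   # completes once cancellation has been requested

proc newCancelToken(): CancelToken =
  CancelToken(done: newFuture[void]("cancel.token"))

proc cancel(t: CancelToken) =
  if not t.done.finished():
    t.done.complete()

proc cancelAfter(t: CancelToken, d: Duration) {.async.} =
  # token-level timeout, roughly like CancellationTokenSource with a timeout
  await sleepAsync(d)
  t.cancel()

proc work(t: CancelToken) {.async.} =
  for i in 0 ..< 100:
    if t.done.finished():   # cooperative cancellation check
      echo "work cancelled at step ", i
      return
    await sleepAsync(100.milliseconds)

when isMainModule:
  let token = newCancelToken()
  asyncCheck cancelAfter(token, 350.milliseconds)
  waitFor work(token)
```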

dryajov commented 4 years ago

> Btw, I noticed that when running testmplex.nim with -d:chronicles_log_level=TRACE there are millions of swallowed exceptions like "Incomplete data received". Is this behavior expected?

Yes, they are expected in many situations; this is exactly why they are only shown when tracing is enabled. The reason is that chronos signals disconnected connections with exceptions. Perhaps we need to add more checks for EOF somewhere.

> no, this needs auditing, review and refactoring. much of the current nimbus/libp2p code is at the proof-of-concept/prototype stage.

I would say that this is partially true, but not entirely: some parts could be reworked to be more Nim-idiomatic, but apart from that it's structurally pretty sound.

sinkingsugar commented 4 years ago

@arnetheduck I could successfully emulate the C#-like behavior I wanted for async/await, btw. I'm still unsure how the cancellation works (I had a quick look at it), but it seems fine!

arnetheduck commented 4 years ago

"Incomplete data received" does not sound like a controlled shutdown - either the wrong exception is being raised, the shutdown is being randomly interrupted, or something else is wrong and data loss is possible...

dryajov commented 4 years ago

> "Incomplete data received" does not sound like a controlled shutdown - either the wrong exception is being raised, the shutdown is being randomly interrupted, or something else is wrong and data loss is possible...

This happens when awaiting a socket that closes; it throws an "incomplete data received" exception to signal the close. There really isn't any other way of signaling that with async code; perhaps the exception should have a different message.

cheatfate commented 4 years ago

Just to clarify the "Incomplete data received" exception: it happens because the nim-libp2p code uses readUntil, and if readUntil cannot satisfy the number of bytes requested, it raises IncompleteError.

So, for example, if you use read() here, it will not raise.
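
A small self-contained illustration of the difference against a raw chronos StreamTransport (nim-libp2p wraps these, so the details there differ); readExactly behaves like readUntil in this respect, raising TransportIncompleteError when the peer closes early, while read() simply returns whatever arrived:

```nim
import chronos

proc serve(server: StreamServer, transp: StreamTransport) {.async.} =
  discard await transp.write("abc")   # send 3 bytes, then close
  await transp.closeWait()

proc main() {.async.} =
  let address = initTAddress("127.0.0.1:30786")
  var server = createStreamServer(address, serve, {ReuseAddr})
  server.start()

  var transp = await connect(address)
  var buf = newSeq[byte](10)
  try:
    await transp.readExactly(addr buf[0], 10)   # asks for more than was sent
  except TransportIncompleteError:
    echo "incomplete: peer closed before 10 bytes arrived"
  await transp.closeWait()

  transp = await connect(address)
  let data = await transp.read()                # no exception, returns 3 bytes
  echo "read() returned ", data.len, " bytes"
  await transp.closeWait()

  server.stop()
  await server.closeWait()

waitFor main()
```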

cheatfate commented 4 years ago

Also, I can't figure out why we need timeouts at this stage.

For example, if one node is on Mars and another on Earth, then the timeout would need to be anywhere from 187 seconds to 1342 seconds. So an application which uses libp2p for that kind of communication should handle even such long timeouts properly.
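
For reference, those figures are roughly the one-way light travel time between Earth and Mars at closest and farthest approach: about 0.37 AU (≈ 5.6e7 km) gives 5.6e10 m / 3e8 m/s ≈ 187 s, and about 2.7 AU (≈ 4.0e8 km) gives ≈ 1340 s.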

So, from my point of view, we should insert timeouts only in the connect/handshake phase. All other timeouts should be part of the protocol specification (gossip and so on).

dryajov commented 4 years ago

> So, from my point of view, we should insert timeouts only in the connect/handshake phase. All other timeouts should be part of the protocol specification (gossip and so on).

Agreed. I was trying to emulate some of the behavior in go-libp2p, which uses timeouts in certain places. In reality, I agree that only the handshake in secio would require this behavior; most other timeouts should be left up to the application to set.