don't swallow random exceptions

arnetheduck commented 5 years ago

this hides bugs, like this one:

DBG 2018-12-26 08:48:15-06:00 Exception in rlpxConnect                   topics="rlpx" thread=0 err="Future still in progress.\nAsync traceback:\n  cligen.nim(797)           status_chat\n  cligen.nim(552)           dispatchstatusListener\n  status_chat.nim(112)      statusListener\n  asyncloop.nim(238)        poll\n  asyncmacro2.nim(36)       handshakeImpl_continue\n  rlpx.nim(1158)            handshakeImplIter\n  asyncfutures2.nim(330)    read\n  #[\n    cligen.nim(797)           status_chat\n    cligen.nim(552)           dispatchstatusListener\n    status_chat.nim(112)      statusListener\n    asyncloop.nim(238)        poll\n    asyncmacro2.nim(36)       WhisperHandshake_continue\n    whisper_protocol.nim(695) WhisperHandshakeIter\n    asyncfutures2.nim(325)    read\n  ]#\n  #[\n    cligen.nim(797)           status_chat\n    cligen.nim(552)           dispatchstatusListener\n    status_chat.nim(112)      statusListener\n    asyncloop.nim(238)        poll\n    asyncmacro2.nim(36)       postHelloSteps_continue\n    rlpx.nim(1245)            postHelloStepsIter\n    asyncfutures2.nim(325)    read\n  ]#\nException message: Future still in progress.\nException type:" exc=ValueError remote=Node[54.238.158.169:30303]

stefantalpalaru commented 5 years ago

But will it bail out instead of reconnecting and recovering? Find out by running a dozen or so iterations of this:

N=0; while true; do echo $N; rm -rf ~/.cache/nimbus/db; time ./build/nimbus; sleep 1 || break; N=$[$N + 1]; done

arnetheduck commented 5 years ago

as discussed - we cannot allow random exceptions to be swallowed as their effect on the overall state is unknown (similar to a signal handler in unix, these exceptions can and will happen at any point in a nim program, including spots where they weren't anticipated by the author) - this counts for overflows, asserts and other similar errors - they indicate bugs in the application - like this particular instance - the root cause is a race condition or some other serious issue, and it needs urgent attention.

zah commented 5 years ago

There is a new exception type in Nim called CatchableError which is supposed to be the base type of all recoverable errors. Putting a blanket handler with this exception type may be reasonable sometimes, but I agree that in this particular situation it's probably better to stick to the list of specific expected failures such as PeerDisconnected and RlpError. Non-recoverable errors should always fail fast.

kdeme commented 5 years ago

I agree here, as long as the error and thus cause/effect is not known, error should be raised. And in debug it probably should also assert (so it is clearly noticed).

I actually had a similar issue before in the rlpxAccept code, where the exception would only be logged and it could be missed if not paying attention to all the logs. I noticed because I was specifically debugging that problem (AlreadyConnected peer).

stefantalpalaru commented 5 years ago

How is anyone supposed to work on this, if Nimbus keeps throwing random exceptions instead of logging them as recoverable errors?

arnetheduck commented 5 years ago

@stefantalpalaru there are a few critical things to take heed of here:

we presently have no users - that the application crashes halfway because of unfinished code is quite ok - this is a luxury we should enjoy and exploit by using the time to make sure our foundation is solid
it is not helpful for the goals of the application to avoid a crash here and let the application continue running with hidden deep flaws - we're working on the wrong things when that happens - those deeper issues will be more costly to resolve the further down the road we are
we're writing a peer-to-peer application - unlike a local or a pure client application, it will have connections to remote nodes that can affect what the application is doing - it's not youtube-dl that will reattempt a download but a node in a network with thousands of instances running and communicating at the same time
when the application is released, people with strong interests will be motivated to break it
the network layer is the first line of defense - it needs to be solid and not automatically do unexpected or unplanned-for things.
for example, being able to induce a reconnect remotely is a classic form of starting a DDoS attack - if you find a flaw that makes thousands of nodes reconnect, it's very cheap to cause trouble

thus, long story short, the way to work and make actual progress on the application is to focus on the root cause for the issue and make sure it's addressed properly - and make deliberate decisions about expected and unexpected behavior, falling back on something that's guaranteed to not cause further damage once an unaccounted for situation is encountered - in this case, show a diagnostic and shut down.

status-im / nim-eth-p2p

don't swallow random exceptions #56