status-im / nim-eth-p2p

Nim Ethereum P2P protocol implementation
Apache License 2.0
11 stars 4 forks source link

don't swallow random exceptions #56

Open arnetheduck opened 5 years ago

arnetheduck commented 5 years ago

this hides bugs, like this one:

DBG 2018-12-26 08:48:15-06:00 Exception in rlpxConnect                   topics="rlpx" thread=0 err="Future still in progress.\nAsync traceback:\n  cligen.nim(797)           status_chat\n  cligen.nim(552)           dispatchstatusListener\n  status_chat.nim(112)      statusListener\n  asyncloop.nim(238)        poll\n  asyncmacro2.nim(36)       handshakeImpl_continue\n  rlpx.nim(1158)            handshakeImplIter\n  asyncfutures2.nim(330)    read\n  #[\n    cligen.nim(797)           status_chat\n    cligen.nim(552)           dispatchstatusListener\n    status_chat.nim(112)      statusListener\n    asyncloop.nim(238)        poll\n    asyncmacro2.nim(36)       WhisperHandshake_continue\n    whisper_protocol.nim(695) WhisperHandshakeIter\n    asyncfutures2.nim(325)    read\n  ]#\n  #[\n    cligen.nim(797)           status_chat\n    cligen.nim(552)           dispatchstatusListener\n    status_chat.nim(112)      statusListener\n    asyncloop.nim(238)        poll\n    asyncmacro2.nim(36)       postHelloSteps_continue\n    rlpx.nim(1245)            postHelloStepsIter\n    asyncfutures2.nim(325)    read\n  ]#\nException message: Future still in progress.\nException type:" exc=ValueError remote=Node[54.238.158.169:30303]
stefantalpalaru commented 5 years ago

But will it bail out instead of reconnecting and recovering? Find out by running a dozen or so iterations of this:

N=0; while true; do echo $N; rm -rf ~/.cache/nimbus/db; time ./build/nimbus; sleep 1 || break; N=$[$N + 1]; done
arnetheduck commented 5 years ago

as discussed - we cannot allow random exceptions to be swallowed as their effect on the overall state is unknown (similar to a signal handler in unix, these exceptions can and will happen at any point in a nim program, including spots where they weren't anticipated by the author) - this counts for overflows, asserts and other similar errors - they indicate bugs in the application - like this particular instance - the root cause is a race condition or some other serious issue, and it needs urgent attention.

zah commented 5 years ago

There is a new exception type in Nim called CatchableError which is supposed to be the base type of all recoverable errors. Putting a blanket handler with this exception type may be reasonable sometimes, but I agree that in this particular situation it's probably better to stick to the list of specific expected failures such as PeerDisconnected and RlpError. Non-recoverable errors should always fail fast.

kdeme commented 5 years ago

I agree here, as long as the error and thus cause/effect is not known, error should be raised. And in debug it probably should also assert (so it is clearly noticed).

I actually had a similar issue before in the rlpxAccept code, where the exception would only be logged and it could be missed if not paying attention to all the logs. I noticed because I was specifically debugging that problem (AlreadyConnected peer).

stefantalpalaru commented 5 years ago

How is anyone supposed to work on this, if Nimbus keeps throwing random exceptions instead of logging them as recoverable errors?

arnetheduck commented 5 years ago

@stefantalpalaru there are a few critical things to take heed of here:

thus, long story short, the way to work and make actual progress on the application is to focus on the root cause for the issue and make sure it's addressed properly - and make deliberate decisions about expected and unexpected behavior, falling back on something that's guaranteed to not cause further damage once an unaccounted for situation is encountered - in this case, show a diagnostic and shut down.