Closed mratsim closed 3 years ago
With more detailed log (i.e. without libbacktrace) (note that the bug happened in a later block)
DBG 2020-08-17 17:02:57.726+02:00 Exception in poll() topics="beacnde" tid=142024 file=beacon_node.nim:1001 exc=LPStreamEOFError err="Stream EOF!"
ERR 2020-08-17 17:02:57.737+02:00 Transport getMessage topics="discv5" tid=142024 file=protocol.nim:434 exception=TransportOsError msg="(11) Resource temporarily unavailable"
peers: 57 ❯ finalized: de5d5a0c:2137 ❯ head: a508cfc6:2140:11 ❯ time: 2944:6 (94214) :
/home/beta/Programming/Status/nim-beacon-chain/vendor/nim-testutils/testutils/moduletests.nim(21) beacon_node
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/beacon_node.nim(1355) main
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/beacon_node.nim(1049) start
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/beacon_node.nim(999) run
/home/beta/Programming/Status/nim-beacon-chain/vendor/nim-chronos/chronos/asyncloop.nim(343) poll
/home/beta/Programming/Status/nim-beacon-chain/vendor/nim-chronos/chronos/transports/stream.nim(1329) readStreamLoop
/home/beta/Programming/Status/nim-beacon-chain/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: index 4096 not in 0 .. 4095 [IndexError]
The root cause here is an exception escaping to the poll loop, corrupting the internal chronos state.
Yes, I agree. We see this because nim-libp2p leaks exceptions via asyncCheck calls. I'm going to remove all nim-libp2p asyncCheck calls very soon.
What other ways than asyncCheck could this happen? Some other form of callback? Timers?
Also experiencing this I think.
Traceback (most recent call last, using override)
/root/nim-beacon-chain/beacon_chain/mainchain_monitor.nim(544) main
/root/nim-beacon-chain/beacon_chain/mainchain_monitor.nim(537) NimMain
/root/nim-beacon-chain/beacon_chain/beacon_node.nim(1354) main
/root/nim-beacon-chain/beacon_chain/beacon_node.nim(1048) start
/root/nim-beacon-chain/vendor/nim-chronicles/chronicles.nim(329) run
/root/nim-beacon-chain/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(407) reportUnhandledError
/root/nim-beacon-chain/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(358) reportUnhandledErrorAux
Error: unhandled exception: index 4096 not in 0 .. 4095 [IndexError]
asyncCheck is just sugar around a Future callback, so removing it will not fix the core of the issue. asyncCheck should simply not leak the exception; it should make sure that errors are properly communicated, probably in the form of a stack trace to stderr or similar.
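One possible reading of that suggestion, as a minimal sketch only: Nim's standard std/asyncdispatch stands in for chronos here, and asyncReport and failing are hypothetical names, not part of either library. The callback writes the error to stderr instead of re-raising it into the poll loop.

```nim
import std/asyncdispatch

# Hypothetical helper (not a chronos or stdlib API): like asyncCheck, but the
# callback reports the failure on stderr instead of re-raising it.
proc asyncReport[T](future: Future[T]) =
  doAssert(not future.isNil, "Future is nil")
  proc cb() =
    if future.failed:
      stderr.writeLine("unhandled async error: " & future.error.msg)
  future.callback = cb

# Hypothetical failing operation, used only to exercise the helper.
proc failing(): Future[void] {.async.} =
  raise newException(ValueError, "boom")

let fut = failing()
asyncReport(fut)
# Drive the event loop so the callback runs; nothing escapes from poll().
waitFor sleepAsync(10)
```

The trade-off is the one discussed in this thread: the error stays visible, but the caller still never gets a chance to handle it.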
This is it:
proc asyncCheck*[T](future: Future[T]) =
  ## Sets a callback on ``future`` which raises an exception if the future
  ## finished with an error.
  ##
  ## This should be used instead of ``discard`` to discard void futures.
  doAssert(not isNil(future), "Future is nil")
  proc cb(data: pointer) =
    if future.failed() or future.cancelled():
      when defined(chronosStackTrace):
        injectStacktrace(future)
      raise future.error # RERAISING THE EXCEPTION
  future.callback = cb
The alternative is to just discard the future, but that has the downside that the error will be swallowed. asyncCheck is also better than discarding because it is more explicit and communicates intent properly.
I suggest checking chronos: https://github.com/status-im/nim-chronos/blob/master/chronos/transports/stream.nim#L1327-L1330 At least that's where it happens...
asyncCheck has a particular behaviour in chronos. What's wrong is that it's used in contexts where exceptions get raised, and it should not be used in those contexts at all. In the case of libp2p, that means the majority of contexts, because cancellation exceptions are leaking up the stack all over the place, as well as other exceptions occasionally. Saying that asyncCheck must be changed is like saying that a screwdriver is not a good hammer and should be re-invented.
In libp2p, the proper thing would be to handle the errors locally inside functions like send. That will also solve other issues, including those places where send might raise in the middle of a loop even though, from a logical point of view, the loop should continue (for example when sending the same thing to multiple peers).
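A minimal sketch of that idea, assuming Nim's standard std/asyncdispatch as a stand-in for chronos. The send and broadcast procs below are hypothetical illustrations, not the nim-libp2p API. Each failure is caught next to the call that produced it, so one bad peer does not abort the loop and nothing reaches the poll loop.

```nim
import std/asyncdispatch

# Hypothetical stand-in for a libp2p-style send; it fails for one peer.
proc send(peer: string): Future[void] {.async.} =
  if peer == "bad":
    raise newException(IOError, "connection reset")

# Sends to every peer and handles errors locally, so the loop continues
# after a failure. Returns the peers that could not be reached.
proc broadcast(peers: seq[string]): Future[seq[string]] {.async.} =
  var failed: seq[string]
  for peer in peers:
    try:
      await send(peer)
    except IOError:
      failed.add peer
  return failed

let failed = waitFor broadcast(@["a", "bad", "c"])
echo "unreachable peers: ", failed
```

Because the error never leaves broadcast, the caller can decide what to do with the list of failed peers rather than relying on asyncCheck to surface the problem.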
This indeed seems to have been worked around by https://github.com/status-im/nim-libp2p/pull/338
I definitely still have this issue. Also, I don't see exactly how asyncCheck can "corrupt chronos state". It's a defect, and the failure is still in https://github.com/status-im/nim-chronos/blob/master/chronos/transports/stream.nim#L1327-L1330
The libp2p call site was https://github.com/status-im/nim-libp2p/blob/d3182c4dba5cf0eaa11a1c41f065ea3c3b436995/libp2p/stream/chronosstream.nim#L62 iirc; we were looking at it with @dryajov. It does not happen on Windows so far. I had it on Linux.
Didn't happen for me while syncing over the weekend.
Is it just me, or does this happen only at the very beginning of the instance? I had it 3 times in a row when restarting (not very recently though), then restarted.
Update: I just had it right now a couple of times, latest devel.
I also had it a couple times while syncing
I believe this has been fixed:
Still a problem: https://github.com/status-im/nimbus-eth2/issues/1957
That's a different issue though; his log clearly says too many files open.
When syncing as of 17ca72cf5528e9c00eb7a6ffda2cd9188c678d8f