Sync regression: index 4096 not in 0 .. 4095

mratsim commented 4 years ago

When syncing as of 17ca72cf5528e9c00eb7a6ffda2cd9188c678d8f

BG 2020-08-17 16:52:26.067+02:00 Updating peer's status information         topics="beacnde" tid=137799 file=sync_manager.nim:706 wall_clock_slot=94176 remote_head_slot=93296 local_head_slot=67898 peer=16Uiu2HAmR5TadXQ9pfGFVjiM252vmo43eXL3gbyXB5Cch4cdJtna peer_score=100 peer_speed=1247.0
DBG 2020-08-17 16:52:26.067+02:00 Synchronization loop waiting for workers   topics="beacnde" tid=137799 file=sync_manager.nim:965 wall_head_slot=94176 local_head_slot=67898 workers_count=10 sleep_time=6m24s0ns
DBG 2020-08-17 16:52:26.072+02:00 Block received                             topics="beacnde" tid=137799 file=beacon_node.nim:344 signedBlock="(slot: 67901, proposer_index: 2262, parent_root: \"0fd39850\", state_root: \"b91f548d\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 109, deposits_len: 0, voluntary_exits_len: 0)" blockRoot=f858db26 pcs=receive_block
INF 2020-08-17 16:52:26.238+02:00 Block resolved                             topics="clearance" tid=137799 file=clearance.nim:88 blck="(slot: 67901, proposer_index: 2262, parent_root: \"0fd39850\", state_root: \"b91f548d\", proposer_slashings_len: 0, attester_slashings_len: 0, attestations_len: 109, deposits_len: 0, voluntary_exits_len: 0)" blockRoot=f858db26 heads=1 pcs=block_resolution
DBG 2020-08-17 16:52:26.241+02:00 Fork choice requested                      topics="fork_choice" tid=137799 file=fork_choice.nim:300 justified_epoch=2120 justified_root=0cfbaed0 finalized_epoch=2119 fork_choice_head=f858db26
INF 2020-08-17 16:52:26.255+02:00 Updated head block                         topics="chaindag" tid=137799 file=chain_dag.nim:683 stateRoot=b91f548d headBlock=f858db26:67901 stateSlot=67901 justified="(epoch: 2120, root: \"0cfbaed0\")" finalized="(epoch: 2119, root: \"401b1970\")" newHead=f858db26:67901 pcs=fork_choice
DBG 2020-08-17 16:52:26.255+02:00 Block got imported successfully            topics="beacnde" tid=137799 file=beacon_node.nim:660 local_head_slot=67901 store_speed=5.4697 block_root=f858db26 block_slot=67901 store_block_duration=167ms426us410ns update_head_duration=15ms399us204ns overall_duration=182ms825us642ns
DBG 2020-08-17 16:52:26.255+02:00 Exception in poll()                        topics="beacnde" tid=137799 file=beacon_node.nim:1001 exc=LPStreamEOFError err="Stream EOF!"
 peers: 55 ❯ finalized: 401b1970:2119 ❯ head: aea5cff8:2121:22 ❯ time: 2942:17 (94161)                                                                                                                                                              :  Traceback (most recent call last, using override)
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/mainchain_monitor.nim(544) main
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/mainchain_monitor.nim(537) NimMain
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/beacon_node.nim(1355) main
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/beacon_node.nim(1049) start
/home/beta/Programming/Status/nim-beacon-chain/vendor/nim-chronicles/chronicles.nim(329) run
/home/beta/Programming/Status/nim-beacon-chain/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(407) reportUnhandledError
/home/beta/Programming/Status/nim-beacon-chain/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(358) reportUnhandledErrorAux
Error: unhandled exception: index 4096 not in 0 .. 4095 [IndexError]

mratsim commented 4 years ago

With more detailed log (i.e. without libbacktrace) (note that the bug happened in a later block)

DBG 2020-08-17 17:02:57.726+02:00 Exception in poll()                        topics="beacnde" tid=142024 file=beacon_node.nim:1001 exc=LPStreamEOFError err="Stream EOF!"
ERR 2020-08-17 17:02:57.737+02:00 Transport getMessage                       topics="discv5" tid=142024 file=protocol.nim:434 exception=TransportOsError msg="(11) Resource temporarily unavailable"
 peers: 57 ❯ finalized: de5d5a0c:2137 ❯ head: a508cfc6:2140:11 ❯ time: 2944:6 (94214)                                                                                                                                                               :  /home/beta/Programming/Status/nim-beacon-chain/vendor/nim-testutils/testutils/moduletests.nim(21) beacon_node
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/beacon_node.nim(1355) main
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/beacon_node.nim(1049) start
/home/beta/Programming/Status/nim-beacon-chain/beacon_chain/beacon_node.nim(999) run
/home/beta/Programming/Status/nim-beacon-chain/vendor/nim-chronos/chronos/asyncloop.nim(343) poll
/home/beta/Programming/Status/nim-beacon-chain/vendor/nim-chronos/chronos/transports/stream.nim(1329) readStreamLoop
/home/beta/Programming/Status/nim-beacon-chain/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: index 4096 not in 0 .. 4095 [IndexError]

arnetheduck commented 4 years ago

root cause here is an exception escaping to the poll loop, corrupting the internal chronos state

cheatfate commented 4 years ago

Yes i agree, we see this because nim-libp2p leaks exceptions via asyncCheck calls, i'm going to remove all nim-libp2p asyncCheck calls very soon.

arnetheduck commented 4 years ago

what other ways than asyncCheck could this happen? some other form of callback? timers?

protolambda commented 4 years ago

Also experiencing this I think.

Traceback (most recent call last, using override)
/root/nim-beacon-chain/beacon_chain/mainchain_monitor.nim(544) main
/root/nim-beacon-chain/beacon_chain/mainchain_monitor.nim(537) NimMain
/root/nim-beacon-chain/beacon_chain/beacon_node.nim(1354) main
/root/nim-beacon-chain/beacon_chain/beacon_node.nim(1048) start
/root/nim-beacon-chain/vendor/nim-chronicles/chronicles.nim(329) run
/root/nim-beacon-chain/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(407) reportUnhandledError
/root/nim-beacon-chain/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(358) reportUnhandledErrorAux
Error: unhandled exception: index 4096 not in 0 .. 4095 [IndexError]

dryajov commented 4 years ago

asyncCheck is just sugar around a Future callback, removing it will not fix the core of the issues, asyncCheck should simply not leak the exception, it should make sure that the errors are properly communicated, probably in the form of a stacktrace to stderr or similar.

This is it:

proc asyncCheck*[T](future: Future[T]) =
  ## Sets a callback on ``future`` which raises an exception if the future
  ## finished with an error.
  ##
  ## This should be used instead of ``discard`` to discard void futures.
  doAssert(not isNil(future), "Future is nil")
  proc cb(data: pointer) =
    if future.failed() or future.cancelled():
      when defined(chronosStackTrace):
        injectStacktrace(future)
      raise future.error # RERAISING THE EXCEPTION
  future.callback = cb

The alternative is to just discard the future, but that has the downside that the error will be swallowed, asyncCheck is also better than discarding because it is more explicit and communicates intent properly.

sinkingsugar commented 4 years ago

I suggest checking chronos: https://github.com/status-im/nim-chronos/blob/master/chronos/transports/stream.nim#L1327-L1330 At least that's where it happens...

arnetheduck commented 4 years ago

asyncCheck has a particular behaviour in chronos - what's wrong is that it's used in contexts where exceptions get raised - it should not be used in those contexts at all - in the case of libp2p, that means the majority of contexts because cancellation exceptions are leaking up the stack all over the place, as well as other exceptions occasionally. saying that asyncCheck must be changed is like saying that a screwdriver is not a good hammer, and it should be re-invented.

In libp2p, the proper thing would be to handle the errors locally inside functions like send - that will also solve other issues including those places where send might raise in the middle of the loop even though from a logical point of view, the loop should continue (for example when sending the same thing to multiple peers)

arnetheduck commented 4 years ago

This seems to indeed have been worked around by by https://github.com/status-im/nim-libp2p/pull/338

sinkingsugar commented 4 years ago

I still have definitely this issue, also I don't connect exactly how asyncCheck can "corrupt chronos state", it's a defect, failure is still in https://github.com/status-im/nim-chronos/blob/master/chronos/transports/stream.nim#L1327-L1330

sinkingsugar commented 4 years ago

The callsite libp2p site was https://github.com/status-im/nim-libp2p/blob/d3182c4dba5cf0eaa11a1c41f065ea3c3b436995/libp2p/stream/chronosstream.nim#L62 iirc, we were looking at it with @dryajov , does not happen on windows so far. had it on linux.

mratsim commented 4 years ago

Didn't happen for me while syncing over the weekend.

sinkingsugar commented 4 years ago

is it me or this happens only at the very beginning of the instance? I had it 3 times in row when restarting (not very recently tho), then restarted.

sinkingsugar commented 4 years ago

Update: I just had it right now a couple of times, latest devel.