status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide

TransportOsError on `make witti` #1123

Closed corpetty closed 3 years ago

corpetty commented 4 years ago

After about 4-5 hours of uptime, I get the following ad nauseam:

DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
DBG 2020-06-04 20:44:18+00:00 Exception in poll()                        topics="beacnde" tid=18786 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError

Unfortunately, I could not see where it started, as I was AFK and only came back to it afterwards.
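To catch the point where descriptor usage starts climbing, the open-descriptor count of the process can be sampled periodically. The following is a minimal sketch, assuming Linux (it reads `/proc/<pid>/fd`); the helper names `count_open_fds` and `fd_headroom` are hypothetical and not part of nimbus-eth2 or chronos.

```python
import os
import resource


def count_open_fds(pid="self"):
    """Return the number of open file descriptors of a process (Linux only).

    When inspecting the current process, the count includes the temporary
    descriptor used to list the directory, so compare readings taken the
    same way rather than treating the value as exact.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_headroom():
    """Return (open_fds, soft_limit) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return count_open_fds(), soft


if __name__ == "__main__":
    used, limit = fd_headroom()
    print(f"open fds: {used} / soft limit: {limit}")
```

Passing the PID of a running node to `count_open_fds` from a cron job or a shell loop gives a rough timeline; a count that only ever grows is consistent with a leak.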

cheatfate commented 4 years ago

This error means we are leaking sockets, but there are really two issues here:

  1. This exception should not be raised by poll().
  2. Eliminate leaks.

stefantalpalaru commented 4 years ago

> This exception should not be raised by poll().

It's either that or reaching the maximum number of open file descriptors stays hidden. Which is better?

cheatfate commented 4 years ago

@stefantalpalaru No, it will be raised by the procedure that creates the socket; for example, connect() should return this error, but definitely not poll().

cheatfate commented 4 years ago

So all exceptions raised by poll() are bugs.
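As an illustration of where the operating system reports this error, here is a minimal sketch in plain Python (not chronos, and not the code path in beacon_node.nim). It assumes a POSIX system, and the function name `exhaust_sockets` is hypothetical. It lowers the soft descriptor limit, creates sockets until the kernel refuses, and returns the errno seen at the call that tried to create the descriptor.

```python
import errno
import resource
import socket


def exhaust_sockets(limit=64):
    """Open sockets under a lowered soft RLIMIT_NOFILE until creation fails.

    Returns the errno of the OSError raised by the failing socket() call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never raise the limit here, only lower it (the soft limit is always <= hard).
    lowered = min(limit, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (lowered, hard))
    held = []
    try:
        while True:
            # Each call needs a fresh descriptor; the kernel rejects the
            # call once the per-process limit is reached.
            held.append(socket.socket(socket.AF_INET, socket.SOCK_DGRAM))
    except OSError as exc:
        return exc.errno
    finally:
        # Release everything and restore the original soft limit.
        for sock in held:
            sock.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    code = exhaust_sockets()
    print(code == errno.EMFILE, errno.errorcode.get(code))
```

The error surfaces at the call that allocates the descriptor, which is why an event-loop `poll()` reporting "(24) Too many open files" points at error handling that lost track of where the failure happened.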

cheatfate commented 4 years ago

CC @sinkingsugar we are leaking FDs. Never mind, it's a chronos problem; I'm working on a fix.

stefantalpalaru commented 4 years ago

Additional segfault, when lowering the open file descriptor limit further:

prlimit -n50 make SCRIPT_PARAMS="--skipGoerliKey" witti

ERR 2020-06-05 15:17:17+02:00 Transport getMessage error                 topics="discv5" tid=23736 file=protocol.nim:413 exception=TransportOsError msg="(11) Resource temporarily unavailable"
 peers: 6 ❯ epoch: 2149, slot: 18/32 (68786) ❯ finalized epoch: 2 (00247c0b)                                                                    ETH: 0 Traceback (most recent call last, using override)
/mnt/sda3/storage/CODE/status/nim-beacon-chain-clean/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(614) signalHandler
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
socket(unix): Too many open files
socket(unix): Too many open files
socket(unix): Too many open files
socket(unix): Too many open files
socket: Too many open files
DBG 2020-06-05 15:17:17+02:00 UPnP                                       topics="nat" tid=23736 file=nat.nim:48 msg="Miniupnpc Socket error"
 peers: 6 ❯ epoch: 2149, slot: 18/32 (68786) ❯ finalized epoch: 2 (00247c0b)                                                                    ETH: 0

stefantalpalaru commented 4 years ago

Same command as above, but with --nat=none added to the beacon_node command line in "scripts/connect_to_testnet.nims", allowed it to live long enough to finalise 25 epochs. It still died with:

DBG 2020-06-05 15:35:10+02:00 Exception in poll()                        topics="beacnde" tid=1981 file=beacon_node.nim:718 err="(24) Too many open files" exc=TransportOsError
ERR 2020-06-05 15:35:10+02:00 Transport getMessage error                 topics="discv5" tid=1981 file=protocol.nim:413 exception=TransportOsError msg="(11) Resource temporarily unavailable"
 peers: 9 ❯ epoch: 2152, slot: 11/32 (68875) ❯ finalized epoch: 25 (1ec0174a)                                                                   ETH: 0 Traceback (most recent call last, using override)
/mnt/sda3/storage/CODE/status/nim-beacon-chain-clean/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(614) signalHandler
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Later edit: the port redirections were probably still there. They were not deleted at the end of the last run because the UPnP client could not open a new socket to the router.
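The `prlimit -n50` reproduction can also be scripted when a test harness needs to start a child process with a low descriptor limit. This is a minimal sketch, assuming a POSIX system; the helper name `run_with_fd_limit` is hypothetical. Like `prlimit -n50`, it sets the soft and hard limits to the same value, here in the child only.

```python
import resource
import subprocess
import sys


def run_with_fd_limit(argv, limit=50):
    """Run argv in a child whose RLIMIT_NOFILE (soft and hard) is `limit`."""

    def lower_limit():
        # Executed in the child between fork() and exec(); the parent's
        # own limits are left untouched.
        _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        value = min(limit, hard)  # an unprivileged process cannot exceed hard
        resource.setrlimit(resource.RLIMIT_NOFILE, (value, value))

    return subprocess.run(
        argv, preexec_fn=lower_limit, capture_output=True, text=True
    )


if __name__ == "__main__":
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))"
    result = run_with_fd_limit([sys.executable, "-c", probe])
    print(result.stdout.strip())
```

Replacing the probe with the real command line (for example the `make ... witti` invocation above) would run it under the same constraint.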

cheatfate commented 4 years ago

The FD leak was introduced by https://github.com/status-im/nim-chronos/commit/d6d0084333b5d6d91b2d710b7c0542d2cd8c4c6f and fixed in https://github.com/status-im/nim-chronos/commit/bedd1ded5edc3bfb6877f7025ca4b21f62492ffe.

So part 2 of this issue is fixed; the fixes for part 1 are pending.

cheatfate commented 4 years ago

@corpetty this issue is not a blocker for you anymore, but I will close it only after I introduce fixes for part 1.

mratsim commented 4 years ago

The first part (no exception in poll) will be handled by https://github.com/status-im/nim-libp2p/pull/384 instead of https://github.com/status-im/nim-libp2p/pull/247

tersec commented 3 years ago

https://github.com/status-im/nim-libp2p/pull/384 was merged, and picked up by nimbus-eth2.