Node has hung again:
2020-06-01 17:32:25 UTC Verifier #0 INFO import Imported #10181340 0x17b7…bfa3 (129 txs, 9.98 Mgas, 656 ms, 29.47 KiB)
2020-06-01 17:32:25 UTC IO Worker #0 INFO import 29/50 peers 131 MiB chain 306 MiB db 0 bytes queue 11 MiB sync RPC: 0 conn, 18 req/s, 248 µs
2020-06-01 17:32:29 UTC Verifier #0 INFO import Imported #10181341 0x4a4f…3a68 (110 txs, 9.97 Mgas, 1064 ms, 29.13 KiB)
2020-06-01 17:32:33 UTC IO Worker #3 INFO snapshot::service Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
2020-06-01 17:32:37 UTC Verifier #0 INFO import Imported #10181342 0x7efd…b3eb (149 txs, 9.98 Mgas, 1909 ms, 30.75 KiB)
2020-06-01 17:32:43 UTC IO Worker #0 INFO snapshot::service Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
2020-06-01 17:32:53 UTC IO Worker #3 INFO snapshot::service Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
I now have an strace of it hanging and a gdb thread dump when it has hung...
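In case others want to grab the same diagnostics, this is roughly how such traces can be captured from the running process. It's only a sketch; it assumes the binary is named openethereum and that strace and gdb are installed:
# attach strace to the live process and log syscalls to a file
$ strace -f -p "$(pidof openethereum)" -o openethereum.strace
# in another shell, dump backtraces for every thread without stopping the node for long
$ gdb -p "$(pidof openethereum)" -batch -ex "thread apply all bt" > openethereum-threads.txt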
To complement @CrispinFlowerday's report:
$ date && tail -1 /var/log/openethereum.log && curl --data '{"method":"eth_blockNumber","params":[],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
Mon Jun 1 17:52:01 UTC 2020
2020-06-01 17:32:53 UTC IO Worker #3 INFO snapshot::service Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
{"jsonrpc":"2.0","result":"0x9b5ade","id":1}
> I now have an strace of it hanging and a gdb thread dump when it has hung...
Oh, that sounds like something that'd be plenty useful. Can you share them?
This issue has been around since 2.7.2 (see https://github.com/openethereum/openethereum/issues/11539 and https://github.com/openethereum/openethereum/issues/11494). I managed to get it working after downgrading back to 2.5. Now, after upgrading the nodes to 3.0, it's back again. I tested it for about 5 days, it looked stable, so I upgraded the whole network. A few days later it started to get stuck.
Is there a way to downgrade again?
I'm experiencing similar issues on the classic chain after upgrading from 2.7.2 to 3.0.1 to tackle #11744. Now it's stuck on a block and won't sync past it.
2020-06-02 09:29:02 UTC IO Worker #3 INFO import Syncing #10500973 0xa98f…4491 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #10501210 40/50 peers 3 MiB chain 514 MiB db 0 bytes queue 7 MiB sync RPC: 0 conn, 0 req/s, 15 µs
(Copied from #11494 since this seems more relevant)
For me, 2.7.2 completely stops syncing every few days. The server still runs and RPC calls are still answered, but it just doesn't get any new blocks, even though it shows 10+ connections. I have full debug logs of a few instances, but they don't seem to say anything interesting.
I think it got better when I enabled --no-periodic-snapshot; it still happens, but now only about once per week. I wrote a script that just kills and restarts the parity node when it's out of date by more than 10 minutes. That works, but it's obviously pretty dumb.
Agreed, we have this same issue: full historical archive node with an 8 TB SSD, etc.
It hangs and has to be killed and restarted. We are running alternate nodes on 2.5.13 that do not have this issue.
Here's my kill script to work around this, as requested by @kcanty:
https://gist.github.com/phiresky/153411abb794ad61eb4ba949236bc6ce
It's written in JS since it has to connect to the node and do some rudimentary checks.
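For anyone who would rather not pull in Node.js, the same idea fits in a few lines of shell. This is only a rough sketch of the approach, not the contents of the gist above; it assumes curl and jq are installed, the RPC endpoint is at localhost:8545, and the node runs as a systemd service named openethereum:
#!/usr/bin/env bash
# Restart the node if its latest block is more than 10 minutes old.
set -euo pipefail

# Fetch the latest block header; the timestamp field is hex-encoded seconds.
ts_hex=$(curl -s --data '{"method":"eth_getBlockByNumber","params":["latest",false],"id":1,"jsonrpc":"2.0"}' \
  -H "Content-Type: application/json" -X POST localhost:8545 | jq -r '.result.timestamp')

block_time=$((ts_hex))   # bash arithmetic converts the 0x... string to decimal
now=$(date +%s)

if (( now - block_time > 600 )); then
  echo "Latest block is $((now - block_time))s old; restarting the node"
  systemctl restart openethereum
fi
Run from cron every couple of minutes it does roughly what the JS version does, minus the extra sanity checks.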
Thanks @phiresky - appreciate it - looks good.
There is definitely something going on since 2.7.2 (database format changes?).
2.7.2 and 3.0.0 either sync too slowly to ever catch up to the tip, or freeze (stop importing blocks) periodically.
It happens on full/fast and archive nodes, both with tracing enabled.
2.5.13 has no issues.
I've been watching this issue and others like it on OpenEthereum, and I'm wondering whether this has any effect on the pending Berlin hard fork. I guess if OpenEthereum nodes can't sync, it doesn't really have a deleterious effect on the chain, but if everyone is forced to upgrade because of the Berlin fork and a good portion of those newly upgraded nodes can't sync, doesn't that lower the security of the system as a whole? Is there any sense of how many OpenEthereum nodes are experiencing this problem? I've not yet upgraded to 2.7.2 because of this issue, but with Berlin approaching, I will have to.
So... my openethereum node is currently hung. I'm happy to do any debugging on it over the next 48 hours that would help the devs track this issue down:
2020-06-10 20:40:10 UTC IO Worker #1 INFO import 30/50 peers 147 MiB chain 306 MiB db 0 bytes queue 13 MiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-10 20:40:12 UTC IO Worker #1 INFO snapshot::service Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:22 UTC IO Worker #3 INFO snapshot::service Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:29 UTC Verifier #0 INFO import Imported #10240161 0x4314…b2db (88 txs, 9.93 Mgas, 2141 ms, 13.82 KiB)
2020-06-10 20:40:31 UTC Verifier #0 INFO import Imported #10240162 0x82ae…6abd (149 txs, 9.95 Mgas, 1859 ms, 39.22 KiB)
<no more logs>
After a first and only restart (it had consumed 21 GB of virtual memory / 5 GB resident since syncing), OpenEthereum/v3.0.1-stable-8ca8089-20200601/ has been working fine for me for 5 days on ETC (currently at 12.5 GB of virtual memory).
Parity-Ethereum/v2.6.8-beta-9bf6ed8-20191231/ works fine on ETH, but it can stop syncing if it runs out of resident memory. That is a very old issue, carried over from earlier versions.
Maybe the latest versions have a memory leak that ends up consuming all available RAM and thus stops syncing.
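If it is a leak, it should show up as steadily growing RSS. A simple way to watch it (assuming a Linux host and that the binary is named openethereum):
# one-shot snapshot of the node's virtual (VSZ) and resident (RSS) memory
$ ps -o pid,vsz,rss,etime,comm -p "$(pidof openethereum)"
# or poll it every 5 minutes
$ watch -n 300 'ps -o pid,vsz,rss,etime -p $(pidof openethereum)'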
This is currently a blocker for us to support the upcoming Berlin fork.
One data point: our stuck openethereum node (I work with Vogelito) did eventually emit more logs:
2020-06-10 20:40:22 UTC IO Worker #3 INFO snapshot::service Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:29 UTC Verifier #0 INFO import Imported #10240161 0x4314…b2db (88 txs, 9.93 Mgas, 2141 ms, 13.82 KiB)
2020-06-10 20:40:31 UTC Verifier #0 INFO import Imported #10240162 0x82ae…6abd (149 txs, 9.95 Mgas, 1859 ms, 39.22 KiB)
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot Took a snapshot at #10205000 of 88364089 accounts
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot produced 4878 state chunks and 112 block chunks.
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot::service Finished taking snapshot at #10205000, in 689772s
I've now restarted it with logging set to debug, to see if that emits anything useful.
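For reference, that roughly means starting the node with the logging level turned up. I'm assuming here that the -l/--logging flag accepts per-module levels as in earlier Parity releases, with module names matching the log targets seen above (sync, snapshot, import):
$ openethereum -l debug
# or, more targeted:
$ openethereum -l sync=debug,snapshot=debug,import=debug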
Our node hung again today, and I have more detailed logging. For reference, the block it is reporting is 0x9db235,
aka 10334773, which has a block time of Jun-25-2020 11:55:17 AM +UTC. Attached are the last 100k rows from our node, which has debug logging enabled across the board.
My observation is just that logging apparently stops from the IO Workers: there is a large gap in the logging before the logs from my block-number queries start. No idea if this is helpful at all.
Cross-referencing potentially related issues:
Looks like #11758.
Last night, our openethereum node stopped importing blocks. A restart brought it back to life, but this is an issue for us if it can't keep up to date.
Some observations:
1) It was using ~60% CPU (according to top)
2) JSON-RPC was still operating
3) The logs showed nothing at all (they just stopped)
4) I captured an strace of the openethereum process, if that is helpful
We have been running the same configuration on parity (I believe we were running 2.5.13 prior to upgrading). Our openethereum node has been alive for about 10 days - it was created from a snapshot of the parity DB.
Its configuration is: