Node has hung again:
2020-06-01 17:32:25 UTC Verifier #0 INFO import Imported #10181340 0x17b7…bfa3 (129 txs, 9.98 Mgas, 656 ms, 29.47 KiB)
2020-06-01 17:32:25 UTC IO Worker #0 INFO import 29/50 peers 131 MiB chain 306 MiB db 0 bytes queue 11 MiB sync RPC: 0 conn, 18 req/s, 248 µs
2020-06-01 17:32:29 UTC Verifier #0 INFO import Imported #10181341 0x4a4f…3a68 (110 txs, 9.97 Mgas, 1064 ms, 29.13 KiB)
2020-06-01 17:32:33 UTC IO Worker #3 INFO snapshot::service Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
2020-06-01 17:32:37 UTC Verifier #0 INFO import Imported #10181342 0x7efd…b3eb (149 txs, 9.98 Mgas, 1909 ms, 30.75 KiB)
2020-06-01 17:32:43 UTC IO Worker #0 INFO snapshot::service Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
2020-06-01 17:32:53 UTC IO Worker #3 INFO snapshot::service Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
I now have an strace of it hanging and a gdb thread dump when it has hung...
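In case others want to grab the same diagnostics, this is roughly how such traces can be captured from the running process. It's only a sketch; it assumes the binary is named openethereum and that strace and gdb are installed:
# attach strace to the live process and log syscalls to a file
$ strace -f -p "$(pidof openethereum)" -o openethereum.strace
# in another shell, dump backtraces for every thread without stopping the node for long
$ gdb -p "$(pidof openethereum)" -batch -ex "thread apply all bt" > openethereum-threads.txt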
To complement @CrispinFlowerday's report:
$ date && tail -1 /var/log/openethereum.log && curl --data '{"method":"eth_blockNumber","params":[],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
Mon Jun 1 17:52:01 UTC 2020
2020-06-01 17:32:53 UTC IO Worker #3 INFO snapshot::service Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
{"jsonrpc":"2.0","result":"0x9b5ade","id":1}
> I now have an strace of it hanging and a gdb thread dump when it has hung...
Oh, that sounds like something that'd be plenty useful. Can you share them?
This issue has been around since 2.7.2 (see https://github.com/openethereum/openethereum/issues/11539 and https://github.com/openethereum/openethereum/issues/11494). I managed to get it working after downgrading back to 2.5. Now, after upgrading the nodes to 3.0, it's back again. I tested it for about 5 days, it looked stable, so I upgraded the whole network. A few days later it started to get stuck.
Is there a way to downgrade again?
I'm experiencing similar issues on the classic chain after upgrading from 2.7.2 to 3.0.1 to tackle #11744. Now it's stuck on a block and won't sync past it.
2020-06-02 09:29:02 UTC IO Worker #3 INFO import Syncing #10500973 0xa98f…4491 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #10501210 40/50 peers 3 MiB chain 514 MiB db 0 bytes queue 7 MiB sync RPC: 0 conn, 0 req/s, 15 µs
(Copied from #11494 since this seems more relevant)
For me, 2.7.2 completely stops syncing every few days. The server still runs and RPC calls are still answered, but it just doesn't get any new blocks, even though it shows 10+ connections. I have full debug logs of a few instances, but they don't seem to say anything interesting.
I think it got better when I enabled --no-periodic-snapshot; it still happens, but now only about once per week. I wrote a script that just kills and restarts the parity node when it's out of date by more than 10 minutes. That works, but it's obviously pretty dumb.
Agreed, we have this same issue: full historical archive node with an 8 TB SSD, etc.
It hangs and has to be killed and restarted. We are running alternate nodes on 2.5.13 that do not have this issue.
Here's my kill script to work around this, as requested by @kcanty:
https://gist.github.com/phiresky/153411abb794ad61eb4ba949236bc6ce
It's written in JS since it has to connect to the node and do some rudimentary checks.
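For anyone who would rather not pull in Node.js, the same idea fits in a few lines of shell. This is only a rough sketch of the approach, not the contents of the gist above; it assumes curl and jq are installed, the RPC endpoint is at localhost:8545, and the node runs as a systemd service named openethereum:
#!/usr/bin/env bash
# Restart the node if its latest block is more than 10 minutes old.
set -euo pipefail

# Fetch the latest block header; the timestamp field is hex-encoded seconds.
ts_hex=$(curl -s --data '{"method":"eth_getBlockByNumber","params":["latest",false],"id":1,"jsonrpc":"2.0"}' \
  -H "Content-Type: application/json" -X POST localhost:8545 | jq -r '.result.timestamp')

block_time=$((ts_hex))   # bash arithmetic converts the 0x... string to decimal
now=$(date +%s)

if (( now - block_time > 600 )); then
  echo "Latest block is $((now - block_time))s old; restarting the node"
  systemctl restart openethereum
fi
Run from cron every couple of minutes it does roughly what the JS version does, minus the extra sanity checks.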
Thanks @phiresky - appreciate it - looks good.
There is definitely something going on since 2.7.2 (database format changes?).
2.7.2 and 3.0.0 either sync too slowly to ever catch up to the tip, or freeze (stop importing blocks) periodically.
It happens on full/fast and archive nodes, both with tracing enabled.
2.5.13 has no issues.
I've been watching this issue and others like it on OpenEthereum, and I'm wondering whether this has any effect on the pending Berlin hard fork. I guess if OpenEthereum nodes can't sync, it doesn't really have a deleterious effect on the chain, but if everyone is forced to upgrade because of the Berlin fork and a good portion of those newly upgraded nodes can't sync, doesn't that lower the security of the system as a whole? Is there any sense of how many OpenEthereum nodes are experiencing this problem? I've not yet upgraded to 2.7.2 because of this issue, but with Berlin approaching, I will have to.
So... my openethereum node is currently hung. I'm happy to do any debugging on it over the next 48 hours that would help the devs track this issue down:
2020-06-10 20:40:10 UTC IO Worker #1 INFO import 30/50 peers 147 MiB chain 306 MiB db 0 bytes queue 13 MiB sync RPC: 0 conn, 0 req/s, 0 µs
2020-06-10 20:40:12 UTC IO Worker #1 INFO snapshot::service Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:22 UTC IO Worker #3 INFO snapshot::service Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:29 UTC Verifier #0 INFO import Imported #10240161 0x4314…b2db (88 txs, 9.93 Mgas, 2141 ms, 13.82 KiB)
2020-06-10 20:40:31 UTC Verifier #0 INFO import Imported #10240162 0x82ae…6abd (149 txs, 9.95 Mgas, 1859 ms, 39.22 KiB)
<no more logs>
After a first and only restart (it had consumed 21 GB of virtual memory / 5 GB resident since syncing), OpenEthereum/v3.0.1-stable-8ca8089-20200601/ has been working fine for me for 5 days on ETC (currently at 12.5 GB of virtual memory).
Parity-Ethereum/v2.6.8-beta-9bf6ed8-20191231/ works fine on ETH, but it can stop syncing if it runs out of resident memory. That is a very old issue, carried over from earlier versions.
Maybe the latest versions have a memory leak that ends up consuming all available RAM and thus stops syncing.
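If it is a leak, it should show up as steadily growing RSS. A simple way to watch it (assuming a Linux host and that the binary is named openethereum):
# one-shot snapshot of the node's virtual (VSZ) and resident (RSS) memory
$ ps -o pid,vsz,rss,etime,comm -p "$(pidof openethereum)"
# or poll it every 5 minutes
$ watch -n 300 'ps -o pid,vsz,rss,etime -p $(pidof openethereum)'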
This is currently a blocker for us to support the upcoming Berlin fork.
One data point: our stuck openethereum node (I work with Vogelito) did eventually emit more logs:
2020-06-10 20:40:22 UTC IO Worker #3 INFO snapshot::service Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:29 UTC Verifier #0 INFO import Imported #10240161 0x4314…b2db (88 txs, 9.93 Mgas, 2141 ms, 13.82 KiB)
2020-06-10 20:40:31 UTC Verifier #0 INFO import Imported #10240162 0x82ae…6abd (149 txs, 9.95 Mgas, 1859 ms, 39.22 KiB)
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot Took a snapshot at #10205000 of 88364089 accounts
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot produced 4878 state chunks and 112 block chunks.
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot::service Finished taking snapshot at #10205000, in 689772s
I've now restarted it with logging set to debug, to see if that emits anything useful.
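For reference, that roughly means starting the node with the logging level turned up. I'm assuming here that the -l/--logging flag accepts per-module levels as in earlier Parity releases, with module names matching the log targets seen above (sync, snapshot, import):
$ openethereum -l debug
# or, more targeted:
$ openethereum -l sync=debug,snapshot=debug,import=debug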
Our node hung again today, and I have more detailed logging. For reference, the block it is reporting is 0x9db235,
aka 10334773, which has a block time of Jun-25-2020 11:55:17 AM +UTC. Attached are the last 100k rows from our node, which has debug logging enabled across the board.
My observation is just that logging apparently stops from the IO Workers: there is a large gap in the logging before the logs from my block-number queries start. No idea if this is helpful at all.
Cross-referencing potentially related issues:
Looks like #11758.
Last night, our openethereum node stopped importing blocks. A restart brought it back to life, but this is an issue for us if it can't keep up to date.
Some observations:
1) It was using ~60% CPU (according to top)
2) JSON-RPC was still operating
3) The logs showed nothing at all (they just stopped)
4) I captured an strace of the openethereum process, if that is helpful
We have been running the same configuration on parity (I believe we were running 2.5.13 prior to upgrading). Our openethereum node has been alive for about 10 days - it was created from a snapshot of the parity DB.
Its configuration is: