input-output-hk / jormungandr

privacy voting blockchain node
https://input-output-hk.github.io/jormungandr/
Apache License 2.0

After running for 4 hours, the block height gets stuck with the error below on macOS Mojave #1470

Open BitcoinMicroPaymentNetwork opened 4 years ago

BitcoinMicroPaymentNetwork commented 4 years ago

Describe the bug

After running for 4 hours, the block height gets stuck on macOS with the error below. This is the complete log from Jormungandr; note the repeated warning toward the end of the log: "WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network".

Log below:

Dec 29 23:03:59.239 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:22:42.183 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:22:54.278 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:35:23.397 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:36:27.574 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:36:33.946 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:37:12.494 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:37:26.159 ERRO cannot process network headers, reason: deadline has elapsed, task: block
Dec 29 23:37:26.224 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:38:15.499 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:38:25.625 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:38:25.625 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:38:25.630 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:38:25.635 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:38:25.635 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:52:52.488 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:52:52.492 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:52:52.496 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:52:52.499 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:52:52.503 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 29 23:53:50.033 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:53:50.861 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:53:55.219 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 29 23:55:57.186 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:14:01.861 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:14:01.868 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:14:01.872 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:47:37.869 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:47:37.878 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:47:37.894 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:47:37.894 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:47:37.900 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 00:59:59.880 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:03:49.431 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 01:03:49.431 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 01:03:49.436 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 01:03:49.439 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 01:03:49.443 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 01:03:49.445 WARN request timed out or failed unexpectedly, error: Error(Elapsed), request: GetHeadersRange, task: client-query
Dec 30 01:05:02.275 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:08:33.994 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:08:39.802 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:10:27.271 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:10:56.963 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:11:11.741 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:11:40.976 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:11:44.304 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:11:48.459 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 30 01:11:50.459 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network

Mandatory Information

Version: 0.8.5

Linicks commented 4 years ago

Also seeing this in my passive node logs:


Dec 31 01:35:10.694 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 31 01:35:10.694 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 31 01:35:12.214 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 31 01:35:12.218 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 31 01:35:12.219 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Dec 31 01:36:16.311 CRIT Task panicked, task: network
Dec 31 01:36:18.587 WARN Node started without path to the stored secret keys (not a stake pool or a BFT leader), task: init

stakehodlers commented 4 years ago

I have the same issue with v0.8.4 on Ubuntu 18.04.3 LTS, installed from the pre-built aarch64 image on GitHub, and running on a Raspberry Pi 4 Model B. (I am quite certain I have also seen the same behavior on a single-core, 4 GB x86 VM.)

After taking a look at various diagnostics, memory use seems to be fine and capped at around 2.7 GB, so I don't think this is a memory leak. Looking at /proc, however, yields some very interesting observations: once the block height stops making progress, the process state transitions to "sleeping". I used gcore to take a dump while in this state, but no symbols are available, otherwise I'd post the stack trace (I'm not sure whether my secrets are part of the core, so I'm not comfortable posting it here). I can post a text file with the output of "thread apply all bt" from gdb, but it's just 30 sets of two-frame stacks showing memory locations. That said, this is consistently reproducible, so it should be easy for someone familiar with the jormungandr design/implementation to dig in with less than a day's wait for the repro.
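
For reference, this is roughly how the dump and backtraces were captured (a sketch only; the PID lookup and output file names are illustrative, not the exact commands from my session):

PID=$(pidof jormungandr)
sudo gcore -o jormungandr-stuck "$PID"                               # core dump for offline inspection
sudo gdb -p "$PID" -batch -ex "thread apply all bt" > lock-bt.txt    # per-thread backtraces from the stalled process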

Smells like a deadlock or livelock in jormungandr, and it seems to be affecting multiple architectures, given that the OP is using macOS. Note that prior to hitting this condition everything was running fine, and the node successfully processed a block overnight.

Please let me know if I can be of further assistance. Thanks!

stakehodlers commented 4 years ago

proc.9264.status.txt oom_scores.txt nodeStats-20191230-1.txt jormungandr-0.8.4-lock-bt.txt

Attached is the output of "cat /proc/<pid>/status" for the jormungandr process, taken after it hit the stalled state.
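
For anyone who wants to collect the same snapshots, something along these lines works (a sketch; the pidof lookup is illustrative):

PID=$(pidof jormungandr)
cat /proc/"$PID"/status          # State:, VmRSS:, Threads:, etc.
# oom_score of every process, sorted, to compare jormungandr against the rest of the system:
for p in /proc/[0-9]*; do
  printf '%s %s\n' "$(cat "$p"/oom_score 2>/dev/null)" "$(cat "$p"/comm 2>/dev/null)"
done | sort -n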

The oom_scores file gives an idea of how lightly loaded the system is (the jormungandr process scores 153, 11th from the bottom; nothing else is close).

The nodeStats file is the output of the jcli node stats command, with the system time (add 5 hours to get UTC) and the block height from pooltool.io appended.
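
The command behind that file is roughly the following (a sketch; the REST listen address is an assumption based on my node config):

jcli rest v0 node stats get --host "http://127.0.0.1:3100/api"
date    # local system time appended alongside the stats (UTC = local time + 5 hours here)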

The jormungandr-0.8.4-lock-bt.txt file is what I was able to get out of gdb. It looks like I may need to set RUST_BACKTRACE, as I saw some better-looking backtraces when poking around in the details of other bugs.

Are the following related to #1470?

stakehodlers commented 4 years ago

After additional investigation, it looks like the main process AND all child threads are generally in the sleeping state, so I am unsure that this observation has tremendous value. If jormungandr is implemented using a reactive microservices architecture, for example, then it may sit in the sleeping state for most of its existence and wake only for the select moments when work is available for an actor to do (a low-probability event from the perspective of the total number of CPU cycles available in a given unit of time). Engineering intuition still makes me think it's a deadlock/livelock situation, since maximum memory use appears to be relatively well bounded.

Waiting for the IOHK Engineering Team to return from their well-deserved holiday break to comment.

BitcoinMicroPaymentNetwork commented 4 years ago

The "block height gets stuck" problem is becoming more frequent; it's happening every couple of hours. It looks to me like a block propagation problem over the network, i.e. a TCP/IP transmission issue. If the node misses the block linking its current tip to a newly received block, it can no longer process any new blocks, and that's the end of the node, so I have to restart it to re-sync and catch up to the current height. Most node operators are hitting this issue, and they restart their node whenever the height is stuck for more than 3 minutes. The networking code may need fixing to make transmission between nodes more reliable. Possibly something can be done in code for reliable transfer over TCP?
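
For what it's worth, the restart-when-stuck workaround operators are using looks roughly like the sketch below (the REST address, the 3-minute interval, and the systemd restart command are assumptions about a typical setup, not anything shipped with jormungandr):

#!/usr/bin/env bash
# Watchdog sketch: restart the node if lastBlockHeight has not moved for ~3 minutes.
REST="http://127.0.0.1:3100/api"   # assumed REST listen address
last=""
while true; do
  height=$(jcli rest v0 node stats get --host "$REST" 2>/dev/null \
           | awk '/lastBlockHeight/ { gsub(/"/, ""); print $2 }')
  if [ -n "$height" ] && [ "$height" = "$last" ]; then
    echo "$(date -u) block height stuck at $height, restarting node"
    systemctl restart jormungandr   # or however the node is supervised
  fi
  last="$height"
  sleep 180
done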

stakehodlers commented 4 years ago

Upgraded to 0.8.5 today. At first the new release seemed more stable and ran for a solid 4 hours before getting stuck. This evening, after processing one block during our scheduled slot-leader duty, the block height got stuck again. A second slot was scheduled for us to lead within a couple of hours of the first, so we needed to reset the node. When the node did not seem to be making progress bootstrapping, the storage directory was cleaned of any SQLite remnants, and the node was rebooted completely to clear out any cruft that may have existed within the Linux kernel on which jormungandr was running. As of the end of this note, we've been in the bootstrapping phase for over 45 minutes. (Oops... forgot to hit "Comment"...)

stakehodlers commented 4 years ago

Wow! Something was seriously hosed tonight. Bootstrap completed in ~1:13... obviously unacceptable. At first I thought perhaps it had something to do with the number of stake pools in existence, and that maybe lots of folks were trying to reboot their nodes simultaneously, putting massive strain on the network. Then I looked at the node stats output... remember: we're in epoch 20...


blockRecvCnt: 5790
lastBlockContentSize: 1228
lastBlockDate: "18.10206"
lastBlockFees: 3200000
lastBlockHash: 7c7ec09a1678e17d5a25116727d53e51a4b2b2cc15ea5fc3919d4a2c82d52bb4
lastBlockHeight: "59863"
lastBlockSum: 1949114988903
lastBlockTime: "2020-01-01T00:53:49+00:00"
lastBlockTx: 4
lastReceivedBlockTime: "2020-01-03T05:11:53+00:00"
state: Running
txRecvCnt: 466
uptime: 9577
version: jormungandr 0.8.5-3db0680

A whole lotta' forkin' going on? That's why pooltool.io had a warning about forking earlier tonight. I'll save the logs, in case anyone wants to take a look at some point...

(Btw, infrastructure specs are: Raspberry Pi 4 Model B (quad-core ARM at ~1.5 GHz/core); 4 GB RAM; 1 Gb/s Ethernet in and out of the facility, with a speed test that corroborates that bandwidth. This all seems to be within spec for running a jormungandr node, but if recent experience from the community suggests more horsepower is beneficial, we can try running on a beefier machine.)

Executee commented 4 years ago

I have the same issue on Ubuntu 19 with jormungandr 0.8.5: I get the same PushHeaders warnings every few minutes, and after 2 hours my node stops syncing. I have also noticed that CPU usage sits constantly at 100% after 2 hours of running; after a restart, CPU usage is low again.

rasmustwh commented 4 years ago

I can also confirm this behavior with Jormungandr 0.8.5 on Ubuntu 19.10. Shutting down the node (gracefully or not) and restarting seems to "fix" it; however, bootstrapping can end up taking several minutes.