virtualeconomy / v-systems

V Systems Reference Full Node

Node Stability #194

Open faddat opened 4 years ago

faddat commented 4 years ago

Platform:

Configuration:

I followed these directions exactly: https://github.com/virtualeconomy/v-systems/wiki/How-to-Install-V-Systems-Mainnet-Node

Problem:

The node stopped syncing at block 6126862, and the systemd log showed that it was handshaking with just one peer, over and over.

Resolution:

I ran: systemctl restart vsys

and the node began to sync again. I also created a teeny tiny sync monitor tool:

while true
do
curl -X GET "http://127.0.0.1:9922/blocks/height" -H "accept: application/json"
sleep 1
done

Users may want to restrict their API to localhost for security reasons, and this allows them to easily monitor sync progress, albeit in a very basic way.
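
A slightly more involved variant is a watchdog that restarts the service when the height stops advancing. This is only a minimal sketch: it assumes the API is bound to 127.0.0.1:9922, the systemd unit is named vsys (as in the wiki instructions), and jq is installed.

#!/bin/bash
# Restart vsys if the reported block height has not advanced in 5 minutes.
last=0
while true
do
  height=$(curl -s "http://127.0.0.1:9922/blocks/height" | jq -r '.height')
  if [ -n "$height" ] && [ "$height" = "$last" ]; then
    echo "height stuck at $height, restarting vsys"
    systemctl restart vsys
  fi
  last=$height
  sleep 300
done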

faddat commented 4 years ago

I haven't been able to reproduce this, so I am closing it because I assume it was a one-off issue.

faddat commented 4 years ago

Reopened due to user report

ghost commented 4 years ago

Running on a VPS with a Ryzen 7 8-core processor, 128 GB of memory, and 2 GB SSD RAID storage. The node didn't seem to start syncing. It was restarted a couple of times, but after a week I tried restarting it again and now it is syncing: http://95.217.121.243:9922/blocks/height

I can only speculate that something goes into a deadlock when the chain has no blocks yet.

Here is the log:

vsys.log.gz

OK, a second issue: the log leaks the private key, around the eighth line from the top.

Icermli commented 4 years ago

@stalker-loki I didn't find any clues in the log that would explain why syncing stopped. I suspect it could sometimes be a problem of poor network connectivity, for example when your machine sits behind a large firewall. We will keep checking for other reasons that may cause this.

I recommend optimizing your network and adding more peers.

For issue 2: if this is a supernode, I suggest you use a cold wallet to receive rewards. The wallet address in your log file is used only for minting; don't keep any balance in it. If you generate a wallet first and then start the node, the private key won't show up in the log.

By the way, cold wallet minting is one of V Systems Chain's advantages. You just fill in the reward address in the config file with a cold wallet address; rewards then go into the cold wallet rather than the minting address, which keeps your funds safe.

ghost commented 4 years ago

That's this one, at the bottom of this section?

  miner {
    enable = yes
    offline = no
    quorum = 1
    generation-delay = 1s
    interval-after-last-block-then-generation-is-allowed = 120h
    tf-like-scheduling = no
    reward-address = "ARNzXkeSq81HbzxKLQ9hsAZUpEtvq6sgwj1"
  }
Icermli commented 4 years ago

Yes, the reward-address here is not necessarily the minting address. It could be any other address, for example a cold wallet address.

faddat commented 4 years ago

I run several VSYS full nodes, some on mainnet, some on testnet. One of them is run by @stalker-loki on my behalf.

They stay up, but stop syncing. Sometimes they reach a fully synced state and run for a while at the chain's current height. Other times, they just stop. All of my VSYS full nodes are in top-tier datacenters, specifically the hetzner.de datacenters in Germany and Finland.

There are other blockchain nodes on those machines.

The other chains run very happily and without interruption.

Unfortunately, VSYS does not run happily and without interruption.

The Ethereum blockchain weighs in at 236 GB; on my Hetzner node, I'm able to sync it in about 12 hours.

VSYS weighs in at ~10 GB, yet a full sync takes 24 hours. Additionally, in my experience VSYS full nodes aren't very stable.

This is the spec of my server.
[screenshot: server specification]

It is in a professionally run datacenter, so it's highly unlikely that there are network issues. I run additional nodes on Hetzner machines, and VSYS is the only one that frequently either stops syncing during initial standup or stops advancing block height after it has already synced.

I've observed VSYS losing sync on machines at my home as well, where I also run nodes for other blockchains, and those do not lose sync.

Today, I was attempting to record a video on one of VSYS' unique concepts, the minting average balance (MAB).

Unfortunately, across several nodes, I was unable to discern whether the node had simply gone down or whether my once-every-one-to-two-seconds API request had crashed it:

 for (( ; ; )); do sleep 1; curl -X GET "http://localhost:9922/addresses/balance/details/ARB1zND1qDuNHyVpX5pCVAZSYghGNZSfvAC" -H "accept: application/json"; done

I was just using that to show the increase in MAB; I think it makes a great visual and conversation starter, since liquid staking is such a hot topic right now. Interestingly, on VSYS we already have liquid staking, no derivatives needed.
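
If all you want on screen is the MAB and the height, a compact variant of that loop works too (a sketch, assuming jq is installed; the endpoint and field names are the ones visible in the response below):

while true
do
  curl -s "http://localhost:9922/addresses/balance/details/ARB1zND1qDuNHyVpX5pCVAZSYghGNZSfvAC" | jq '{height, mintingAverage}'
  sleep 1
done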

Anyhow, I was unable to complete my video, because the node had crashed:

{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}

As you can see, the block height stopped advancing on June 15th at block 12146933. Looking at the transaction history of that node's address, I did not see anything happen around the 15th. The last transactions that went through that node are here:

https://explorer.v.systems/address/ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7

That's the node that I used for the db put tutorial. But my db put transactions were on June 9th, so I'm forced to conclude that this is a case of garden-variety instability.

Next, we can look at another node that I run, this one on testnet 0.3. I wanted to try my MAB queries against it. First, from the Swagger UI in my web browser, I confirmed that it had stopped syncing. Then I logged into my machine at Hetzner and ran:

root@buildbox ~ # systemctl status vsys
● vsys.service - VSYS full node
   Loaded: loaded (/lib/systemd/system/vsys.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-06-11 13:32:23 CEST; 2 weeks 2 days ago
 Main PID: 30025 (java)
    Tasks: 141 (limit: 4915)
   CGroup: /system.slice/vsys.service
           └─30025 java -server -Xms128m -Xmx2g -XX:+UseG1GC -XX:+UseNUMA -XX:+AlwaysPreTouch -XX:+PerfDisableSharedMem -XX:+ParallelRefProcE

Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerConte
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:1
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
Jun 28 13:02:19 buildbox vsys[30025]:         at java.lang.Thread.run(Thread.java:748)
root@buildbox ~ # systemctl restart vsys
root@buildbox ~ # docker run --rm -itd --name vsys -p 8822:9922 -v `pwd`/vsys-chain-data:/opt/coin/data mixhq/vsystems
7d1f65092ac29e55e5ad1d42166835abc67a3e6a43550f962ab37262371b4ba2
root@buildbox ~ # docker ps
CONTAINER ID        IMAGE                 COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
7d1f65092ac2        mixhq/vsystems        "java -jar v-systems…"   4 seconds ago       Up 2 seconds        9921/tcp, 0.0.0.0:8822->9922/tcp                                         vsys
bec164193a0b        condenser_condenser   "docker-entrypoint.s…"   7 days ago          Up 7 days           0.0.0.0:8080->8080/tcp, 0.0.0.0:35729->35729/tcp, 0.0.0.0:80->8080/tcp   condenser
root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 315323
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 316838
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 317646
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 318454
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 319262
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 320070
root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 334497
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 335422
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 336230
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 336832
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 337341
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 337846
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 338310
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 338755
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 339058
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 339370
}root@buildbox ~ # for (( ; ; )); do sleep 2; curl -X GET "http://95.217.196.54^C922/addresses/balance/details/ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7" -H "accept: application/json"; done
root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 361148
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 361884
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 362389

So, after restarting the service, the node began to sync again.

By this point, I'd gotten pretty curious about stability issues. I'd noticed stability problems on three machines.

So, I put up another mainnet node on a Hetzner server, in docker, on port 8822.

While I was writing this issue, it crashed in exactly the manner that this issue describes:

root@buildbox ~ #curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228

API stays up, block height stops advancing.

This server at Hetzner runs a full node for Whaleshares, an application-specific blockchain focused on social media.

[screenshot: Whaleshares node status, 2020-06-28 7:20 PM]

My Whaleshares node never skips a beat, staying in sync with the chain's three-second block time.

Here's my ethereum node, again, same machine:

[screenshot: Ethereum node status, 2020-06-28 7:22 PM]

This has been quite a long issue, but I decided that it was necessary to provide exhaustive evidence that at present there are at least two stability problems with VSYS nodes:

1) Crashes during initial sync, which are usually resolved by restarting the node
2) Crashes after initial sync, which are also usually resolved by restarting the node

I chose to compare with two of the other chains I run nodes for because, unfortunately, my VSYS nodes are the only ones that exhibit this particular issue. It is not restricted to machines running in Hetzner datacenters; it also affects VSYS nodes that I have attempted to run, at various times, on my personal Mac laptop, a home server, and a node on vultr.com.

Log files are available on Slack.

faddat commented 4 years ago

I thought this might be helpful in troubleshooting. The node mentioned above, which is stuck at block 12146933, is showing this:

curl -X GET "http://95.217.121.243:9922/peers/all" -H "accept: application/json"

{
  "peers": [
    {
      "address": "/3.121.94.10:9921",
      "lastSeen": 9223372036854776000
    },
    {
      "address": "/13.52.40.227:9921",
      "lastSeen": 9223372036854776000
    },
    {
      "address": "/13.55.174.115:9921",
      "lastSeen": 9223372036854776000
    },
    {
      "address": "/13.113.98.91:9921",
      "lastSeen": 9223372036854776000
    }
  ]
}

Healthy VSYS mainnet nodes typically have 34 or 35 peers. For comparison, here is the peer list reported by the public wallet node:

root@buildbox ~ # curl -X GET "https://wallet.v.systems/api/peers/all" -H "accept: application/json"
{
  "peers" : [ {
    "address" : "/13.52.96.166:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/35.177.188.74:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/138.197.196.78:9921",
    "lastSeen" : 1593266885446
  }, {
    "address" : "/54.95.22.119:9921",
    "lastSeen" : 1593349688901
  }, {
    "address" : "/3.121.94.10:9921",
    "lastSeen" : 1593349688178
  }, {
    "address" : "/13.115.105.184:9921",
    "lastSeen" : 1593349688909
  }, {
    "address" : "/3.104.62.227:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/34.196.27.234:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/3.17.31.9:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/165.227.64.201:9921",
    "lastSeen" : 1593266908452
  }, {
    "address" : "/52.60.124.131:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/18.191.26.101:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/35.180.246.64:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/45.76.155.8:9921",
    "lastSeen" : 1593349688889
  }, {
    "address" : "/52.35.120.221:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/54.92.10.151:9921",
    "lastSeen" : 1593349688190
  }, {
    "address" : "/13.52.40.227:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/3.16.244.131:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/54.69.23.204:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/3.17.187.179:9921",
    "lastSeen" : 1593349688147
  } ]
}
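
As a quick check, you can count how many peers your own node currently reports from the same /peers/all endpoint (assuming jq is installed):

curl -s "http://127.0.0.1:9922/peers/all" -H "accept: application/json" | jq '.peers | length'
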
ncying commented 4 years ago

This issue may be caused by the connections to known peers, and users may be able to work around it with the following strategies (a combined config sketch follows this list):

  1. Allow more inbound/outbound connections in the conf file (120 or larger):

    network {
      # How many inbound network connections can be made
      max-inbound-connections = 120

      # Number of outbound network connections
      max-outbound-connections = 120
    }

  2. Allow a larger outbound buffer size in the conf file (64M or larger):

    network {
      # Network buffer size
      outbound-buffer-size = 64M
    }

  3. Run the jar directly (rather than the .deb service; the stability issue may be caused by the service logic). In this case, I used

    java -jar v-systems-v***.jar vsys.conf

    on more than 10 machines last week, and all of them have synced well until now. So I suspect the issue may be in the .deb service (it may need some extra network-related rights/resources).

  4. About the sync speed: sadly, it also took me 12 hours to sync the whole database. The reasons are:

     a. Block mint speed. Comparing heights, ETH has ~10M blocks while V Systems already has 12M, and sync speed is related to the number of blocks, even though each ETH block may record more data than a V Systems block. Such comparisons are of limited value anyway; syncing the Bitcoin network with the core wallet takes days or weeks. Sync speed also depends on the peers you are connected to and on how the network connections are designed.

     b. In order to let cheaper machines sync the whole database, we require fewer CPUs and less memory, which reduces some of the node's performance.

     c. A possible solution: we could provide database snapshots so node operators can download a copy and start the node from some height, but we still suggest syncing the node from height 0.
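
For reference, a rough sketch of strategies 1 and 2 combined into a single network section (same keys and values as above; merge them into the existing network block of vsys.conf rather than adding a second one):

    network {
      # Strategy 1: raise the connection limits
      max-inbound-connections = 120
      max-outbound-connections = 120

      # Strategy 2: raise the outbound buffer size
      outbound-buffer-size = 64M
    }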

In conclusion, most of these stability issues (for services in general, not only the V node) are caused by too few system resources being allocated. If users run other services on the same machine with higher resource allocations, services with lower allocations may starve. To avoid this, one can explicitly allocate more resources to the service. For example, if you run the V node with java directly, you can give it more memory with

java -Xmx4096m -jar ***.jar **.conf

and allocate more threads with

java -Dscala.concurrent.context.maxExtraThreads=1024 -jar ***.jar **.conf
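
These flags can also be combined in a single invocation (the jar and conf names are placeholders, as above):

java -Xmx4096m -Dscala.concurrent.context.maxExtraThreads=1024 -jar ***.jar **.conf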

For now, I will not tag this issue as a bug.

faddat commented 4 years ago

If there is an issue in the .deb file that we ship to users that causes instability and downtime, it's a problem.

I mean, we can triage and label this however we like, but we are shipping the .deb to users, so it is very important that it works properly or is not shipped at all.

faddat commented 3 years ago

This issue is #230 I imagine.

Unfortunately, #230 has not resolved it yet.

When syncing, nodes still connect to an ancient node and then stop advancing their block height.

faddat commented 3 years ago

My node stopped advancing again.

[screenshot: block height no longer advancing]