Update Local net to run Spacemesh 0.2 full nodes

avive commented 3 years ago

Running a local net with public testnet 128 nodes (go-sm tag/v0.1.29) and the default localnet config results in a stuck verified layer as early as layer 13 or 15. Almost every network I tried running got stuck before epoch 5. We need to find a good set of configs that balances consensus liveness with a great dev experience, e.g. short epochs and layers, and a short wait until a normal 3rd epoch.
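
To make the trade-off concrete, here is a rough sketch of how layer duration and layers per epoch determine the wait until a given epoch starts. The function name, the parameter names, and the candidate values are assumptions for illustration only; they are not the actual localnet defaults, and the sketch assumes epochs are numbered from 0 with a fixed layer duration.

```python
def seconds_until_epoch(epoch: int, layers_per_epoch: int, layer_duration_sec: int) -> int:
    """Wall-clock seconds from genesis until the first layer of `epoch` begins.

    Assumes epochs are numbered from 0 and every layer lasts the same time.
    """
    if epoch < 0 or layers_per_epoch <= 0 or layer_duration_sec <= 0:
        raise ValueError("epoch must be >= 0; layers_per_epoch and layer_duration_sec must be > 0")
    return epoch * layers_per_epoch * layer_duration_sec


if __name__ == "__main__":
    # Hypothetical candidate settings: (layers per epoch, layer duration in seconds).
    for layers, duration in [(3, 60), (6, 60), (6, 120)]:
        wait = seconds_until_epoch(2, layers, duration)
        print(f"{layers} layers/epoch x {duration}s: epoch 2 starts after {wait / 60:.0f} min")
```

Shorter layers and epochs cut the wait, but they also leave less time per layer for consensus to complete, which is the liveness side of the balance.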

avive commented 3 years ago

@moshababo

countvonzero commented 3 years ago

here is my first round of investigation:

$ spacemesh-local-testnet create --old-api-exists=true

try #1

10 nodes were able to continue running to epoch 9. 2 nodes are not synced and stuck at layer 11. 2 nodes are synced to 54/55, but validated layer stayed at layer 11. 6 nodes are synced and validated up to 54/55.

i sent a tx and it was successful. i took a cursory look at the log to see why the validated layer hadn't caught up. i noticed that when tortoise returns oldPbase == 0, the code fails to get the layer for layer 0 and validation fails.

try #2

10 nodes were able to continue running to epoch 8. 1 node killed itself due to time drift. 1 node is not synced and stuck at layer 11. 2 nodes are synced to 48, but validated layer stayed at layer 11. 6 nodes are synced to 48 and validated layer stayed at layer 15.
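
a tally like the ones in these two tries can be produced mechanically once each node's layers are collected. this is only a minimal sketch: the `NodeStatus` type, the `classify` helper and the `max_lag` threshold are hypothetical, the sample data mirrors try #2, and the verified layer of the unsynced node is assumed to be 11 since it wasn't recorded.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    synced_layer: int
    verified_layer: int


def classify(node: NodeStatus, tip: int, max_lag: int = 2) -> str:
    """Bucket a node by how far it trails the network tip (hypothetical threshold)."""
    if tip - node.synced_layer > max_lag:
        return "not synced"
    if tip - node.verified_layer > max_lag:
        return "synced, verified layer stuck"
    return "healthy"


if __name__ == "__main__":
    # Sample data mirroring try #2; the node that exited on time drift is omitted.
    nodes = (
        [NodeStatus(11, 11)]        # not synced (verified layer assumed)
        + [NodeStatus(48, 11)] * 2  # synced, verified layer stayed at 11
        + [NodeStatus(48, 15)] * 6  # synced, verified layer stayed at 15
    )
    tip = 48
    for label, count in Counter(classify(n, tip) for n in nodes).items():
        print(f"{count} node(s): {label}")
```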

there is definitely an issue with tortoise validation, but it seems the network is advancing. i'm also trying not to spend too much time debugging the tortoise since the code has been completely rewritten.

@avive can you verify that this behavior is acceptable? thanks

avive commented 3 years ago

> @avive can you verify that this behavior is acceptable? thanks

I'm not sure. I believe a reasonable developer-experience expectation is for all nodes on a local devnet to stay healthy for at least ~12 hours before the network needs to be restarted. Having nodes that get stuck makes the network seem broken even if the verified layer is not getting stuck.

Are you 100% sure that the verified layer is not getting stuck when some nodes are no longer able to sync? When the verified layer gets stuck, txs can be sent but they will never get confirmed, so the most basic use case of a Spacemesh network doesn't work. You state above for try #1 that transactions were successful and that the verified layer got stuck...

I'm not 100% sure who on the team determined that localnet should have 10 nodes and set up its config. For example, are 10 healthy nodes needed for Hare to work on this network, or fewer? @narayanprusty - perhaps you can share some info about this? Specifically, how were the config params decided on, and by whom on the team? @noamnelke was it you?

What makes the dev experience so poor when verified layer gets stuck is that the localnet needs to be restarted and it takes 45 minutes until transactions and rewards can be tested. Ideally, we'll be able to start a localnet from epoch 2 by default. Once we have this feature, it is less of an issue if a network gets stuck because it is easier to restart it and start hacking against it again...

I'm also not sure if it makes sense to invest more time in trying to fix the localnet with current public testnet node builds as we are getting close to release 0.2. Once we do, the localnet should run 0.2 nodes. @moshababo - maybe a better plan is to try to stabilize a 0.2-nodes localnet?

countvonzero commented 3 years ago

i think my tx is confirmed. see below. i do agree that it's pretty bad that some nodes get stuck in verified layer and some even not synced with just 10 nodes. but my vote would be bringing the testnet/localnet to the latest version before more dev effort is invested.

$ account info
> Local alias: kimmy00
> Address: 0xc1084C991eF1263C17f114569090D9c1b7BE85D7
> Balance: 0 Smidge
> Nonce: 0
> Projected Balance: 0 Smidge
> Projected Nonce: 0
> Projected state includes all pending transactions that haven't been added to the mesh yet.
> Public key: 0xdc79a559d41533e8967daadac1084c991ef1263c17f114569090d9c1b7be85d7
> Private key: 0xe4a67ffbe0b70a500cfe3862e6390956c5e2196327e5d5cac2b0e954116c3211dc79a559d41533e8967daadac1084c991ef1263c17f114569090d9c1b7be85d7
.........
.........
$ account info
> Local alias: kimmy00
> Address: 0xc1084C991eF1263C17f114569090D9c1b7BE85D7
> Balance: 10 Smidge
> Nonce: 0
> Projected Balance: 10 Smidge
> Projected Nonce: 0
> Projected state includes all pending transactions that haven't been added to the mesh yet.
> Public key: 0xdc79a559d41533e8967daadac1084c991ef1263c17f114569090d9c1b7be85d7
> Private key: 0xe4a67ffbe0b70a500cfe3862e6390956c5e2196327e5d5cac2b0e954116c3211dc79a559d41533e8967daadac1084c991ef1263c17f114569090d9c1b7be85d7

avive commented 3 years ago

I think it is best if we revisit localnet and get it working with the v0.2 node release candidate once the long run has passed and not before.

countvonzero commented 3 years ago

thanks @avive

do you think we should close this issue and open a new one that uses v0.2?

avive commented 3 years ago

I would leave it open - as it can be dealt with as soon as 0.2 long-run on a devnet passes.

avive commented 3 years ago

I think that you can now try creating a localnet from the go-sm develop branch codebase, which has most of the 0.2 code, and see if you can stabilize it.

The config file should be quite similar to the devnet configuration that is being worked on (see https://storage.googleapis.com/spacecraft-data/devnet-archive/config.json) but there might be some changes needed to it.

I believe that the key is to first find a good set of config params that has a chance of not failing consensus, so that if consensus breaks we know it is due to a node bug and not a bad config. We had many devnets and localnets previously which failed after a few epochs, and it turned out that the config params were incompatible with each other.

@moshababo

avive commented 3 years ago

@narayanprusty - getting an error running spacemesh-local-testnet create --elk true. No error w/o elk specified:

Starting ELK
Creating network "elk_elk" with driver "bridge"
Creating elk_elasticsearch_1 ... 
Creating elk_elasticsearch_1 ... error

ERROR: for elk_elasticsearch_1  Cannot create container for service elasticsearch: invalid mount config for type "bind": bind source path does not exist: /opt/homebrew/lib/node_modules/spacemesh-local-testnet/src/elk/elasticsearch/config/elasticsearch.yml

ERROR: for elasticsearch  Cannot create container for service elasticsearch: invalid mount config for type "bind": bind source path does not exist: /opt/homebrew/lib/node_modules/spacemesh-local-testnet/src/elk/elasticsearch/config/elasticsearch.yml
Encountered errors while bringing up the project.
Error: undefined

platform: macOS/M1 (docker for ARM) - this might be the cause although docker for arm supports amd64 images.
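
The message itself says the host-side bind source is missing, so one quick check before blaming the platform is whether that file exists in the installed package. A minimal sketch; the `missing_bind_sources` helper is hypothetical and the path is simply copied from the error above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_bind_sources(paths: Iterable[str]) -> List[str]:
    """Return the host paths that do not exist and would fail as bind-mount sources."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    # Path copied from the error message; adjust to your own install prefix.
    candidates = [
        "/opt/homebrew/lib/node_modules/spacemesh-local-testnet/src/elk/elasticsearch/config/elasticsearch.yml",
    ]
    for path in missing_bind_sources(candidates):
        print(f"missing bind source: {path}")
```

If the file is reported missing, the problem is likely in how the package was published or installed rather than in the image architecture.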

avive commented 3 years ago

I'm getting odd status from 0.2 localnet nodes. How can the current layer be less than the synced and verified layers? Synced and verified layers should always be less than or equal to the current layer (basically the tip of the mesh).

$ status node
Node info:
Version: v0.0.0-unreleased
Build: 23d84097f01092caa8453a4a294ec2742f97f78c
API server: localhost:6003 (GRPC API 1.1). >> Insecure Connection. Use only with a local trusted server <<
Synced: false
Synced layer: 11
Current layer: 6
Verified layer: 11
Peers: 9
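
The expectation above can be checked automatically against this kind of output. A minimal sketch, assuming the field labels stay exactly as printed here; `parse_layers` and `invariant_violations` are hypothetical helpers, not part of any Spacemesh tool.

```python
import re

# Layer fields copied from the status output above.
STATUS = """\
Synced: false
Synced layer: 11
Current layer: 6
Verified layer: 11
"""


def parse_layers(text: str) -> dict:
    """Extract the '<Name> layer: <n>' fields into {'synced': n, 'current': n, 'verified': n}."""
    matches = re.findall(r"^(Synced|Current|Verified) layer:\s*(\d+)", text, re.MULTILINE)
    return {name.lower(): int(value) for name, value in matches}


def invariant_violations(layers: dict) -> list:
    """List every case where the synced or verified layer exceeds the current layer."""
    current = layers["current"]
    return [
        f"{name} layer {layers[name]} > current layer {current}"
        for name in ("synced", "verified")
        if layers[name] > current
    ]


if __name__ == "__main__":
    for problem in invariant_violations(parse_layers(STATUS)):
        print(problem)
```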

avive commented 3 years ago

I can't get the managed nodes to sync. Synced and verified layers are both stuck at 11 on my localnet. I'm using the default script settings w/o any previous docker node or poet images. I've tried on both macOS/amd64 and macOS/arm/m1.

spacemeshos / local-testnet

Update Local net to run Spacemesh 0.2 full nodes #43