Closed avive closed 3 years ago
@moshababo
here is my first round of investigation: $ spacemesh-local-testnet create --old-api-exists=true
try #1
10 nodes were able to continue running to epoch 9. 2 nodes are not synced and stuck at layer 11. 2 nodes are synced to 54/55, but validated layer stayed at layer 11. 6 nodes are synced and validated up to 54/55.
i sent a tx and it was successful. i took a cursory look at the log to see why the validated layer hadn't caught up. i noticed that when tortoise returns oldPbase == 0, the code fails to get the layer for layer 0 and validation fails.
try #2
10 nodes were able to continue running to epoch 8. 1 node killed itself due to time drift. 1 node is not synced and stuck at layer 11. 2 nodes are synced to 48, but validated layer stayed at layer 11. 6 nodes are synced to 48 and validated layer stayed at layer 15.
there is definitely an issue with tortoise validation, but it seems the network is advancing. i'm also trying not to spend too much time debugging the tortoise since the code is completely rewritten.
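The buckets used in the two tries above (not synced, synced but validated layer stuck, healthy) can be sketched as a small classifier. This is only an illustration: `classify_node` and `max_lag` are hypothetical names, not part of the node or of spacemesh-local-testnet, and the layer numbers would have to be read from each node's status by hand.

```python
def classify_node(synced_layer, verified_layer, tip_layer, max_lag=1):
    """Bucket a node by how far its synced/verified layers trail the tip.

    max_lag is an assumed tolerance (in layers) for "caught up".
    """
    if tip_layer - synced_layer > max_lag:
        return "not synced"
    if tip_layer - verified_layer > max_lag:
        return "synced, verified layer stuck"
    return "healthy"


if __name__ == "__main__":
    # Layer numbers taken from try #1, with 55 as the network tip.
    # The verified layer of the unsynced nodes is not reported above,
    # so the 11 passed for it here is a placeholder.
    print(classify_node(11, 11, 55))
    print(classify_node(54, 11, 55))
    print(classify_node(54, 54, 55))
```

With these inputs the three calls fall into the three buckets in order, which matches how the nodes are tallied in try #1.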
@avive can you verify that this behavior is acceptable? thanks
I'm not sure - I believe that a reasonable developer experience expectation is for all nodes on a local devnet to be healthy for at least, say, ~12 hours before it needs to be restarted. Having nodes that get stuck makes the network seem broken even if the verified layer is not getting stuck.
Are you 100% sure that the verified layer is not getting stuck when some nodes are no longer able to sync? When the verified layer gets stuck, txs can be sent but they will never get confirmed, so the most basic use case of a Spacemesh network doesn't work. You state above for try #1 that transactions were successful and that the verified layer got stuck...
I'm not 100% sure who on the team determined that localnet should have 10 nodes and set up its config. For example - are 10 healthy nodes needed for Hare to work on this network, or fewer? @narayanprusty - perhaps you can share some info about this? Specifically, how were the config params decided on, and by whom on the team? @noamnelke was it you?
What makes the dev experience so poor when verified layer gets stuck is that the localnet needs to be restarted and it takes 45 minutes until transactions and rewards can be tested. Ideally, we'll be able to start a localnet from epoch 2 by default. Once we have this feature, it is less of an issue if a network gets stuck because it is easier to restart it and start hacking against it again...
I'm also not sure if it makes sense to invest more time in trying to fix the localnet with current public testnet node builds as we are getting close to release 0.2. Once we do, the localnet should run 0.2 nodes. @moshababo - maybe a better plan is to try to stabilize a 0.2-nodes localnet?
i think my tx is confirmed. see below. i do agree that it's pretty bad that some nodes get stuck in verified layer and some even not synced with just 10 nodes. but my vote would be bringing the testnet/localnet to the latest version before more dev effort is invested.
$ account info
> Local alias: kimmy00
> Address: 0xc1084C991eF1263C17f114569090D9c1b7BE85D7
> Balance: 0 Smidge
> Nonce: 0
> Projected Balance: 0 Smidge
> Projected Nonce: 0
> Projected state includes all pending transactions that haven't been added to the mesh yet.
> Public key: 0xdc79a559d41533e8967daadac1084c991ef1263c17f114569090d9c1b7be85d7
> Private key: 0xe4a67ffbe0b70a500cfe3862e6390956c5e2196327e5d5cac2b0e954116c3211dc79a559d41533e8967daadac1084c991ef1263c17f114569090d9c1b7be85d7
.........
.........
$ account info
> Local alias: kimmy00
> Address: 0xc1084C991eF1263C17f114569090D9c1b7BE85D7
> Balance: 10 Smidge
> Nonce: 0
> Projected Balance: 10 Smidge
> Projected Nonce: 0
> Projected state includes all pending transactions that haven't been added to the mesh yet.
> Public key: 0xdc79a559d41533e8967daadac1084c991ef1263c17f114569090d9c1b7be85d7
> Private key: 0xe4a67ffbe0b70a500cfe3862e6390956c5e2196327e5d5cac2b0e954116c3211dc79a559d41533e8967daadac1084c991ef1263c17f114569090d9c1b7be85d7
I think it is best if we revisit localnet and get it working with the v0.2 node release candidate once the long run has passed and not before.
thanks @avive
do you think we should close this issue and open a new one that uses v0.2?
I would leave it open - as it can be dealt with as soon as 0.2 long-run on a devnet passes.
I think that you can now try creating a localnet from the go-sm develop branch codebase, which has most of the 0.2 code, and see if you can stabilize it.
The config file should be quite similar to the devnet configuration that is being worked on (see https://storage.googleapis.com/spacecraft-data/devnet-archive/config.json) but there might be some changes needed to it.
I believe that the key is to first find a good set of config params that has a chance of not failing consensus, so that if consensus breaks we know it is due to a node bug and not to a bad config. We previously had many devnets and localnets which failed after a few epochs, and it turned out that the config params were incompatible with each other.
@moshababo
@narayanprusty - getting an error running spacemesh-local-testnet create --elk true. No error w/o elk specified:
Starting ELK
Creating network "elk_elk" with driver "bridge"
Creating elk_elasticsearch_1 ...
Creating elk_elasticsearch_1 ... error
ERROR: for elk_elasticsearch_1 Cannot create container for service elasticsearch: invalid mount config for type "bind": bind source path does not exist: /opt/homebrew/lib/node_modules/spacemesh-local-testnet/src/elk/elasticsearch/config/elasticsearch.yml
ERROR: for elasticsearch Cannot create container for service elasticsearch: invalid mount config for type "bind": bind source path does not exist: /opt/homebrew/lib/node_modules/spacemesh-local-testnet/src/elk/elasticsearch/config/elasticsearch.yml
Encountered errors while bringing up the project.
Error: undefined
platform: macOS/M1 (docker for ARM) - this might be the cause although docker for arm supports amd64 images.
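One way to narrow this down is to check whether the bind source named in the error actually exists on the host. If it is missing, that would point at the installed package contents rather than at the ARM image. A minimal sketch, where `missing_bind_sources` is a hypothetical helper and the path is copied from the error message above:

```python
from pathlib import Path


def missing_bind_sources(paths):
    """Return the host paths from `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    # Path copied verbatim from the docker-compose error above.
    sources = [
        "/opt/homebrew/lib/node_modules/spacemesh-local-testnet"
        "/src/elk/elasticsearch/config/elasticsearch.yml",
    ]
    for path in missing_bind_sources(sources):
        print(f"missing bind source: {path}")
```

If the script prints nothing, the file is present on the host and the cause is more likely on the Docker side.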
I'm getting odd status from 0.2 localnet nodes. How can the current layer be less than the synced and verified layers? Synced and verified layers should always be less than or equal to the current layer (basically the tip of the mesh).
$ status node
Node info:
Version: v0.0.0-unreleased
Build: 23d84097f01092caa8453a4a294ec2742f97f78c
API server: localhost:6003 (GRPC API 1.1). >> Insecure Connection. Use only with a local trusted server <<
Synced: false
Synced layer: 11
Current layer: 6
Verified layer: 11
Peers: 9
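The expectation stated above (synced and verified layers never ahead of the current layer) can be checked mechanically against pasted status output. This is a sketch only: it assumes the plain-text field labels shown in the output above, and `parse_layers` and `layer_violations` are hypothetical helpers, not part of the CLI.

```python
import re


def parse_layers(status_text):
    """Extract the three layer counters from pasted `status node` output."""
    layers = {}
    for key in ("Synced layer", "Current layer", "Verified layer"):
        match = re.search(rf"^\s*{key}:\s*(\d+)\s*$", status_text, re.MULTILINE)
        if match:
            layers[key] = int(match.group(1))
    return layers


def layer_violations(layers):
    """Return one message per counter that is ahead of the current layer."""
    current = layers["Current layer"]
    problems = []
    for key in ("Synced layer", "Verified layer"):
        if layers[key] > current:
            problems.append(f"{key} ({layers[key]}) > Current layer ({current})")
    return problems


# Relevant lines copied from the status output above.
SAMPLE = """\
Synced: false
Synced layer: 11
Current layer: 6
Verified layer: 11
Peers: 9
"""

if __name__ == "__main__":
    for problem in layer_violations(parse_layers(SAMPLE)):
        print(problem)
```

For the output above, both the synced and the verified layer are flagged.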
I can't get the managed nodes to sync. Synced and verified layers both stuck at 11 on my localnet. Using default script setting w/o any previous docker node or poet images. I've tried both on macOS/amd64 and on macOS/arm/m1.
Running a local net with public testnet 128 node (go-sm tag/v0.1.29 node) and default localnet config will result in a stuck verified layer as soon as layer 13 or 15. Almost every network i tried running got stuck before epoch 5. We need to find a good set of configs that balance between consensus liveness and having a great dev experience, e.g. short epoch times and layers and a short time to wait until a normal 3rd epoch.
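To make the dev-experience side of that trade-off concrete, the wait until a given epoch is just epoch index × layers per epoch × layer duration. A rough sketch follows. It assumes epochs are numbered from 0 and that all layers have the same duration; the parameter names and the numbers are illustrative, not the localnet defaults.

```python
def seconds_until_epoch(epoch, layers_per_epoch, layer_duration_sec):
    """Wall-clock seconds from genesis to the first layer of `epoch`.

    Assumes epochs are numbered from 0 and every layer lasts the same time.
    """
    return epoch * layers_per_epoch * layer_duration_sec


if __name__ == "__main__":
    # Hypothetical values, NOT the actual localnet config.
    for layer_sec in (30, 60, 120):
        wait_min = seconds_until_epoch(
            2, layers_per_epoch=10, layer_duration_sec=layer_sec
        ) / 60
        print(f"layer={layer_sec}s -> 3rd epoch starts after {wait_min:.0f} min")
```

Shortening either parameter cuts the wait linearly, and whether consensus stays live at the shorter values is exactly what would need to be tested.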