phahulin opened 6 years ago
Hi @phahulin, I started working on this task. I'm almost done with the tests "check if any validator nodes are missing rounds" and "periodically send a series of txs to check that all validator nodes are able to mine non-empty blocks" (code is here). But I don't understand the last point: could you please explain what reorgs mean in the "check for reorgs" test?
Hi, @Natalya11444
By reorgs I mean forks similar to https://etherscan.io/blocks_forked - events when a node has to rewrite its recent history because it received blocks from a "longer" chain. This one may be tricky to implement and needs some experimenting.
One way is to monitor the Parity logs directly for messages about reorgs:
2018-05-18 20:02:52 Reorg to #1088 0x9478…1f84 (0x191f…700c #1087 0x83b9…e4ca )
(just an example taken from my local setup, not from a real network)
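The log-watching approach could be sketched roughly like this. The pattern is based on the example line above; the exact log format may differ between Parity versions, so treat it as an assumption:

```javascript
// Sketch: detect Parity reorg messages when tailing the logs.
// The regex follows the example log line; real formats may vary (assumption).
const REORG_RE = /Reorg to #(\d+)/;

// Returns the new head block number if the line reports a reorg, else null.
function parseReorgLine(line) {
  const m = REORG_RE.exec(line);
  return m ? Number(m[1]) : null;
}

module.exports = { parseReorgLine };
```

In practice you would feed this lines from the Parity log (e.g. via `readline` over a `tail -f` child process) and raise an alert whenever it returns a block number.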
Another way is to keep the hashes of the last N blocks (say N = 20) in memory and periodically recheck them to see if any of them have changed.
A third option, which I haven't tested myself so I can't be sure it actually works: use the https://wiki.parity.io/JSONRPC-Eth-Pub-Sub-Module.html functionality and subscribe to the newHeads event.
Maybe there's another way too.
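The last-N-hashes approach could be sketched like this. This is a sketch only: N and all names are assumptions, and the detection logic is kept pure so any RPC client (web3, raw JSON-RPC) can back the `getBlock` callback:

```javascript
// Sketch of the "recheck the last N block hashes" reorg check.
const N = 20; // size of the sliding window of recent blocks (assumption)

// known: Map of blockNumber -> previously seen hash (mutated in place).
// latest: current chain head number. getBlock(n) -> { number, hash } or null.
// Returns the list of block numbers whose hash changed, i.e. reorged blocks.
function detectReorgs(known, latest, getBlock) {
  const reorged = [];
  for (let n = Math.max(0, latest - N + 1); n <= latest; n++) {
    const block = getBlock(n);
    if (!block) continue;
    const prev = known.get(n);
    if (prev !== undefined && prev !== block.hash) reorged.push(n);
    known.set(n, block.hash); // remember the latest hash we saw
  }
  for (const n of known.keys()) {
    if (n < latest - N + 1) known.delete(n); // drop blocks outside the window
  }
  return reorged;
}

module.exports = { detectReorgs, N };
```

A cron-style runner would call `detectReorgs` every poll interval with the current head and alert whenever the returned list is non-empty.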
You can simulate reorgs on your local setup by using a simplified network with two validators, similar to this one.
At step 4 you start two Parity nodes with different validators. If you let them run for some time, you'll see they're building blocks in parallel, since the two nodes don't yet know about each other:
Validator1 history: 12:00:00 Block1 --> 12:00:10 Block2 --> 12:00:20 Block3 --> ...
Validator2 history: 12:00:05 Block1 --> 12:00:15 Block2 --> 12:00:25 Block 3 --> ...
Then when you call ./mate.sh, their enodes are exchanged, and one of them will switch to the other one's history. At this moment you'll see a Reorg event in the logs.
Thank you for the answer! I'll follow the guide after the other tests and will let you know once I finish.
I deployed monitoring for tests 1-3 on the server, please check how it works, then I can change it if needed.
The monitor runs on cron (every 30 minutes for now). It calls the web server and sends messages with the latest failed tests for each network to the Slack channel.
I used a test channel; here is what the messages look like:
https://1drv.ms/u/s!Au_4rxfmZk63grpvqjnQggqEVik38g
The web server returns test results as JSON. For the Sokol network, http://poatest.westus.cloudapp.azure.com:3000/sokol/api/failed?lastseconds=3600 will return the failed tests for the last hour; "lastseconds" is an optional parameter, and without it all results from the database are returned. http://poatest.westus.cloudapp.azure.com:3000/sokol/api/all?lastseconds=3600 returns both passed and failed test results.
For the Core network it's similar: http://poatest.westus.cloudapp.azure.com:3000/core/api/failed
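The optional lastseconds filtering could be sketched as a small pure helper that the endpoint applies to rows read from the database. Names here are illustrative, not the actual code of the repository:

```javascript
// Illustrative sketch of the optional `lastseconds` query parameter.
// `rows` are test-result records with a `timestamp` in ms since epoch.
function filterByLastSeconds(rows, lastseconds, nowMs = Date.now()) {
  if (lastseconds === undefined) return rows;        // no parameter: everything
  const cutoff = nowMs - Number(lastseconds) * 1000; // start of the window
  return rows.filter(r => r.timestamp >= cutoff);
}

module.exports = { filterByLastSeconds };
```

An Express-style handler could then respond with something like `res.json(filterByLastSeconds(failedRows, req.query.lastseconds))`; in practice the same cutoff could equally be pushed into the SQL query itself.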
The tests also run via cron, and each test is in a separate file. They use command-line arguments to detect which network to check; if no arguments are given, parameters from the toml file are used. Tests save their results to the sqlite database. The test with txs runs on Sokol only, because I don't have an account with real POA yet.
Two Parity nodes, one for each network, run on the same server (they use different ports).
Here is what I plan to add:
Repository is here with some more information in the README.
@Natalya11444 thanks! I'll check it out and let you know
OK, just note that I can't use the server now; it will be available on June 9th or 10th.
Hey @Natalya11444 I'm going through the code and it looks great so far, thank you for your work! Would you mind if I open issues/PRs in your repository with some suggestions?
@phahulin, thank you for checking it out, I'm glad you liked it! Sure, please add suggestions in the repository.
I've added a UI for test results, with search and filters: http://poatest.westus.cloudapp.azure.com:3001 ; the repository is here. The remaining tests are also implemented, and I added a timeout and a check for duplicate cron job executions. Tests that send txs are not running on Core.
For now, tests can fail when some validators miss rounds. If that lasts too long, the tests that send txs fail as well, because those validators don't include the txs in blocks within a few rounds. And when the validators come back, a reorg can happen.
@Natalya11444 the UI looks great, thank you. I'll try to deploy the scripts and UI on our server.
@phahulin cool, I'll add more deployment instructions to the readme. Please let me know if there are any issues.
I updated the bash scripts that run the tests; they were quite bulky. They can then be added to cron.
Title
Abstract
A system should be developed to check the network's health state from the Ethereum point of view.
Rationale
While it is possible to set up a monitoring system to check the health of individual nodes of the network, it is also important to perform Ethereum-specific and consensus-specific health checks on the network as a whole.
Specification
A group of periodically running tests should be set up on both sokol and core. Tests should be separated into individual modules/files and run independently on a schedule. It should be possible to set an individual schedule for each test. Tests should include:
check if any validator nodes are missing rounds
periodically send a series of txs to check that all validator nodes are able to mine non-empty blocks
check for reorgs
In case any of the tests fail, a notification should be sent to the dev team.
Tests should be protected from starting a new run if the previous run has not completed yet. Tests should be required to have a timeout and be killed if they don't complete within a certain time.
Test results should be saved for later analysis to a database.
Implementation
Setup a new server on each network, deploy a full parity node. Run tests locally on cron. An account with some small amount of POA will be required to run tests with txs. Save test results to sqlite database. Deploy a simple node.js web app with a single api endpoint to retrieve latest test results from the database. Setup a monitor on this api endpoint, send alerts to slack channel.