Network health checks and monitoring

phahulin commented 6 years ago

Title

  Title: Network health checks and monitoring
  Layer: Service

Abstract

A system should be developed to check the health of the network from an Ethereum point of view.

Rationale

While it is possible to set up a monitoring system to check the health of individual nodes of the network, it is also important to perform ethereum-specific and consensus-specific health checks on the network as a whole.

Specification

A group of periodically running tests should be set up on both Sokol and Core. Tests should be separated into individual modules/files and run independently on a schedule. It should be possible to set an individual schedule for each test.

Tests should include:

check if any validator nodes are missing rounds
check if payout script works properly for all nodes (check mining address balance)
periodically send a series of txs to check that all validator nodes are able to mine non-empty blocks
periodically send txs via public rpc endpoint
check for reorgs
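
As an illustration of the first check, here is a minimal sketch that flags validators which authored no block in a recent window. It assumes the list of validators is known and that the author of each recent block has already been fetched (for example from the `miner` field of each block); the function name, the `rounds` parameter and the window size are hypothetical choices, not part of this proposal.

```python
from collections import Counter


def validators_missing_rounds(validators, recent_authors, rounds=3):
    """Return validators that authored no block in the recent window.

    validators: list of validator addresses.
    recent_authors: block author addresses, oldest first.
    rounds: hypothetical window size, in full passes over the validator set.
    """
    window = recent_authors[-rounds * len(validators):]
    # Compare addresses case-insensitively, since hex casing may differ.
    seen = Counter(author.lower() for author in window)
    return [v for v in validators if seen[v.lower()] == 0]
```

A non-empty return value would count as a failed test. A stricter variant could compare each block against the validator expected for that step instead of only counting authors.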

If any of the tests fails, a notification should be sent to the dev team.

A test should not start a new run if its previous run has not completed yet. Each test should have an enforced timeout and be killed if it doesn't complete within a certain time.
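
One possible way to get both guarantees is a small wrapper that takes a non-blocking file lock and runs the test as a subprocess with a timeout. This is only a sketch under assumptions: it relies on `fcntl`, so it is Unix-only, and the function name, lock path and return values are made up for illustration.

```python
import fcntl
import subprocess


def run_exclusive(cmd, lock_path, timeout_s):
    """Run cmd unless a previous run still holds the lock.

    Returns "skipped", "timeout", "passed" or "failed" (hypothetical labels).
    """
    lock_file = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "skipped"
        try:
            # subprocess.run kills the child when the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return "timeout"
        return "passed" if result.returncode == 0 else "failed"
    finally:
        # Closing the file releases the lock.
        lock_file.close()
```

The lock is released automatically if the wrapper itself crashes, which avoids the stale-lock problem of a plain pid file.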

Test results should be saved for later analysis to a database.

Implementation

Set up a new server on each network and deploy a full parity node. Run tests locally on cron. An account with a small amount of POA will be required to run the tests that send txs. Save test results to an sqlite database. Deploy a simple node.js web app with a single API endpoint to retrieve the latest test results from the database. Set up a monitor on this API endpoint and send alerts to a Slack channel.
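
To make the storage part concrete, here is a rough sketch of saving results to sqlite and reading back the latest result of each test. The table layout and all names (`results`, `save_result`, `latest_results`) are assumptions for illustration; the proposal does not prescribe a schema, and the sketch uses Python only to keep the example short.

```python
import sqlite3
import time


def open_db(path):
    """Open the database and create the (hypothetical) results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "test TEXT NOT NULL, network TEXT NOT NULL, "
        "passed INTEGER NOT NULL, details TEXT, ts REAL NOT NULL)"
    )
    return conn


def save_result(conn, test, network, passed, details="", ts=None):
    """Store one test run; ts defaults to the current time."""
    conn.execute(
        "INSERT INTO results (test, network, passed, details, ts) "
        "VALUES (?, ?, ?, ?, ?)",
        (test, network, int(passed), details, time.time() if ts is None else ts),
    )
    conn.commit()


def latest_results(conn, network):
    """Return the most recent row of each test for one network."""
    rows = conn.execute(
        "SELECT test, passed, details, ts FROM results "
        "WHERE id IN (SELECT MAX(id) FROM results WHERE network = ? GROUP BY test) "
        "ORDER BY test",
        (network,),
    ).fetchall()
    return [{"test": t, "passed": bool(p), "details": d, "ts": ts}
            for t, p, d, ts in rows]
```

The API endpoint would then only need to serialize the output of `latest_results` as JSON.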

natlg commented 6 years ago

Hi @phahulin, I started working on this task. I'm almost done with the tests "check if any validator nodes are missing rounds" and "periodically send a series of txs to check that all validator nodes are able to mine non-empty blocks" (code is here). But I don't understand the last point. Could you please explain what "reorgs" means in the "check for reorgs" test?

phahulin commented 6 years ago

Hi, @Natalya11444

By reorgs I mean forks similar to https://etherscan.io/blocks_forked - events when a node has to rewrite its recent history because it received blocks from a "longer" chain. This one may be tricky to implement and needs some experimenting.

One way is to monitor parity logs directly for messages about reorgs:
```
2018-05-18 20:02:52  Reorg to #1088 0x9478…1f84 (0x191f…700c #1087 0x83b9…e4ca )
```
(just an example taken from my local setup, not from a real network)
Another way is to keep the hashes of the last N blocks (say N = 20) in memory and recheck them to see if any of them changed.
A third way, which I haven't tested myself, so I can't be sure it actually works: use the https://wiki.parity.io/JSONRPC-Eth-Pub-Sub-Module.html functionality and subscribe to the newHeads event.
Maybe there's another way.
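
The second option (rechecking the hashes of the last N blocks) could look roughly like the sketch below. It is untested against a real node: `get_block_hash` is a hypothetical callable that maps a block number to its current hash, for example a thin wrapper around the `eth_getBlockByNumber` JSON-RPC call, and the class and method names are made up.

```python
class ReorgDetector:
    """Remember the hashes of the last `depth` blocks and report changes."""

    def __init__(self, get_block_hash, depth=20):
        self.get_block_hash = get_block_hash  # callable: block number -> hash
        self.depth = depth
        self.seen = {}  # block number -> hash observed on the previous check

    def check(self, head_number):
        """Re-read the window ending at head_number.

        Returns the block numbers whose hash differs from the one seen before.
        """
        changed = []
        first = max(0, head_number - self.depth + 1)
        for number in range(first, head_number + 1):
            current = self.get_block_hash(number)
            previous = self.seen.get(number)
            if previous is not None and previous != current:
                changed.append(number)
            self.seen[number] = current
        # Forget blocks that fell out of the window.
        for number in [n for n in self.seen if n < first]:
            del self.seen[number]
        return changed
```

Calling `check` with the current head number on every run returns an empty list when history is stable. A reorg deeper than `depth` blocks would go unnoticed, so `depth` should be larger than the deepest reorg you expect.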

You can simulate reorgs on your local setup by using a simplified network with two validators, similar to this one. On step 4 you start two parity nodes with different validators. If you let them run for some time, you'll see they're building blocks in parallel, since the two nodes don't know about each other yet.

Validator1 history: 12:00:00 Block1 --> 12:00:10 Block2 --> 12:00:20 Block3 --> ...
Validator2 history:      12:00:05 Block1 --> 12:00:15 Block2 --> 12:00:25 Block3 --> ...

Then, when you call ./mate.sh, their enodes are exchanged and one of them will switch to the other one's history. At that moment you'll see a Reorg event in the logs.
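
For the log-based option, a sketch like the following could pick Reorg events out of the parity log. The pattern is derived only from the single sample line above, so the exact log format on a real node (extra fields, different spacing) is an assumption; `REORG_RE` and `find_reorgs` are hypothetical names.

```python
import re

# Timestamp at the start of the line, then "Reorg to #<number> <hash>".
REORG_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s.*?"
    r"Reorg to #(?P<number>\d+)\s+(?P<hash>\S+)"
)


def find_reorgs(lines):
    """Return one dict per log line that reports a reorg."""
    events = []
    for line in lines:
        match = REORG_RE.search(line)
        if match:
            events.append({
                "time": match.group("time"),
                "number": int(match.group("number")),
                "hash": match.group("hash"),
            })
    return events
```

Running this over the log lines written since the previous check would give the reorgs to report for that run.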

natlg commented 6 years ago

Thank you for the answer! I'll follow the guide once the other tests are done and will let you know when I finish.

natlg commented 6 years ago

I deployed monitoring for tests 1-3 on the server. Please check how it works, and I can change it if needed.

The monitor runs on cron (every 30 minutes for now). It calls the web server and sends messages with the latest failed tests for each network to the Slack channel. I used a test channel; here is what the messages look like:
https://1drv.ms/u/s!Au_4rxfmZk63grpvqjnQggqEVik38g

The web server returns test results as JSON. For the Sokol network, http://poatest.westus.cloudapp.azure.com:3000/sokol/api/failed?lastseconds=3600 returns the failed tests for the last hour. "lastseconds" is an optional parameter; without it, all results from the database are returned. http://poatest.westus.cloudapp.azure.com:3000/sokol/api/all?lastseconds=3600 returns both passed and failed test results.
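
To spell out how the optional "lastseconds" parameter behaves, here is a small sketch of the filtering logic. It is not the actual server code: the record fields (`passed`, `ts`) and the function name are assumptions, and it is written in Python only for illustration.

```python
import time


def filter_results(results, failed_only=False, last_seconds=None, now=None):
    """Select results the way the /failed and /all endpoints are described.

    results: list of dicts with hypothetical keys "passed" and "ts" (Unix time).
    last_seconds: if None, no time filter is applied and all results qualify.
    """
    now = time.time() if now is None else now
    selected = []
    for result in results:
        if failed_only and result["passed"]:
            continue
        if last_seconds is not None and result["ts"] < now - last_seconds:
            continue
        selected.append(result)
    return selected
```

With `failed_only=True` and `last_seconds=3600` this corresponds to the first URL; with both left at their defaults it returns everything.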

For the Core network it's similar: http://poatest.westus.cloudapp.azure.com:3000/core/api/failed

Tests also run via cron, and each test is in a separate file. They use command-line arguments to determine which network to check. If no arguments are passed, parameters from the toml file are used. Tests save results to the sqlite database. The test with txs runs on Sokol only, because I don't have an account with real POA yet.

The parity nodes for the two networks both run on the server (they use different ports).

Here is what I plan to add:

Remaining tests
Timeout for tests, preventing duplicate cron job executions
Possibly a UI for test results, with a link to it in the Slack messages
After the reward algorithm changes (as in the issue https://github.com/poanetwork/RFC/issues/16), the test for the payout script will need to change.
Possibly statistics for validator nodes (how many blocks each of them mined, how many blocks with txs, rewards)

The repository is here, with some more information in the README.

phahulin commented 6 years ago

@Natalya11444 thanks! I'll check it out and let you know

natlg commented 6 years ago

OK, but I can't use the server right now; it will be available on June 9th or 10th.

phahulin commented 6 years ago

Hey @Natalya11444 I'm going through the code and it looks great so far, thank you for your work! Would you mind if I open issues/PRs in your repository with some suggestions?

natlg commented 6 years ago

@phahulin, thank you for checking it out, I'm glad you liked it! Sure, please add suggestions in the repository.

natlg commented 6 years ago

I've added a UI for test results, with search and filters, at http://poatest.westus.cloudapp.azure.com:3001 (the repository is here). The remaining tests are also implemented, and I added a timeout and a check for duplicate cron job executions. Tests that send txs are not running on Core.

For now, tests can fail when some validators miss rounds. If the validators stay away for too long, the tests that send txs fail as well, because those validators don't create blocks containing the txs within a few rounds. And when the validators return, a reorg can happen.

phahulin commented 6 years ago

@Natalya11444 the UI looks great, thank you. I'll try to deploy the scripts and UI on our server.

natlg commented 6 years ago

@phahulin cool, I'll add more deployment instructions to the README. Please let me know if there are any issues.

natlg commented 6 years ago

I updated the bash scripts for running tests; they were quite bulky. They can now be added to cron.

igorbarinov commented 6 years ago

https://github.com/poanetwork/poa-network-monitor
