poanetwork / RFC

Technical write-ups that describe modifications to the Protocol, DApps, or any significant topic.

New service for network statistics #7

Open phahulin opened 6 years ago

phahulin commented 6 years ago

Title

  Title: New service for network statistics
  Layer: Services

Abstract

A new service to gather and process network statistics is proposed.

Rationale

At present, network statistics are gathered by a swarm of agents installed on network nodes; the agents send data to a central server, which displays it in a dashboard-like web interface. Together, the agents and the dashboard make up two parts of a single service.

The current implementation has several shortcomings:

  1. the same dashboard script both receives data from agents and runs the web interface, making restarts inconvenient even for stylistic changes (all agents need to reconnect, and data can be lost in the process)
  2. the dashboard script is difficult to scale
  3. agents gather data only from the eth client, but it might be helpful to gather other kinds of data (CPU usage, memory, network)
  4. the dashboard stores data in memory, so there is no way to analyze historic entries

Specification

  1. Agents and the dashboard should be functionally separated and run as independent HTTP services that are easy to scale and restart
  2. Agents should be able to gather statistics of different kinds, on different time intervals (to avoid overloading a node)
  3. New statistics metrics should be easy to add, so they should be placed in separate modules (plugins) written in easy-to-reuse, platform-independent languages (e.g. JavaScript)
  4. Agents should be able to send statistics to multiple different receivers (dashboard, external databases, ...)
  5. The dashboard must work as one of the possible receivers
  6. The dashboard should consist of four parts: (1) an endpoint to receive statistics from agents, (2) a database to store recent data for analysis, (3) scripts to analyze the data, (4) a web interface to display network state and statistics

Implementation

  1. The proposed implementation of the agents is the following:
    • each agent should be started with identification info (node full name, mining key, etc.)
    • the main work should be done inside statistics-gathering plugins
    • the main process should either call a plugin on a specified, configurable interval or load it once on startup
    • each plugin gathers statistics and emits them back to the main process
    • the main process then checks which receivers are defined for the event and sends the statistics to the corresponding "sinks"
    • each sink reformats the input data, adds authorization keys, and sends it to the receiver

The main process should be written in such a way that uncaught exceptions occurring inside any of the plugins, sinks or receivers do not lead to a crash and do not affect other plugins.
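
A minimal sketch of such a main process in Node.js; the plugin/sink module names, routing config and environment variables below are hypothetical illustrations, not a definitive implementation:

    // Hypothetical agent main process: plugins gather stats, sinks deliver them.
    const plugins = {
      blocks: require('./plugins/blocks'), // polls the eth client for new blocks
      system: require('./plugins/system'), // cpu / memory / network usage
    };
    const sinks = {
      dashboard: require('./sinks/dashboard'), // POSTs to the dashboard endpoint
      influxdb:  require('./sinks/influxdb'),  // writes to an external database
    };
    // Which sinks receive which plugin's events (spec item 4).
    const routing = { blocks: ['dashboard', 'influxdb'], system: ['influxdb'] };
    // Identification info the agent is started with.
    const identity = { nodeName: process.env.NODE_NAME, miningKey: process.env.MINING_KEY };

    function dispatch(event, stats) {
      for (const name of routing[event] || []) {
        // A failing sink must not crash the agent or affect other sinks.
        Promise.resolve()
          .then(() => sinks[name].send(identity, event, stats))
          .catch((err) => console.error(`sink ${name} failed:`, err.message));
      }
    }

    for (const [event, plugin] of Object.entries(plugins)) {
      // Each plugin runs on its own configurable interval so a node is not overloaded.
      setInterval(() => {
        Promise.resolve()
          .then(() => plugin.gather())
          .then((stats) => dispatch(event, stats))
          .catch((err) => console.error(`plugin ${event} failed:`, err.message));
      }, plugin.intervalMs || 5000);
    }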

Examples of plugins are: eth client statistics (e.g. new blocks), CPU usage, memory, network.

Examples of receivers are: the dashboard, external databases.

  2. The proposed implementation of the dashboard is the following:
    • the backend should only accept data from requests that send the correct network-wide secret key
    • the backend should accept data on new blocks and health checks and store recent entries in a database (e.g. redis)
    • there must be scripts that analyze the latest data and perform network-wide health checks (e.g. has any validator missed a turn?)
    • the web interface should use websockets to establish a connection with the backend, to receive updates as fast as possible
    • the web interface should display the nodes list and network statistics (same way as now)
    • the nodes list should be filterable and sortable, with the ability to pin certain nodes on top (same way as now)
    • when a node is clicked in the nodes list, detailed info should be loaded from governance contracts, e.g. validator name, maybe a link to an introduction on the forum (this should be configurable)
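
For illustration, a minimal sketch of the receiving endpoint, assuming Node.js with Express, ioredis and socket.io; the route, header name and redis key layout are made up:

    const express = require('express');
    const http = require('http');
    const Redis = require('ioredis');
    const { Server } = require('socket.io');

    const app = express();
    app.use(express.json());
    const server = http.createServer(app);
    const io = new Server(server);      // pushes updates to the web interface
    const redis = new Redis();          // stores recent entries for analysis

    const SECRET = process.env.NETSTATS_SECRET; // network-wide secret key

    app.post('/stats', async (req, res) => {
      // Only accept data from agents that present the shared secret.
      if (req.get('X-Netstats-Secret') !== SECRET) return res.sendStatus(403);

      const { nodeName, event, stats } = req.body;
      // Keep only the most recent entries per node and event type.
      const key = `stats:${nodeName}:${event}`;
      await redis.lpush(key, JSON.stringify(stats));
      await redis.ltrim(key, 0, 999);

      io.emit('update', { nodeName, event, stats }); // websocket push to the UI
      res.sendStatus(204);
    });

    server.listen(3000);
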
maratP commented 6 years ago

Some initial thoughts:

Dashboard sorting was already listed. Just want to add that sorting should have an option to show only validator nodes, only bootnodes, all nodes, etc.

Collect stats on the node. We may look into the "collectd" daemon. It is one of the easiest ways to collect common metrics from nodes, but ALSO it has a plugin system, which means we could expand it to our needs.
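
For illustration, a minimal collectd.conf fragment that gathers common system metrics locally and forwards them to a central collector via the network plugin (the server address and port are placeholders):

    # Gather CPU, memory and network-interface metrics locally
    LoadPlugin cpu
    LoadPlugin memory
    LoadPlugin interface
    LoadPlugin network

    <Plugin network>
      # Forward metrics to the central aggregator (placeholder address/port)
      Server "collector.example.org" "25826"
    </Plugin>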

On the receiving end, there would be an aggregator/collector script that collects data for some set period of time, then passes it on / stores it.

Since there would be several (many) nodes sending data at the same time, there should be another script that performs load balancing and works in parallel with the collector above.

We should look into using InfluxDB. It has an HTTP endpoint, so we could just POST data to it from a shell command or within an application.

There should be some logic to delete data that is not needed for historic viewing. For example, we will keep block info and anything blockchain-specific, but could delete CPU usage after N days.
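
With InfluxDB (mentioned above), this kind of expiry could be expressed as a retention policy; a sketch in InfluxQL, with the database and policy names made up:

    -- keep measurements in "mydb" for 14 days, then drop them automatically
    CREATE RETENTION POLICY "two_weeks" ON "mydb" DURATION 14d REPLICATION 1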

Agents could be written in JS or Python or both. Several lightweight scripts would perform their duties and send data to the "collector".

maratP commented 6 years ago

InfluxDB post example:

curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
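
And a corresponding data write, assuming the InfluxDB 1.x line protocol (the measurement and tag names are made up):

    curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load,node=validator1 value=0.64'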

Downside: looks like InfluxDB is no longer actively developed...

maratP commented 6 years ago

Diamond is a python daemon that collects system metrics and publishes them to Graphite (and others). It is capable of collecting cpu, memory, network, i/o, load and disk metrics. Additionally, it features an API for implementing custom collectors for gathering metrics from almost any source.

https://github.com/python-diamond/Diamond

igorbarinov commented 6 years ago

@maratP Python is not in our stack at the moment.

Preferred languages are Elixir, Rust, JavaScript (Node), in descending order.

johnnynuuma commented 6 years ago

My understanding is that the current netstats consists of two components.

The fundamental issue is that these components are tightly coupled. So the question is how to decouple them and make them extensible in a reasonable way.

Hmmm, my thoughts ....

Since you are potentially interested in displaying data from multiple independent data sources, it may be prudent to look at a dashboard framework like Grafana (https://grafana.com/). NOTE: not an endorsement, just a visualization tool for requirements. Is this kind of what you are thinking?

If you use a front-end framework like the above with a publish/subscribe model, things are decoupled. This also gives flexibility as to when and where the data is stored persistently (if ever), since that is deferred to the subscriber implementation. It also gives a lot of flexibility to the publisher implementation (it could be anything) as long as the publisher and subscriber agree on a data format. (Here I am assuming these "monitoring-jobs" still run locally on each node.) I suppose it would be nice if these "monitoring-jobs":

maratP commented 6 years ago

@igorbarinov, good to know about language preferences

John, agreed on Grafana. My research and what I described above also point to Grafana.

6proof commented 6 years ago

Hello everyone--

Most of the tools described here are already available in the great open-source project Libre NMS: https://www.librenms.org/. Full, very active GitHub: https://github.com/librenms/librenms. There is an active demo on the site, so take it for a spin. Libre NMS has a full API, messaging and alerting systems, native iOS and Android apps, etc. Very robust, configurable for almost all uses. It has built-in hooks for collectd, RRD, and almost all of the standard open-source monitoring, alerting and graphing tools.

6proof commented 6 years ago

Tools like Libre NMS - https://www.librenms.org - provide real-time monitoring and notification, and create incredible historic graphs, allowing us to visually see patterns over time that are not apparent in snapshot images. The storage problem is solved, so there is no need to pick and choose which statistics to retain. They encourage forking and component adoption; good tool.

jflowers1974 commented 6 years ago

It does seem like one would need a db for the sheer number of data points. One of my machines is running parity (pointing to core), and I could see it dumping into something like CockroachDB (or Spanner, in which case one could easily use Google Charts, I believe?). Then have a custom dashboard as one sees fit. Oh, another db option that I've used: FaunaDB; it's nice too.

6proof commented 6 years ago

RRDtool is built specifically to handle this sort of data in a stable, fixed-size database. Libre NMS (like most other monitoring tools) uses it by default. Very stable, efficient tool; sort of the bedrock of monitoring systems for the past 20 years and the foreseeable future.