multiversx / mx-chain-go

⚡ The official implementation of the MultiversX blockchain protocol, written in golang.
https://multiversx.com
GNU General Public License v3.0

Infrastructure Observability #2813

Closed aWN4Y25pa2EK closed 3 years ago

aWN4Y25pa2EK commented 3 years ago

Hi ELROND Team, as a recommendation to help the validator network monitor its nodes, I thought it would be ideal if the node services exported their metrics for Prometheus [1]. Validators could then scrape them into a Prometheus server and build dashboards in Grafana Cloud (Free Plan).

Metrics

Once the services export their metrics, scrapers can pull them into the Prometheus server, which then acts as a data source for the Grafana Cloud service where the dashboards are created.

Logs

The logs can be shipped by the Loki agent to a Loki ingester [2]; the ingester then acts as a Loki data source for the Grafana Cloud service.

One consideration with this pull model is that the TCP port of the metrics exporter must be reachable by the Prometheus scrapers. Validators could either run the scraper locally or whitelist the scraper's address in the firewall.

Once everything is set up in Grafana, you can easily integrate PagerDuty for the alert thresholds, which also lets you use webhooks to third-party services such as Slack, Microsoft Teams, or Telegram.

In terms of security, Grafana provides a list of source IP ranges that can be whitelisted on the systems [3].

observability

References

[1] https://prometheus.io/docs/guides/go-application/
[2] https://grafana.com/docs/loki/latest/architecture/
[3] https://grafana.com/api/hosted-grafana/source-ips

aWN4Y25pa2EK commented 3 years ago

Quick update:

After running a test node, I can confirm that there is no need to open any additional TCP/UDP ports, as both the Prometheus and Loki endpoints are managed by Grafana Cloud.

On the local node you must install Promtail and Prometheus, with Prometheus configured to use remote_write (push).

iulianpascalau commented 3 years ago

Hello, we have not tried, and do not currently recommend, any specific set of monitoring tools. We are working on a custom node-health tool that will interpret Elrond-specific values read from the metrics to determine whether a node has a problem. Anyone can write a tool that passes the data from the Elrond node to the Prometheus agent.
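As a rough illustration of such a bridging tool, the sketch below converts a JSON status payload into Prometheus text exposition lines. The payload shape (`data.metrics` holding a flat map) and the metric names in the sample are assumptions made for illustration, not a confirmed description of the node's API. The function `toPrometheus` is hypothetical, and it assumes the keys are already valid Prometheus metric names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// statusResponse mirrors the ASSUMED payload shape: {"data":{"metrics":{...}}}.
type statusResponse struct {
	Data struct {
		Metrics map[string]interface{} `json:"metrics"`
	} `json:"data"`
}

// toPrometheus emits one "name value" line per numeric metric, sorted by
// name. Non-numeric values (strings, booleans) are skipped because the
// exposition format only carries numeric samples.
func toPrometheus(body []byte) (string, error) {
	var resp statusResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode status: %w", err)
	}
	names := make([]string, 0, len(resp.Data.Metrics))
	for name, v := range resp.Data.Metrics {
		if _, ok := v.(float64); ok { // encoding/json decodes JSON numbers as float64
			names = append(names, name)
		}
	}
	sort.Strings(names)

	var sb strings.Builder
	for _, name := range names {
		v := resp.Data.Metrics[name].(float64)
		sb.WriteString(name + " " + strconv.FormatFloat(v, 'f', -1, 64) + "\n")
	}
	return sb.String(), nil
}

func main() {
	// Sample payload with hypothetical metric names.
	sample := []byte(`{"data":{"metrics":{"erd_nonce":1024,"erd_node_type":"validator","erd_connected_peers":37}}}`)
	out, err := toPrometheus(sample)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

The resulting text could be served on a local `/metrics` endpoint for a Prometheus agent to scrape and forward with remote_write.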