Intermediate Guide to Livepeer Production Log Monitoring and Alerting

0xspeedybird commented 1 year ago

Give a 3 sentence description for your proposal.

Mike Zupper and I are expanding on the intermediate Livepeer education to strengthen the community and network.

Being able to monitor your Livepeer node is of the utmost importance for robust operation and increased profitability. It also helps ensure the health of the Livepeer network. For a node operator, monitoring normally requires jumping between various log files and dashboards to stitch together a cohesive view of node health. This proposal will deliver source code and a walkthrough that provide a one-stop shop for accessing logs and generating proactive notifications.

Describe the problem you are solving.

Most questions that come up when running Livepeer can be answered by reviewing the logs. They provide a quick view of whether the software is healthy or not. More advanced users might use metrics generated by Livepeer as well. Regardless of the question at hand (Is the software running? Why did it stop?), the logs generated by Livepeer have to be checked for issues. These logs are available only when you are connected locally to the system running the Livepeer software. Once there, you might have to review many detailed log lines across multiple files. This is tedious when you need to rapidly determine system health or diagnose an issue, especially on a mobile device or when searching for specific log messages by date or keyword.

Describe the solution you are proposing and how it will have a positive impact on the Livepeer developer ecosystem.

An open source approach to log aggregation and alerts using Docker, Loki, Promtail, AlertManager (Prometheus) and Grafana. Promtail will retrieve logs from each node and send them to a Loki server. Loki will then store and index this log data and expose it for searching via Grafana. Grafana will then enable multiple metrics to be captured, searched, and reported on across all installed nodes. Loki and Prometheus will also be integrated with AlertManager to send alerts via Telegram based on key configurable conditions. Sample alerts and metrics/queries will be provided. This solution will lay the groundwork for node operators to easily extend it to proactively manage their systems and to add further alerts and notification channels (email, PagerDuty, etc.).
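As a rough illustration of the first hop in that pipeline, the sketch below builds the kind of JSON body that Loki's `POST /loki/api/v1/push` endpoint accepts. Promtail normally does this for you; the label names, host name, and log lines here are hypothetical placeholders, not real Livepeer output.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON body for Loki's POST /loki/api/v1/push endpoint."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    # Timestamps are offset by 1 ns per line so the entries keep their order.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"job": "livepeer", "host": "orchestrator-1"},  # hypothetical labels
    ["placeholder log line 1", "placeholder log line 2"],  # not real logs
    now_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

The labels attached to a stream are what Loki indexes, so a small, stable label set (for example one label per node) is usually what makes later searches across nodes practical.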

Describe why you are the right team with the capability to build this.

Mike Zupper and I run dozens of nodes. We have also helped other node operators implement enterprise deployments. This proposal draws on our decades of experience building software products from scratch and operating them at scale. The applications we have built have served millions of users at enterprise scale. This experience spans many deployment models, from private data centers to multi-region, multi-vendor cloud deployments. We have the know-how and experience to build a usable and reliable solution.

Describe the scope of the project including a 3 month timeline and milestones.

Phase 1 - Installation and Setup - 12/19/2022 - 1/10/2023

Develop and document the installation process
- Loki, Promtail installation via Docker Compose
- Loki, Promtail configuration to pull and index log files
- Using Loki to query logs with filters and labels and visualize them in Grafana
- Grafana configuration via Docker Compose to leverage Loki
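To give a sense of what querying logs with filters and labels looks like, here is a minimal sketch that assembles a LogQL query and the matching URL for Loki's `GET /loki/api/v1/query_range` endpoint. The base URL, label names, and filter text are assumptions for illustration, and the helper does not escape quotes inside label values.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url, selector, contains=None, limit=100):
    """Return (logql, url) for a Loki query_range request."""
    # Label matchers go inside braces, e.g. {host="a",job="b"}.
    matchers = ",".join(f'{key}="{value}"' for key, value in sorted(selector.items()))
    logql = "{" + matchers + "}"
    if contains:
        # |= is the LogQL line filter for "line contains this text".
        logql += f' |= "{contains}"'
    params = {"query": logql, "limit": limit}
    url = f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"
    return logql, url


logql, url = build_query_range_url(
    "http://localhost:3100",  # assumed address of the Loki server
    {"job": "livepeer", "host": "orchestrator-1"},  # hypothetical labels
    contains="error",
    limit=50,
)
print(logql)
print(url)
```

The same LogQL string can be pasted into Grafana's Explore view once Loki is added as a data source.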

Phase 2 - Custom Queries and Dashboard - 1/11/2023 - 1/20/2023

Dashboard with key metrics from Loki to enable health monitoring at a glance
- Several common errors/warnings
- Blockwatch status to monitor for possible drift
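Dashboard panels of this kind are typically driven by LogQL metric queries. The sketch below only builds query strings using `count_over_time`; the stream selector and the search terms are hypothetical and would need to match the actual Livepeer log text.

```python
def count_matches(selector, needle, window="5m"):
    """Build a LogQL metric query counting lines that contain `needle`."""
    # count_over_time counts matching log lines per stream over the window;
    # sum() collapses the per-stream counts into a single series.
    return f'sum(count_over_time({selector} |= "{needle}" [{window}]))'


SELECTOR = '{job="livepeer"}'  # hypothetical stream selector

panels = {
    # Search terms below are placeholders, not confirmed Livepeer messages.
    "errors_5m": count_matches(SELECTOR, "error"),
    "warnings_5m": count_matches(SELECTOR, "warning"),
    "blockwatch_15m": count_matches(SELECTOR, "blockwatch", window="15m"),
}

for name, query in panels.items():
    print(f"{name}: {query}")
```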

Phase 3 - Integration with AlertManager - 1/23/2023 - 1/31/2023

AlertManager installation via Docker Compose
AlertManager configurations for Prometheus metrics
AlertManager configurations for Loki
Telegram Bot setup to receive alerts generated by AlertManager
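For orientation, the sketch below shows one way an alert could be turned into a Telegram message through the Bot API `sendMessage` method. It is not the proposed implementation: AlertManager can be wired to Telegram in other ways, and the token, chat ID, and alert contents are placeholders. The request is only constructed, not sent.

```python
import json
import urllib.request


def format_alert(alert):
    """Render one AlertManager-style alert dict as a short text line."""
    status = alert.get("status", "unknown").upper()
    name = alert.get("labels", {}).get("alertname", "unnamed")
    summary = alert.get("annotations", {}).get("summary", "")
    return f"[{status}] {name}: {summary}"


def build_telegram_request(bot_token, chat_id, text):
    """Build (but do not send) a Telegram Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


alert = {  # hypothetical alert, shaped like an AlertManager webhook entry
    "status": "firing",
    "labels": {"alertname": "LivepeerErrors"},
    "annotations": {"summary": "placeholder summary"},
}
message = format_alert(alert)
request = build_telegram_request("PLACEHOLDER_TOKEN", "123456", message)
print(message)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the example runs without a real bot token.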

Key Deliverables

Create Livepeer forum post outlining:
- Setup of Promtail, Loki, and AlertManager
- Examples of queries and executing them in Grafana
- Configuration of AlertManager to send alerts via Telegram
GitHub repo containing the code

Please estimate hours spent on project based on the above and how much funding you will need.

Phase 1 - 40 hours

Phase 2 - 20 hours

Phase 3 - 30 hours

Total hours: 90 at $125/hr

Total cost: $11,250

AuthorityNull commented 1 year ago

Yes please!

stronk-dev commented 1 year ago

Having Loki as part of my monitoring and alerting setup has helped me immensely with tracking broadcaster/orchestrator/transcoder issues and makes it easier to provide detailed information when opening GH issues.

Making this easily accessible to all O/T's on the network would make the entire network more robust 👍

papabear99 commented 1 year ago

Running multiple nodes, the task of monitoring them has become increasingly time consuming, and it is difficult to keep on top of some of the things I used to monitor so closely, specifically the log files. Additionally, the current method of storing this data on each node is inefficient and requires more node resources than should be necessary for Livepeer operation. Having this data on a dedicated server, freeing up Livepeer nodes to focus on handling Livepeer work, just makes sense.

To Marco's point above regarding providing detailed information for GH issues, this week I was asked to provide logs for an issue and it took me more than 30 mins to go through the piles of data to find what was requested. Having everything in a single place with the ability to run a query for the request would have been ideal.

Pon-node commented 1 year ago

If anything would make life easier as an orchestrator, it would be Loki!!! Would love to see it up and running!

RyanC92 commented 1 year ago

This is an awesome idea! Let's get it!!!

hansy commented 1 year ago

Hey @0xspeedybird! We'd love to fund this, especially given the community support you received! I'm hansy#9576 on Discord. Let's set up a quick chat to align on scope, deliverables, and grant amount.

0xspeedybird commented 1 year ago

Hey @hansy, I know you were chatting with @mikezupper on this. He should be reaching out with an update. At the moment, the effort has been completed. A forum post with the details is here: https://forum.livepeer.org/t/guide-production-log-monitoring-and-alerting/2004

livepeer / grants

Intermediate Guide to Livepeer Production Log Monitoring and Alerting #105