Closed 0xspeedybird closed 1 year ago
Yes please!
Having Loki as part of my monitoring and alerting setup has helped me immensely with tracking broadcaster/orchestrator/transcoder issues and makes it easier to provide detailed information when opening GH issues.
Making this easily accessible to all O/T's on the network would make the entire network more robust 👍
Now that I run multiple nodes, monitoring them has become increasingly time consuming, and it is difficult to keep on top of some of the things I used to monitor so closely, specifically the log files. Additionally, the current method of storing this data on each node is inefficient and requires more node resources than should be necessary for Livepeer operation. Having this data on a dedicated server, freeing up Livepeer nodes to focus on handling Livepeer work, just makes sense.
To Marco's point above regarding providing detailed information for GH issues, this week I was asked to provide logs for an issue and it took me more than 30 mins to go through the piles of data to find what was requested. Having everything in a single place with the ability to run a query for the request would have been ideal.
If anything would make life easier as an orchestrator, it would be Loki!!! Would love to see it up and running!
This is an awesome idea! Let's get it!!!
Hey @0xspeedybird! We'd love to fund this, especially given the community support you received! I'm hansy#9576 on Discord. Let's set up a quick chat to align on scope, deliverables, and grant amount.
Hey @hansy, I know you were chatting with @mikezupper on this. He should be reaching out with an update. At the moment, the effort has been completed. A forum post with the details is here: https://forum.livepeer.org/t/guide-production-log-monitoring-and-alerting/2004
Give a 3 sentence description for your proposal.
Mike Zupper and I are expanding on the intermediate Livepeer education to strengthen the community and network.
Being able to monitor your Livepeer node is of the utmost importance for robust operation and increased profitability. It also ensures the health of the Livepeer network. As a node operator, doing so normally requires jumping into various log files and dashboards to stitch together a cohesive view of node health. This proposal will deliver source code and a walkthrough that provide a one-stop shop for accessing logs and generating proactive notifications.
Describe the problem you are solving.
Most questions that come up when running Livepeer can be answered by reviewing the logs. They provide a quick view of whether the software is healthy or not. More advanced users might use metrics generated by Livepeer as well. Regardless of the question at hand (Is the software running? Why did it stop?), logs generated by Livepeer have to be looked over for any issues. All logs generated by the Livepeer process are available only when locally connected to the system running the Livepeer software. Once there, you might have to review multiple, detailed log lines across files. This can be tedious when you need to rapidly determine system health or diagnose an issue. This is especially true when on a mobile device or when attempting to find specific log messages by date or other keywords.
Describe the solution you are proposing and how it will have a positive impact on the Livepeer developer ecosystem.
An open source approach to log aggregation and alerts using Docker, Loki, Promtail, Alertmanager (Prometheus), and Grafana. Promtail will retrieve logs from each node and send them to a Loki server. Loki will then store this indexed log data and expose it for searching via Grafana. Grafana will then enable multiple metrics to be captured, searched, and reported on across all installed nodes. Loki and Prometheus will also be integrated with Alertmanager to send alerts via Telegram based on key configurable conditions. Sample alerts and metrics/queries will be provided. This solution will lay the groundwork for node operators to easily extend it to proactively manage their systems and to add further alerts and notification channels (email, PagerDuty, etc.).
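To make the data flow above concrete, here is a minimal Python sketch of the two HTTP interactions involved: the kind of request body an agent such as Promtail sends to Loki's push endpoint, and the kind of query URL a client such as Grafana issues to search logs by keyword and time range. This is an illustration only, not the code delivered by this proposal. The label names `job` and `node`, the node name `orch-1`, the base URL, and the log lines are all assumptions made for the example.

```python
import time
from urllib.parse import urlencode

# Assumption: a Loki server reachable at this address (3100 is Loki's usual port).
LOKI_URL = "http://localhost:3100"


def build_push_payload(labels, lines, now_ns=None):
    """Build the JSON body for POST /loki/api/v1/push.

    Loki expects streams of [timestamp, line] pairs, where the timestamp
    is a string holding Unix time in nanoseconds.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Offset each line by 1 ns so that ordering within the batch is preserved.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def logql_for_keyword(job, node, keyword):
    """Build a LogQL query: a label selector plus a line filter (|=).

    Note: the inputs are not escaped; a real tool would need to escape
    quotes and backslashes in them.
    """
    return f'{{job="{job}", node="{node}"}} |= "{keyword}"'


def build_query_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a URL for GET /loki/api/v1/query_range over a time window."""
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Hypothetical labels and placeholder log lines, for illustration only.
    payload = build_push_payload(
        {"job": "livepeer", "node": "orch-1"},
        ["example log line 1", "example log line 2"],
    )
    print(payload)

    one_hour_ns = 3600 * 10**9
    end = time.time_ns()
    query = logql_for_keyword("livepeer", "orch-1", "error")
    print(build_query_url(LOKI_URL, query, end - one_hour_ns, end))
```

The sketch only builds the payload and the URL, so it runs without a Loki server. In a real deployment Promtail handles the push side and Grafana handles the query side, so node operators would not normally write this by hand.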
Describe why you are the right team with the capability to build this.
Mike Zupper and I run dozens of nodes. We also have provided support for other node operators to implement enterprise deployments. This leverages our decades of experience building software products from scratch and operating them at scale. The applications we have built have serviced millions of users at enterprise scale. This experience spans many deployment models, in private data centers and multi-region, multi-vendor cloud deployments. We have the know-how and experience to build a usable and reliable solution.
Describe the scope of the project including a 3 month timeline and milestones.
Phase 1 - Installation and Setup - 12/19/2022 - 1/10/2023
Phase 2 - Custom Queries and Dashboard - 1/11/2023 - 1/20/2023
Phase 3 - Integration with Alertmanager - 1/23/2023 - 1/31/2023
Key Deliverables
Please estimate hours spent on project based on the above and how much funding you will need.
Phase 1 - 40 hours
Phase 2 - 20 hours
Phase 3 - 30 hours
Total hours: 90 at $125/hr
Total cost: $11,250