Description:
As we improve our monitoring and alerting capabilities, the next logical addition is collecting node-based metrics. Collecting these metrics allows PLG and us to monitor and alert on spikes in memory, CPU, disk, and other node resources. This will give the team more concrete ways to identify and debug performance-based issues in the TDP ecosystem.
Acceptance Criteria:
[ ] Node exporter integrated with all backend apps (and potentially frontend apps too, if we can get it to work)
[ ] Common alerts for CPU, RAM, etc. implemented
[ ] Alertmanager (if deployed) is updated/re-deployed to take the new alerting config into account
[ ] Testing Checklist has been run and all tests pass
[ ] README is updated, if necessary
Tasks:
[ ] Integrate node exporter locally
[ ] Update Prometheus configs to scrape node metrics
[ ] Update Alertmanager configs to alert on node metrics
[ ] Update backend/frontend deploy manifests to download and deploy the node exporter binary
[ ] Update deployment script network config to let Prometheus scrape the node exporter metrics
[ ] Run Testing Checklist and confirm all tests pass
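For the local integration task, a quick way to confirm the exporter is serving usable data is to parse its text output and derive one number an alert would care about. The sketch below is illustrative only: it assumes node exporter is listening on its default port 9100 on localhost, that the Linux memory metrics `node_memory_MemTotal_bytes` and `node_memory_MemAvailable_bytes` are exposed, and that samples carry no trailing timestamps. The helper names are hypothetical and not part of any existing TDP code.

```python
from urllib.request import urlopen

# Assumption: node exporter runs locally on its default port (9100).
NODE_EXPORTER_URL = "http://localhost:9100/metrics"


def fetch_metrics_text(url=NODE_EXPORTER_URL, timeout=5.0):
    """Fetch the raw Prometheus text exposition from the exporter."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse exposition text into a {series: value} dict.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    Assumption: samples have no trailing timestamp, so the value is
    everything after the last space on the line.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


def memory_used_ratio(samples):
    """Return the fraction of memory in use, between 0.0 and 1.0."""
    total = samples["node_memory_MemTotal_bytes"]
    available = samples["node_memory_MemAvailable_bytes"]
    return 1.0 - available / total
```

Against a running exporter, `memory_used_ratio(parse_metrics(fetch_metrics_text()))` should return a value between 0 and 1. The same ratio could serve as a starting point for the RAM alert expression, with the actual threshold still to be decided.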
Notes:
Supporting Documentation:
Open Questions: