I've introduced some operational metrics to track the health of the provider and its associated collectors. They show whether the provider and collectors are healthy, and how each collector is performing independently.
The basic structure for the operational metrics is:
cloudcost_exporter_scrapes_totals
Should always be increasing. If it ever stops increasing, there's a
problem with the provider.
cloudcost_exporter_last_scrape_error
Should always be 1. Alert when the value is 0 or nonexistent for more
than 5 minutes.
cloudcost_exporter_last_duration_seconds
Gauge for how long the provider took in total to collect metrics from
every collector. Alert if it stays above 1m for an extended period of time.
cloudcost_exporter_collector_duration_seconds
Gauge for how long each collector took. Helps triage and understand how
each collector is performing.
If one collector takes significantly longer than the others, you can run
the "offender" on a dedicated exporter.
cloudcost_exporter_collector_last_scrape_error
Gauge to track each collector's last run. 1 if it's fine, 0 to indicate
an error.
Example operational dashboard: