lstrojny / homebridge-prometheus-exporter

Prometheus metrics exporter for homebridge accessories
Apache License 2.0

Counter metrics #272

Open arionl opened 1 year ago

arionl commented 1 year ago

The HomeBridge Prometheus exporter is excellent. However, I have a question that is probably related to its architecture, or the architecture of HomeBridge in general. From what I can see, all metrics are exposed as Gauges, i.e. single snapshot metrics. For example, a temperature sensor reports whatever value it has at the time of scraping. This works great for most sensor types. However, there are use cases where other metric types would be useful. This is partially discussed in issue #34.

For sensors that report "on" and "off" status, it would be useful if there was a way to expose a Prometheus Counter metric. This metric would just be an incrementing count of the number of events that have occurred for the sensor. In this case, it would be the total number of "on" and "off" events, each as a separate metric. The count could start from whenever the exporter was started (since most people will be running it long-term, with weeks between restarts). If this were made available, it would be much easier to measure the number of events over time. This could lead to graphs like "number of times a motion sensor was triggered in the last hour". I can see a lot of cases where this would be useful. You could write Prometheus rules to report on patterns (if a motion sensor has >1 event between 12am-5am, start a workflow via AlertManager).

For the most simple case, if this functionality were available, I could see how many times per hour my kids open the door to the refrigerator :) Just a thought and figured I'd capture it in an Issue here to see if this would be possible. Thanks!

lstrojny commented 1 year ago

Thank you for the suggestion.

To share a bit of background: the exporter queries the homebridge state every refresh_interval (default is 60) seconds and publishes metrics found. So the way in which we could count events is to detect if any change has happened since the last poll. That would mean that a pair of on/off events that happened inside of the 60 second timeframe would be missed. That could be an alright limitation though.
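To make the polling approach concrete, here is a minimal sketch of counting on/off transitions by comparing two consecutive polls. The `OnOffSnapshot` and `TransitionCounts` types and the `countTransitions` function are assumptions made up for illustration; the exporter's real device and service types look different. As described above, an on/off pair that starts and ends between two polls leaves both snapshots identical and is therefore not counted.

```typescript
// Hypothetical types for illustration only; not the exporter's real types.
type OnOffSnapshot = Map<string, boolean> // sensor id -> "On" value at poll time

interface TransitionCounts {
    on: number // observed off -> on transitions
    off: number // observed on -> off transitions
}

// Compare two consecutive polls and bump the counters for every sensor whose
// state differs. A sensor seen for the first time only establishes a baseline.
function countTransitions(
    previous: OnOffSnapshot,
    current: OnOffSnapshot,
    counters: Map<string, TransitionCounts>,
): void {
    current.forEach((isOn, id) => {
        const counts = counters.get(id) ?? { on: 0, off: 0 }
        counters.set(id, counts)

        const wasOn = previous.get(id)
        if (wasOn === undefined || wasOn === isOn) {
            return // first sighting or no visible change since the last poll
        }
        if (isOn) {
            counts.on += 1
        } else {
            counts.off += 1
        }
    })
}

// Three polls: the door opens between polls 1 and 2 and closes between 2 and 3.
const polls: OnOffSnapshot[] = [
    new Map([['fridge_door', false]]),
    new Map([['fridge_door', true]]),
    new Map([['fridge_door', false]]),
]

const counters = new Map<string, TransitionCounts>()
let previousPoll: OnOffSnapshot = new Map()
for (const currentPoll of polls) {
    countTransitions(previousPoll, currentPoll, counters)
    previousPoll = currentPoll
}

console.log(counters.get('fridge_door')) // { on: 1, off: 1 }
```

Keeping the counters in a map separate from the snapshots means the totals keep growing across polls while only the latest snapshot has to be remembered.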

On the topic of preserving the counter: since plugins are allowed to store persistent data, it wouldn't be a problem to store the counters, so that they would survive a restart.
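As a rough sketch of what storing the counters could involve, the following turns a counter map into a JSON string and back. Everything here is hypothetical (`PersistedCounters`, `serializeCounters`, `deserializeCounters`, and the `version` field); none of it exists in the exporter. Reading and writing the actual file is left to the caller. As far as I know, a plugin can get the storage location from `api.user.storagePath()`, but treat that as an assumption to verify against the Homebridge API.

```typescript
// Hypothetical on-disk shape; not part of the exporter today.
interface PersistedCounters {
    version: 1
    counters: Record<string, number> // metric key -> running total
}

function serializeCounters(counters: Map<string, number>): string {
    const state: PersistedCounters = { version: 1, counters: {} }
    counters.forEach((value, key) => {
        state.counters[key] = value
    })
    return JSON.stringify(state)
}

// Returns an empty map for missing, corrupt, or unexpected input, so that a
// damaged state file resets the counters instead of crashing the plugin.
function deserializeCounters(raw: string | undefined): Map<string, number> {
    const restored = new Map<string, number>()
    if (raw === undefined) {
        return restored
    }

    let parsed: unknown
    try {
        parsed = JSON.parse(raw)
    } catch {
        return restored
    }
    if (typeof parsed !== 'object' || parsed === null) {
        return restored
    }

    const state = parsed as Partial<PersistedCounters>
    if (state.version !== 1 || typeof state.counters !== 'object' || state.counters === null) {
        return restored
    }

    for (const key of Object.keys(state.counters)) {
        const value = state.counters[key]
        // Counters are non-negative, so anything else is dropped as invalid.
        if (typeof value === 'number' && Number.isFinite(value) && value >= 0) {
            restored.set(key, value)
        }
    }
    return restored
}
```

Falling back to an empty map is safe on the Prometheus side, because a counter that starts again from zero is handled as a reset.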

Since I probably won’t have enough time or motivation to pick this up any time soon, let me document how that could work so you @arionl or anybody else interested could pick this up.

It all starts in platform.ts startHapDiscovery:

    private startHapDiscovery(): void {
        this.log.debug('Starting HAP discovery')
        discover({ log: this.log, config: this.config })
            .then((devices) => {
                const metrics = aggregate(devices, new Date())
                this.httpServer.onMetricsDiscovery(metrics)
                this.log.debug('HAP discovery completed, %d metrics discovered', metrics.length)
                this.startHapDiscovery()
            })
            .catch((e) => {
                this.log.error('HAP discovery error', e)
            })
    }

It invokes the discovery process and once the devices have been received it aggregates the device metrics. So this should probably change to something like this:

    private startHapDiscovery(): void {
        this.log.debug('Starting HAP discovery')
        discover({ log: this.log, config: this.config })
            .then((devices) => {
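                // Load the state captured by the previous poll (helper still to be written)
                const previousDevices = readPersistedDevicesState()

                // Compare the current state against the previous one while aggregating
                const metrics = aggregate(devices, previousDevices, new Date())
                this.httpServer.onMetricsDiscovery(metrics)

                // Remember the current state for the next poll (helper still to be written)
                writeDevicesState(devices)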

                this.log.debug('HAP discovery completed, %d metrics discovered', metrics.length)
                this.startHapDiscovery()
            })
            .catch((e) => {
                this.log.error('HAP discovery error', e)
            })
    }

The aggregate function in metrics.ts would then need to pass the previous service to the extractMetrics function (also in metrics.ts), and extractMetrics would need to create count metrics, e.g. for all On characteristics, by comparing them to the previous service state. By convention, if the metric name ends with _total it is published as a counter (see MetricsRenderer.render in prometheus.ts). So yes, it’s a bit of work but certainly doable.
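To illustrate the naming convention, here is a small sketch of a renderer that picks the Prometheus type from the metric name. It is not the exporter's MetricsRenderer; `SimpleMetric`, `metricType`, `renderMetric`, and the metric name used below are assumptions for illustration. The output follows the Prometheus text exposition format, where a `# TYPE` line precedes the sample.

```typescript
// Hypothetical minimal metric shape; the exporter's own metric type differs.
interface SimpleMetric {
    name: string
    value: number
    labels: Record<string, string>
}

// Names ending in _total are treated as counters, everything else as gauges.
function metricType(name: string): 'counter' | 'gauge' {
    return name.endsWith('_total') ? 'counter' : 'gauge'
}

// Label values must escape backslashes, double quotes, and newlines.
function escapeLabelValue(value: string): string {
    return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n')
}

function renderMetric(metric: SimpleMetric): string {
    const labels = Object.keys(metric.labels)
        .map((key) => `${key}="${escapeLabelValue(metric.labels[key])}"`)
        .join(',')
    const series = labels === '' ? metric.name : `${metric.name}{${labels}}`
    return `# TYPE ${metric.name} ${metricType(metric.name)}\n${series} ${metric.value}`
}

console.log(
    renderMetric({
        name: 'homebridge_motion_sensor_on_total', // hypothetical metric name
        value: 3,
        labels: { name: 'Hallway' },
    }),
)
// # TYPE homebridge_motion_sensor_on_total counter
// homebridge_motion_sensor_on_total{name="Hallway"} 3
```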

lstrojny commented 1 year ago

P.S.: a good way to drive the change is to modify the aggregator.test.ts

lstrojny commented 1 year ago

P.P.S.: the Homebridge Verified conventions dictate this:

> If the plugin needs to write files to disk (cache, keys, etc.), it must store them inside the Homebridge storage directory.

So this should be adhered to when persisting the previous state.

arionl commented 1 year ago

> To share a bit of background: the exporter queries the homebridge state every refresh_interval (default is 60) seconds and publishes metrics found. So the way in which we could count events is to detect if any change has happened since the last poll. That would mean that a pair of on/off events that happened inside of the 60 second timeframe would be missed. That could be an alright limitation though.

If HomeBridge isn't already natively counting the events, I think that's a blocker to doing anything more with a counter metric. The point is to get high accuracy on the number of events over time, so missed events would be detrimental. Most network device counter metrics work like this (e.g., routers and switches, which use 64-bit counters that increment for bytes sent/received). It sounds like the first step may be to review the HomeBridge code to see if adding event counters to contact/switch-like device classes would be difficult. I still believe that these counters could work as transient data (i.e., not persisted), because HomeBridge (and plugins) are intended to be long-running processes. A slightly more involved option would be to write the state of the counters at each clean stop/restart event.
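On the transient counter idea: Prometheus functions such as `increase()` and `rate()` treat a drop in a counter's value as a reset, which is why a counter that restarts from zero still yields usable results. The sketch below shows that idea in simplified form. `increaseWithResets` is a made-up helper, and it leaves out the extrapolation that Prometheus applies at the edges of the time window.

```typescript
// Sum the growth across consecutive samples of a counter. A drop in value is
// treated as a reset to zero, so the value after the reset counts in full.
function increaseWithResets(samples: number[]): number {
    let total = 0
    let previous: number | undefined
    for (const current of samples) {
        if (previous !== undefined) {
            total += current >= previous ? current - previous : current
        }
        previous = current
    }
    return total
}

// The exporter restarts between the second and third sample.
console.log(increaseWithResets([5, 8, 2, 4])) // 7
```

Events that happen while the exporter is down are still lost, and so are any increments between the last scrape before a restart and the restart itself.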

I really appreciate the reply and the pseudo code to review! My NodeJS/JavaScript skills are poor (I hack on Python once in a while, but that's about it). If you want to close this ticket out, that's fine. If either of us gets motivated to tackle this, we can always open a new ticket and this one can remain for legacy/discussion. Thanks again!