nerves-hub / nerves_hub_web

Manage firmware updates for Nerves devices
https://nerves-hub.org/
Apache License 2.0

Feature request: Logs for penalty box state changes #1478

Open rraub opened 2 months ago

rraub commented 2 months ago

Describe the feature

I want to improve the observability of devices making their way into and out of the penalty box. Our standard info-level logs are a simple way to achieve this. The volume should be low, so I have no capacity concerns with this recommendation.

The problem I'm trying to address is automated monitoring when we make changes that push firmware out to a large number of devices. These logs would let us generate metrics, so we could graph the number of devices put in the box over time. They would also provide useful debugging context to tools outside of Nerves Hub (assuming they can search the Nerves Hub logs), helping to highlight why a device is not receiving its expected firmware updates.

We currently have these types of connect/disconnect logs:

19:03:25.932 [info] pid=<0.1078404.0> mfa=NervesHub.DeviceReporter.handle_event/4 identifier=XXX event=nerves_hub.devices.disconnect ref_id=XX Device disconnected
17:45:44.247 [info] pid=<0.8824444.0> mfa=NervesHub.DeviceReporter.handle_event/4 identifier=XXX event=nerves_hub.devices.connect firmware_uuid=XXX Device connected

We could build on this model and introduce two additional events: nerves_hub.devices.penaltybox.in and nerves_hub.devices.penaltybox.out.

Additional context

Bonus points if we can include some reasoning in the logs (did someone manually move the device in or out, or was this an automatic action based on thresholds?).
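As a rough sketch of how such logs could feed a metric, the Python below counts penalty box events per hour from lines shaped like the connect/disconnect logs above. It is illustrative only: the two event names are the ones proposed in this issue and do not exist in Nerves Hub today, and the helper names `parse_line` and `count_penalty_events` are hypothetical. Any extra metadata, such as a field recording the reason, would land in the parsed dict like every other `key=value` pair.

```python
import re
from collections import Counter

# Matches key=value metadata pairs such as event=nerves_hub.devices.connect
PAIR = re.compile(r"(\w+)=(\S+)")

# Leading "HH:MM:SS.mmm [level] " prefix, as in the sample logs above
PREFIX = re.compile(r"(\d{2}):\d{2}:\d{2}\.\d+ \[\w+\] ")

# Event names proposed in this issue (assumption: not yet emitted by Nerves Hub)
PENALTY_EVENTS = {
    "nerves_hub.devices.penaltybox.in",
    "nerves_hub.devices.penaltybox.out",
}


def parse_line(line):
    """Return (hour, metadata dict) for one log line, or None if it has no timestamp prefix."""
    match = PREFIX.match(line)
    if match is None:
        return None
    metadata = dict(PAIR.findall(line[match.end():]))
    return match.group(1), metadata


def count_penalty_events(lines):
    """Count penalty box events, keyed by (hour, event name)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that are not in the expected format
        hour, metadata = parsed
        event = metadata.get("event")
        if event in PENALTY_EVENTS:
            counts[(hour, event)] += 1
    return counts
```

Bucketing by hour is an arbitrary choice for the sketch; a real pipeline would more likely let the log aggregator group on the `event` field directly.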

joshk commented 1 month ago

I'm sorry for my delayed reply. I've been pondering this a bit and had an idea.

I agree we could log more, and that is a quick win. I also think we should add some more telemetry, which is another quick win.

But the bigger idea I had was adding a 'key' to the Audit Logs table, which could allow us to show metrics based on audit log events across a product.

I need to play this out more. Essentially, I want to surface more of this data in the UI so you can see spikes quickly, without having to hunt for the info. I'd also like to see some alerting, most likely to Slack, so you can be warned of these issues as they start to appear.