omgnetwork / elixir-omg

OMG-Network repository of Watcher and Watcher Info
https://omg.network
Apache License 2.0

Maintain datadog monitors via code #1673

Open boolafish opened 4 years ago

boolafish commented 4 years ago

Why

We have more and more alarms/metrics/monitors in Datadog now, and maintaining them is currently a messy manual process. There are also two possible ways to trigger an alarm: directly from application code, or via a Datadog monitor.

Some alarms are triggered directly in application code. It would be great if all alarms could live in Datadog and the application only needed to emit metrics. Currently, in most cases, we emit both the metric and the alarm from application code.

But even without that simplification, using code as infrastructure to maintain our currently messy monitors might already be beneficial enough.

Note

One potential tool is Terraform, which appears to support Datadog as well: https://www.terraform.io/docs/providers/datadog/r/monitor.html
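As a rough sketch of what that could look like, here is a hypothetical `datadog_monitor` resource. The monitor name, metric, query, message, and tags are all placeholders rather than actual elixir-omg metrics, and older versions of the Datadog provider spell the `monitor_thresholds` block as `thresholds`:

```hcl
# Sketch: a Datadog monitor kept in version control via Terraform.
# Metric name and thresholds are illustrative placeholders.
resource "datadog_monitor" "watcher_boot_in_progress" {
  name    = "[Watcher] boot_in_progress raised"
  type    = "metric alert"
  message = "Watcher reported boot_in_progress. Notify the on-call channel."

  # Alert whenever the counter is nonzero in the window.
  query = "sum(last_5m):sum:watcher.alarm.boot_in_progress{env:production}.as_count() > 0"

  monitor_thresholds {
    critical = 0
  }

  notify_no_data = false
  tags           = ["service:watcher", "managed-by:terraform"]
}
```

With this in place, adding or changing a monitor goes through code review like any other change, and the full list of monitors is visible in one repository.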

InoMurko commented 4 years ago

Just a note: you're using the word alarm, which has two meanings here:

1. An application alarm, which allows the application to be reactive.

2. A metric alarm.

boolafish commented 4 years ago

My intention is [alarm_word_placeholder] that would notify an engineer to react. I think Datadog uses the word monitor; I'm just used to using the word alarm from before 😅

unnawut commented 4 years ago

Just to expand on Ino's comment a bit. The in-app alarms right now are mostly the ones that signal the service is not in good shape to accept new transactions, e.g. :boot_in_progress, :ethereum_connection_error, :ethereum_stalled_sync, :invalid_fee_source, :main_supervisor_halted. When these are raised, new transactions are rejected immediately, so I think they'll have to remain in the application? The rest, the ones that need a human's help, are already in Datadog.

The only one that seems out of place is :system_memory_too_high, which is being removed in https://github.com/omgnetwork/elixir-omg/pull/1678#issuecomment-667961902

boolafish commented 4 years ago

If we want to simplify a bit and standardize everything, we can send those :boot_in_progress-style events to Datadog as metrics as well. In the app, the original logic to react to those events stays, but instead of sending the alarm directly, the app sends a metric instead. That way we'd have a maintained list of monitors/alarms in a single place, so we can see what we have.

But... it might not bring much benefit, since we already have that code in place and it's working fine. It's just somewhat burdensome to see what all of our monitors are.
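A minimal sketch of that approach, assuming a `:telemetry`-based pipeline; the module, event names, and handler wiring below are illustrative, not the actual elixir-omg code:

```elixir
defmodule Watcher.AlarmMetrics do
  @moduledoc """
  Illustrative sketch: keep the in-app reaction to an alarm, but emit a
  counter metric instead of triggering the Datadog alarm directly. The
  alerting rule itself then lives in a Datadog monitor (managed as code).
  """

  def handle_alarm(alarm_id) when alarm_id in [:boot_in_progress, :ethereum_stalled_sync] do
    # The local reaction (e.g. rejecting new transactions) stays in the app.
    react_locally(alarm_id)

    # Bump a counter; a Datadog monitor with a "> 0" threshold does the alerting.
    :telemetry.execute([:watcher, :alarm, :raised], %{count: 1}, %{alarm_id: alarm_id})
    :ok
  end

  # Placeholder for the existing in-app reaction logic.
  defp react_locally(_alarm_id), do: :ok
end
```

A telemetry handler (or the existing StatsD reporter) would subscribe to `[:watcher, :alarm, :raised]` and forward each event to Datadog as an increment.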

unnawut commented 4 years ago

Ah I see. I think we can send an increment metric for each of the events above; then we can start to see their frequency patterns. The alarms would also be treated more properly by setting an alarm threshold of > 0 on each metric. Right now they're sort of reported as discontinuous events.
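For illustration, such a monitor query would alert on any nonzero count in a window; the metric name and window here are assumptions, not actual elixir-omg metrics:

```
sum(last_5m):sum:watcher.alarm.boot_in_progress{*}.as_count() > 0
```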

But I concur that it's not high priority given other things we have in hand at the moment.

boolafish commented 4 years ago

lol, my personal feeling is that nobody will take this up at all hahaha, unless the gold team board gets cleaner 😅

But can we pick this up in new services, like chch v2, as the standard approach for new services?! @InoMurko @achiurizo