pennsignals / legacy-system-services


Create alarms #7

Closed darrylmendillo closed 4 years ago

darrylmendillo commented 4 years ago

Create alerts for when event-driven and batch-driven applications stop creating stdout logs.

Sev 1 alerts driven by application status

Event Driven:

Batch Driven

Use Alertmanager to create alerts for:

darrylmendillo commented 4 years ago

Create Sev 1 alarms

Create recovered (back online status)

Mdraugelis commented 4 years ago

Set up two thresholds for the Signals and VentCue systems by monitoring the primary Mongo collections: flowsheet_metrics, location_metrics, cue_metrics, and mar_metrics.

  1. (Sev 1) Alert when Metrics stays below min_threshold for 15 minutes. min_threshold (default 1 essential)
  2. (Recovery) Alert when Metrics goes from below the threshold to above the threshold for 15 minutes.
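
The two thresholds above can be read as a small state machine. Here is a minimal sketch, assuming one metrics count is sampled per minute. The class name `ThresholdMonitor`, the `observe` method, and the returned strings are hypothetical and not taken from the Signals or VentCue code.

```python
from typing import Optional


class ThresholdMonitor:
    """Track Sev 1 and recovery transitions for one metrics collection.

    Hypothetical helper: assumes observe() is called once per minute.
    """

    def __init__(self, min_threshold: int = 1, hold_minutes: int = 15) -> None:
        self.min_threshold = min_threshold
        self.hold_minutes = hold_minutes
        self.alerting = False  # True while a Sev 1 alert is active
        self.streak = 0        # consecutive minutes contradicting the current state

    def observe(self, count: int) -> Optional[str]:
        """Return "sev1", "recovered", or None for this one-minute sample."""
        below = count < self.min_threshold
        # A sample contradicts the state when it is below the threshold while
        # healthy, or at/above the threshold while alerting.
        if below != self.alerting:
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.hold_minutes:
            self.alerting = not self.alerting
            self.streak = 0
            return "sev1" if self.alerting else "recovered"
        return None
```

Requiring the same 15-minute hold in both directions keeps a single late event from flapping the alert between Sev 1 and recovered.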

Potential approach: set up Grafana to monitor the number of events written to the following collections:

mon_db.collection_names()
'flowsheet_metrics', 'location_metrics', 'cue_metrics', 'mar_metrics'
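
Counting recent writes per collection could look roughly like the sketch below. It assumes `db` behaves like a pymongo `Database` (indexing by collection name, then `count_documents`), and that each document carries a timestamp field. The field name `created_at` and the function name `recent_counts` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

COLLECTIONS = ("flowsheet_metrics", "location_metrics", "cue_metrics", "mar_metrics")


def recent_counts(db: Any, minutes: int = 15, time_field: str = "created_at",
                  now: Optional[datetime] = None) -> Dict[str, int]:
    """Count documents written to each collection in the last `minutes`.

    `db` is expected to behave like a pymongo Database; `time_field` is an
    assumed name for the document timestamp, not one confirmed by the thread.
    """
    now = now or datetime.now(timezone.utc)
    since = now - timedelta(minutes=minutes)
    query = {time_field: {"$gte": since}}
    return {name: db[name].count_documents(query) for name in COLLECTIONS}
```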

Example: image

darrylmendillo commented 4 years ago

Grafana does not natively support MongoDB as a data source.

Solutions:

Mdraugelis commented 4 years ago

Got it. Based on the conversation this morning, I'll remove the notes on Mongo and update to "logs".

darrylmendillo commented 4 years ago

Created two sets of criteria for the above applications. Each set of criteria is evaluated every minute.

  1. Event Driven:

    # data_rate = events/15min
    # 1 event / 15min = 0.0011 events/sec
    alert_rate = 0.001

    # execute every minute
    data_rate = len(samples_total_15m) / (60 * 15)

    if (max(data_rate) == None) or min(data_rate) < alert_rate:
        pending = true
        pending_time += 1
    else:
        pending = false
        pending_time = 0

    # has been pending for 15 minutes
    if pending and pending_time >= 15:
        fire_alert()


  2. Batch:

    # data_rate = events/24hr
    # 1 event / 24hr = 0.000012 events/sec
    alert_rate = 0.00001

    # execute every minute
    data_rate = len(samples_total_24hr) / (60 * 60 * 24)

    if (max(data_rate) == None) or min(data_rate) < alert_rate:
        pending = true
        pending_time += 1
    else:
        pending = false
        pending_time = 0

    # has been pending for 120 minutes
    if pending and pending_time >= 120:
        fire_alert()
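
Both sets of criteria share one shape: a rate over a sliding window, compared against `alert_rate`, with a pending counter that must reach a limit before the alert fires. The sketch below is one runnable reading of that pseudocode, not the deployed rule. The class `RateAlert` and its methods are hypothetical, timestamps are assumed to be seconds, and an empty window is treated as a rate of zero, standing in for the `== None` case.

```python
from collections import deque
from typing import Deque


class RateAlert:
    """Pending-counter rule; evaluate() is assumed to run once per minute."""

    def __init__(self, window_s: int, alert_rate: float, pending_minutes: int) -> None:
        self.window_s = window_s
        self.alert_rate = alert_rate
        self.pending_minutes = pending_minutes
        self.events: Deque[float] = deque()  # event timestamps, in seconds
        self.pending_time = 0

    def record(self, ts: float) -> None:
        """Register one event (for example, one stdout log line) at time ts."""
        self.events.append(ts)

    def evaluate(self, now: float) -> bool:
        """Return True when the alert should fire at time `now`."""
        # Drop events that have left the sliding window.
        while self.events and self.events[0] <= now - self.window_s:
            self.events.popleft()
        data_rate = len(self.events) / self.window_s
        if data_rate < self.alert_rate:
            self.pending_time += 1
        else:
            self.pending_time = 0
        return self.pending_time >= self.pending_minutes


# The two configurations from the thread.
event_driven = RateAlert(window_s=60 * 15, alert_rate=0.001, pending_minutes=15)
batch = RateAlert(window_s=60 * 60 * 24, alert_rate=0.00001, pending_minutes=120)
```

Because each `alert_rate` sits just below the rate of one event per window, a single event inside the window is enough to reset the pending counter.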