nathanielc / morgoth

Metric anomaly detection
http://docs.morgoth.io
Apache License 2.0
280 stars 31 forks

how to deal with "spikey" data #60

Open rreilly-edr opened 6 years ago

rreilly-edr commented 6 years ago

Hi, I have some data that is very spiky (I am sure there is a statistical term for this, maybe "not normal"): [image]

If I use the example from the README, I get non-stop alerts, like: [image]

I have tweaked the two parameters errorTolerance and minSupport, but I either get a lot of alerts or no alerts. Here is an example of my morgoth Kapacitor TICK script. I am collecting my metrics every 10 seconds, and I used a 15-minute window to make sure I am getting enough data.

dbrp "statsd"."autogen"

stream
    |from()
        .measurement('load_avg_five')
        .groupBy('host')
    |window()
        .period(15m)
        .every(1m)
    @morgoth()
        .field('value')
        .anomalousField('anomalous')
        .errorTolerance(0.01)
        .minSupport(0.05)
        .sigma(3.0)
    |alert()
        .message('{{ .Level }}: {{ .Name }}/{{ index .Tags "host" }} anomalous')
        .crit(lambda: "anomalous")
        .log('/tmp/malerts.log')
        .sensu()
        .slack()

I would like to get no alerts unless I put a lot of load on the system. Thanks! rob
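For context on what the two knobs being tweaked control, here is a rough sketch (not morgoth's actual code) of the documented lossy-counting idea behind minSupport: each window is reduced to a fingerprint, morgoth counts how often each fingerprint recurs, and a window is flagged anomalous when its fingerprint's support (fraction of all windows seen so far) falls below minSupport. The quantized-stats fingerprint below is a toy stand-in for morgoth's sigma fingerprinter, and errorTolerance (the lossy-counting error bound) is not modeled.

```python
from collections import Counter

def fingerprint(window, bucket=0.5):
    """Toy fingerprint: the window's (mean, spread), quantized into buckets
    so that similar-looking windows collapse to the same fingerprint."""
    mean = sum(window) / len(window)
    spread = max(window) - min(window)
    return (round(mean / bucket), round(spread / bucket))

def detect(windows, min_support=0.05):
    """Flag each window as anomalous if its fingerprint's support
    (occurrences / windows seen so far) is below min_support."""
    counts = Counter()
    flags = []
    for i, w in enumerate(windows, start=1):
        fp = fingerprint(w)
        counts[fp] += 1
        flags.append(counts[fp] / i < min_support)
    return flags

# A long run of flat load produces one common fingerprint; a spiky window
# produces a rare one and gets flagged.
windows = [[1.0, 1.1, 0.9]] * 30 + [[5.0, 9.0, 1.0]]
print(detect(windows)[-1])  # the spike window is anomalous
```

This also illustrates the tuning dilemma in the question: raising minSupport flags more windows (fingerprints need to recur more often to count as normal), while lowering it can suppress all alerts.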

nathanielc commented 6 years ago

Those spikes occur at very regular intervals. It's hard to tell the time period from the image. I would recommend figuring out the time between each spike and using a window that is a multiple of that time. For example, if the spikes are occurring every 10m, use a 10m or 20m window; if they are occurring every 15m, use a 15m or 30m window, etc.

But using a window that is not a multiple will mean that the spike is sometimes in the window, sometimes not, and sometimes only partly in the window. As a result this adds artifacts to the data that are not really there, since there are an infinite number of ways for the spike to be partially in the window. This is called aliasing, if you want to learn more.

rreilly-edr commented 6 years ago

@nathanielc Yes, sorry for the lack of zoom on that image; here is another example: [image]. It looks like about every two hours, so I will try going to 150 minutes. I have a small set of servers now, so I will test with a stream task, but I will need to convert to batch. I will try it and report back, thanks!

nathanielc commented 6 years ago

Sounds good. Also, if you are not aware, you can use Kapacitor's replay-live feature to replay historical data against either stream or batch tasks, so you don't have to wait 150m a bunch of times.

See `kapacitor replay-live query`.

But it does make sense to convert to a batch task with a window that large.
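For reference, a replay-live invocation for the batch case might look something like the following; the task name is a placeholder and the exact flags can vary by Kapacitor version, so check `kapacitor help replay-live`:

```shell
# Replay the last 150m of data against a batch task (task name is a
# placeholder for whatever the morgoth task is defined as).
kapacitor replay-live batch -task morgoth_load -past 150m

# Or replay the results of an ad-hoc query against the task.
kapacitor replay-live query -task morgoth_load \
  -query 'SELECT "value" FROM "statsd"."autogen"."load_avg_five" WHERE time > now() - 150m'
```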

rreilly-edr commented 6 years ago

@nathanielc I have converted my stream task to batch; does this look correct? I have not done any batch tasks yet.

dbrp "statsd"."autogen"

batch
    // A batch query node is chained with |query, not .query.
    // host is a tag, so it goes in groupBy rather than the SELECT;
    // GROUP BY time() would require an aggregate function, and morgoth
    // wants the raw points, so group by host only.
    |query('''
        SELECT "value"
        FROM "statsd"."autogen"."load_avg_five"
    ''')
        .period(150m)
        .every(1m)
        .groupBy('host')
    @morgoth()
        .field('value')
        .anomalousField('anomalous')
        .errorTolerance(0.01)
        .minSupport(0.05)
        .sigma(3.0)
    |alert()
        .message('{{ .Level }}: {{ .Name }}/{{ index .Tags "host" }} anomalous')
        .crit(lambda: "anomalous")
        .log('/tmp/malerts.log')
        .sensu()
        .slack()