nathanielc / morgoth

Metric anomaly detection
http://docs.morgoth.io
Apache License 2.0

Debugging Morgoth in Kapacitor #35

Closed: elvarb closed this issue 8 years ago

elvarb commented 8 years ago

Setting the -log-level debug argument in Kapacitor does not work and just gives an error. Also, if I set my alerts to log at the info level, nothing comes out. Is there some way to see what is going on internally in Morgoth?

var cpu = stream
    |from()
        .measurement('win_cpu')
        .groupBy('host','domain')
    |window()
        .period(5m)
        .every(5m)
        .align()
    |default()
        .field('% Processor Time', 0.0)
  @morgoth()
    .field('% Processor Time')
    .scoreField('anomalyScore')
    .sigma(3.5)

cpu
    |alert()
        .info(lambda: "anomalyScore" > 0.0)
        .warn(lambda: "anomalyScore" > 1.0)
        .crit(lambda: "anomalyScore" > 3.0)
        .log('/tmp/cpu_morgoth.log')
nathanielc commented 8 years ago

@elvarb Can you share your Kapacitor config for how you set up Morgoth?

What error are you getting?

elvarb commented 8 years ago
[udf]
# Configuration for UDFs (User Defined Functions)
[udf.functions]
    [udf.functions.morgoth]
      prog = "/opt/morgoth/morgoth"
      timeout = "10s"

This is the config.

I'm getting no errors at all.

When I run kapacitor show cpu_morgoth I get this, so something is happening at least:

DOT:
digraph cpu_morgoth {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0" ];
stream0 -> from1 [processed="3577"];

from1 [avg_exec_time_ns="87.5µs" ];
from1 -> window2 [processed="3577"];

window2 [avg_exec_time_ns="45.85µs" ];
window2 -> default3 [processed="168"];

default3 [avg_exec_time_ns="66.4µs" fields_defaulted="2796" tags_defaulted="0" ];
default3 -> morgoth4 [processed="168"];

morgoth4 [avg_exec_time_ns="156.498µs" ];
morgoth4 -> alert5 [processed="2"];

alert5 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
}
nathanielc commented 8 years ago

Add the -log-level as an arg to the morgoth command.

[udf]
# Configuration for UDFs (User Defined Functions)
[udf.functions]
    [udf.functions.morgoth]
      prog = "/opt/morgoth/morgoth"
      args = ["-log-level", "debug"]
      timeout = "10s"

Then the logs will show up with the Kapacitor logs.

elvarb commented 8 years ago

I screwed up the args before; I had the line like this: args = ["-log-level debug"]

Now I'm getting debug information in the log.

After letting Kapacitor and Morgoth run on a CPU metric for a long time, I got 22 info hits, where info was defined as .info(lambda: TRUE)

Even though thousands of events were sent to Morgoth, it only returned 22 events as suspicious. The anomaly score was 0.9 for those events. Is this normal, or should Morgoth be sending everything it finds down the chain?

stream0 [avg_exec_time_ns="0" ];
stream0 -> from1 [processed="350"];

from1 [avg_exec_time_ns="6.7µs" ];
from1 -> window2 [processed="350"];

window2 [avg_exec_time_ns="4.7µs" ];
window2 -> default3 [processed="6"];

default3 [avg_exec_time_ns="0" fields_defaulted="177" tags_defaulted="0" ];
default3 -> morgoth4 [processed="6"];

morgoth4 [avg_exec_time_ns="2.8µs" ];
morgoth4 -> alert5 [processed="0"];

alert5 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
elvarb commented 8 years ago

I'm monitoring multiple CPU metrics, and this is the debug information in the logs:

[cpu_morgoth:morgoth4] 2016/09/05 14:06:04 I!P 2016/09/05 14:06:04 D! *sigma.Sigma anomalous? false support: 0.500000
[cpu_morgoth:morgoth4] 2016/09/05 14:07:04 I!P 2016/09/05 14:07:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:07:04 I!P 2016/09/05 14:07:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:07:04 I!P 2016/09/05 14:07:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:08:04 I!P 2016/09/05 14:08:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:08:04 I!P 2016/09/05 14:08:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:12:04 I!P 2016/09/05 14:12:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:12:04 I!P 2016/09/05 14:12:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:12:04 I!P 2016/09/05 14:12:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:12:04 I!P 2016/09/05 14:12:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:12:04 I!P 2016/09/05 14:12:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:12:04 I!P 2016/09/05 14:12:04 D! *sigma.Sigma anomalous? false support: 0.666667
[cpu_morgoth:morgoth4] 2016/09/05 14:16:04 I!P 2016/09/05 14:16:04 D! *sigma.Sigma anomalous? false support: 1.000000
[cpu_morgoth:morgoth4] 2016/09/05 14:17:04 I!P 2016/09/05 14:17:04 D! *sigma.Sigma anomalous? false support: 1.000000

The timestamps there do not match the pattern of the metrics it's processing, so what triggers a debug log write?

What do the numbers mean at the end?

nathanielc commented 8 years ago

@elvarb Answers to your questions below...

Even thousands of events were sent to Morgoth it only returned 22 events as suspicious. The Anomaly score was 0.9 of those events. Is this normal or should Morgoth be sending everything it finds down the chain?

Morgoth only sends the anomalous events down the chain. You can use join to join the original data back up with the anomaly score if need be. As for only getting 22 events out of thousands: that is, initially at least, a good sign that Morgoth is not spewing false positives ;)
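A rough sketch of that join (untested; the variable names, aliases, and tolerance value here are illustrative, not from this thread) could look like:

```
// Sketch only: keep the raw stream and the Morgoth output as
// separate variables, then join them so every raw point carries
// the anomaly score alongside it.
var raw = stream
    |from()
        .measurement('win_cpu')

var scored = raw
    @morgoth()
        .field('% Processor Time')
        .scoreField('anomalyScore')
        .sigma(3.5)

raw
    |join(scored)
        .as('raw', 'anomaly')
        // allow small timestamp differences between the two streams
        .tolerance(1s)
```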

The timestamps there do not match the pattern of the metrics its processing, so what triggers a debug log write?

The timestamps on the logs are from the system time. Kapacitor can process points for any time period. How far off are the timestamps? If you are sending the data in real time then they should be close but may not match up exactly.

What do the numbers mean at the end?

Support comes from the Lossy Counting algorithm; it essentially says how frequently this kind of window has been seen before. It can be interpreted as a percentage and ranges from 0 to 1. A larger number means the window is more frequent, i.e. normal; a smaller number means it is less frequent and potentially anomalous. The default minimum support is 0.05: if a window has a support greater than 0.05 it is considered normal, otherwise it is anomalous. In other words, a minimum support of 0.05 means windows seen less than 5% of the time are anomalous.
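The threshold behavior can be illustrated with a small Python sketch. Note this uses exact counting for clarity, not the approximate Lossy Counting algorithm Morgoth actually implements, and the "fingerprint" strings are stand-ins for Morgoth's window fingerprints:

```python
from collections import Counter

def support(counts, total, fingerprint):
    """Fraction of all windows seen so far that match this fingerprint."""
    return counts[fingerprint] / total if total else 0.0

def is_anomalous(counts, total, fingerprint, min_support=0.05):
    """A window is anomalous when its support falls below min_support."""
    return support(counts, total, fingerprint) < min_support

# Stand-in fingerprints: nine "flat" windows and one "spike" window.
counts = Counter()
total = 0
for fp in ["flat"] * 9 + ["spike"]:
    counts[fp] += 1
    total += 1

# "flat" ends with support 0.9 (normal); "spike" ends with support 0.1,
# which is anomalous under a minimum support of 0.15 but not under 0.05.
```

This also shows why raising minSupport surfaces more events: a higher threshold flags every window whose pattern has not yet become common.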

The minimum support can be set via the minSupport property:

var cpu = stream
    |from()
        .measurement('win_cpu')
        .groupBy('host','domain')
    |window()
        .period(5m)
        .every(5m)
        .align()
    |default()
        .field('% Processor Time', 0.0)
  @morgoth()
    .field('% Processor Time')
    .scoreField('anomalyScore')
    .sigma(3.5)
    // set minimum support
    .minSupport(0.1)

cpu
    |alert()
        .info(lambda: "anomalyScore" > 0.0)
        .warn(lambda: "anomalyScore" > 1.0)
        .crit(lambda: "anomalyScore" > 3.0)
        .log('/tmp/cpu_morgoth.log')

Hope that helps.

elvarb commented 8 years ago

I understand this much better now!

The modified tick script looks like this

var cpu = stream
    |from()
        .measurement('win_cpu')
        .where(lambda: "% Processor Time" >= 0.0)
        .groupBy(time(1m), 'host', 'domain')
    |window()
        .period(5m)
        .every(1m)
    @morgoth()
        .field('% Processor Time')
        .scoreField('anomalyScore')
        .minSupport(0.1)
        .sigma(2.0)

cpu
    |alert()
        .info(lambda: TRUE)
        .warn(lambda: "anomalyScore" > 1.0)
        .crit(lambda: "anomalyScore" > 3.0)
        .log('/tmp/cpu_morgoth.log')

The major change I made was to add a where line so data is only selected when that metric value is present; the default setting I had before was throwing the detection way off. There is also a groupBy time setting because the metrics only come in every minute.

Setting minSupport meant that more events would be passed through Morgoth.

The stats also make more sense now:

stream0 [avg_exec_time_ns="0" ];
stream0 -> from1 [processed="4289"];

from1 [avg_exec_time_ns="73.125µs" ];
from1 -> window2 [processed="857"];

window2 [avg_exec_time_ns="31.212µs" ];
window2 -> morgoth3 [processed="351"];

morgoth3 [avg_exec_time_ns="278.18µs" ];
morgoth3 -> alert4 [processed="27"];

Next step for me is to write the anomaly score back to InfluxDB so it's easier for me to see the relationship between the metrics and the score. Thanks for your help.
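Something like this should do that last step with Kapacitor's influxDBOut node (untested sketch; the database, retention policy, and measurement names are placeholders to adjust for your setup):

```
// Sketch only: write the anomaly scores back to InfluxDB.
cpu
    |influxDBOut()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu_anomaly_scores')
```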