influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Bug on outer join data (caching issue?) #1754

Open dkuk opened 6 years ago

dkuk commented 6 years ago

I have irregular data points: an error log with a metric like this:

error_log,host=cache-base,severity=warning error="illegal block size" <timestamp>

Because the error data is an irregular sequence, I need something like a base signal to bring the alert back to the OK state.

The documentation for the join node says:

fill('null') null - fill missing points with null, full outer

So I have a TICKscript that joins a base signal (the system measurement) with the irregular data (the error_log measurement).

Here is a simplified sample:

dbrp "telegraf"."autogen"

var db = 'telegraf'
var rp = 'autogen'
var idVar   = '{{ .TaskName }}:{{ index .Tags "host" }}'
var message = '{{ .Level }}: {{ index .Tags "host" }}. Too many warnings, count: {{ index .Fields "num_errors"}}.'
var period = 60m
var repeat = 1m
var critLevel = 100

var cache_servers = batch
     |query('SELECT last("load1") as "load_level" FROM "telegraf"."autogen"."system" WHERE "host" =~ /cache/')
         .period(period)
         .every(repeat)
         .groupBy('host')
         .align()

var errors = batch
     |query('SELECT count("error") as "num_errors" FROM "telegraf"."autogen"."error_log" WHERE "severity" = \'warning\' AND "host" =~ /cache/')
         .period(period)
         .every(repeat)
         .groupBy('host')
         .align()
         .fill('none')

 var data = cache_servers
    |join(errors)
        .as('servers','errors')
        .tolerance(1m)
        .fill('null')
    |default()
         .field('errors.num_errors', 0)
    |eval(lambda: "errors.num_errors")
        .as('num_errors')
    |alert()
        .crit(lambda: "num_errors" > critLevel)
        .id(idVar)
        .message(message)
        .topic('warnings-state')
        .log('/tmp/kapacitor_alerts.log')

The alert enters the critical state correctly, but it never returns to the OK state until the Kapacitor process is stopped. When no data arrives from the errors batch, the join never completes! This issue looks similar to #1704.

This outer join behaviour is really discouraging! Going by the documentation, an outer join should probably not wait for data from the right side of the join beyond tolerance(period) when the fill('null') property is used.
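To make the expected behaviour concrete, here is a minimal Python sketch of a full outer join with a time tolerance and null fill. It is only a simulation of the semantics requested above, not Kapacitor's implementation; the function name `outer_join` and the sample values are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, float]  # (timestamp in seconds, value)
Row = Tuple[int, Optional[float], Optional[float]]


def outer_join(left: List[Point], right: List[Point], tolerance: int) -> List[Row]:
    """Pair each left point with an unused right point whose timestamp is
    within `tolerance` seconds; fill the missing side with None otherwise."""
    unused = list(right)
    rows: List[Row] = []
    for t_left, v_left in left:
        match = next((p for p in unused if abs(p[0] - t_left) <= tolerance), None)
        if match is not None:
            unused.remove(match)
            rows.append((t_left, v_left, match[1]))
        else:
            # No partner within tolerance: emit the point anyway, null-filled.
            rows.append((t_left, v_left, None))
    # Right points that never found a partner are also emitted (full outer).
    rows.extend((t, None, v) for t, v in unused)
    return sorted(rows, key=lambda r: r[0])


# Hypothetical sample: a steady base signal and a single burst of errors.
servers = [(0, 0.5), (60, 0.7), (120, 0.6)]
errors = [(0, 150.0)]
rows = outer_join(servers, errors, tolerance=30)

# Equivalent of default().field('errors.num_errors', 0) followed by the crit check.
crit_level = 100
levels = ["CRITICAL" if (n or 0) > crit_level else "OK" for _, _, n in rows]
print(levels)  # ['CRITICAL', 'OK', 'OK']
```

With these semantics, the base signal alone is enough to produce null-filled rows, so the alert can return to OK once the errors stop.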

I have a clumsy workaround for this simplified case, using a reverse deadman switch based on stats with a derivative, but it is almost impossible to apply this workaround to other cases where I am trying to observe complex processes across a couple of services.

helotpl commented 5 years ago

I have the same problem. Please fix it, or add an option to the join node that lets us specify a maximum buffering time for each point.

helotpl commented 5 years ago

Is there any way to add an option that pushes buffered points out if they are still in the buffer after a specified time? For example:

   |join(errors)
        .as('servers','errors')
        .tolerance(1m)
        .fill('null')
        .maxWait(10m)
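The idea behind the proposed option can be sketched in Python as follows. Note that `maxWait` is only the option suggested in this comment and does not exist in Kapacitor; the class `JoinBuffer` and its methods are hypothetical names, and the sketch simplifies things by dropping right-side points that find no partner.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, float, Optional[float]]


class JoinBuffer:
    """Toy join buffer: left points wait for a right partner, but only
    for `max_wait` seconds before being emitted null-filled."""

    def __init__(self, tolerance: int, max_wait: int) -> None:
        self.tolerance = tolerance
        self.max_wait = max_wait
        self.pending: List[Tuple[int, float]] = []  # buffered left points

    def add_left(self, t: int, value: float) -> None:
        self.pending.append((t, value))

    def add_right(self, t: int, value: float) -> Optional[Row]:
        """Pair with the first buffered left point within tolerance."""
        for point in self.pending:
            if abs(point[0] - t) <= self.tolerance:
                self.pending.remove(point)
                return (point[0], point[1], value)
        return None  # simplification: unmatched right points are dropped

    def flush(self, now: int) -> List[Row]:
        """Emit, null-filled, every left point buffered longer than max_wait."""
        expired = [p for p in self.pending if now - p[0] > self.max_wait]
        self.pending = [p for p in self.pending if now - p[0] <= self.max_wait]
        return [(t, v, None) for t, v in expired]


buf = JoinBuffer(tolerance=60, max_wait=600)  # 1m tolerance, 10m max wait
buf.add_left(0, 0.5)
buf.add_left(60, 0.7)
joined = buf.add_right(30, 150.0)  # pairs with the point at t=0
early = buf.flush(now=300)         # nothing has waited 10 minutes yet
late = buf.flush(now=700)          # the point at t=60 is pushed out
```

The key property is that the buffer is bounded by time: after `flush`, no point older than `max_wait` remains in memory.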

I can't think of another way around this problem. I'm trying to join two SNMP tables:

name: pre-adva-txrx
time                 index     frRcv2FacilityCurrent15minBytes frTrmt1FacilityCurrent15minBytes
----                 -----     ------------------------------- --------------------------------
2019-04-09T18:37:30Z 268504833 30650147119                     922643842
2019-04-09T18:37:30Z 268505090 0                               0
2019-04-09T18:37:30Z 268505089 27204600390                     1439299142
2019-04-09T18:37:30Z 268504834 0                               0

name: pre-adva-iftab
time                 index     ifDescr
----                 -----     -------
2019-04-09T18:37:30Z 268504833 CH-1-15-C1
2019-04-09T18:37:30Z 268504834 CH-1-15-C2
2019-04-09T18:37:30Z 268504898 CH-1-15-NE
2019-04-09T18:37:30Z 268504899 CH-1-15-NW
2019-04-09T18:37:30Z 268505089 CH-1-16-C1
2019-04-09T18:37:30Z 268505090 CH-1-16-C2
2019-04-09T18:37:30Z 268505154 CH-1-16-NE
2019-04-09T18:37:30Z 268505155 CH-1-16-NW
2019-04-09T18:37:30Z 570499329 LINK-1-A-SER
2019-04-09T18:37:30Z 251728321 OM-1-17-1
2019-04-09T18:37:30Z 251728322 OM-1-17-2
2019-04-09T18:37:30Z 251728577 OM-1-18-1
2019-04-09T18:37:30Z 251728578 OM-1-18-2
2019-04-09T18:37:30Z 553722113 SC-1-A-C1
2019-04-09T18:37:30Z 553722114 SC-1-A-C2
2019-04-09T18:37:30Z 637599745 TIFI-1-FCU-1
2019-04-09T18:37:30Z 637599754 TIFI-1-FCU-10
2019-04-09T18:37:30Z 637599755 TIFI-1-FCU-11
2019-04-09T18:37:30Z 637599756 TIFI-1-FCU-12
2019-04-09T18:37:30Z 637599757 TIFI-1-FCU-13
2019-04-09T18:37:30Z 637599758 TIFI-1-FCU-14
2019-04-09T18:37:30Z 637599759 TIFI-1-FCU-15
2019-04-09T18:37:30Z 637599760 TIFI-1-FCU-16
2019-04-09T18:37:30Z 637599746 TIFI-1-FCU-2
2019-04-09T18:37:30Z 637599747 TIFI-1-FCU-3
2019-04-09T18:37:30Z 637599748 TIFI-1-FCU-4
2019-04-09T18:37:30Z 637599749 TIFI-1-FCU-5
2019-04-09T18:37:30Z 637599750 TIFI-1-FCU-6
2019-04-09T18:37:30Z 637599751 TIFI-1-FCU-7
2019-04-09T18:37:30Z 637599752 TIFI-1-FCU-8
2019-04-09T18:37:30Z 637599753 TIFI-1-FCU-9
2019-04-09T18:37:30Z 654376961 TIFO-1-FCU-1
2019-04-09T18:37:30Z 654376962 TIFO-1-FCU-2
2019-04-09T18:37:30Z 654376963 TIFO-1-FCU-3
2019-04-09T18:37:30Z 654376964 TIFO-1-FCU-4
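As a rough check on the two tables above, this Python sketch compares their index values. It assumes the join is keyed on index, which is my reading of the setup rather than something stated explicitly; the variable names are made up for the example.

```python
# Index values copied from the pre-adva-txrx table above.
txrx_index = {268504833, 268505090, 268505089, 268504834}

# Index values copied from the pre-adva-iftab table above.
iftab_index = {
    268504833, 268504834, 268504898, 268504899,
    268505089, 268505090, 268505154, 268505155,
    570499329,
    251728321, 251728322, 251728577, 251728578,
    553722113, 553722114,
    637599745, 637599754, 637599755, 637599756,
    637599757, 637599758, 637599759, 637599760,
    637599746, 637599747, 637599748, 637599749,
    637599750, 637599751, 637599752, 637599753,
    654376961, 654376962, 654376963, 654376964,
}

matched = txrx_index & iftab_index      # present in both tables
only_iftab = iftab_index - txrx_index   # no partner in pre-adva-txrx
only_txrx = txrx_index - iftab_index    # no partner in pre-adva-iftab

print(len(matched), len(only_iftab), len(only_txrx))  # 4 31 0
```

Under that assumption, 31 of the 35 pre-adva-iftab rows in each sample have no partner, so if unmatched points are buffered indefinitely as described in this issue, they would accumulate with every collection interval.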

The results of the join just fill up my memory: [screenshot: Screenshot 2019-04-9 at 20 31 53]

Would you be willing to add such an option if someone prepares a patch for it?