unbounded memory growth to OOM with relatively simple scripts on version 1.5.7

joshdurbin commented 3 years ago

Overview

I'm trying to leverage Kapacitor 1.5.7 on Linux/amd64 for context aware traffic alerting in our multi-tenant commerce system at our ingress points for traffic spikes, etc... Typically this means data collection on two (optionally three -- in this example the data is sent but not grouped) fields:

The IP of the requester
The ID of the store for which the request is destined
(optionally) The URI

We collect and report on the data in 1-2 minute windows and don't care about any data outside such window. That is, if the point in question is over 2 minutes old, it should be expired and expunged. Kapacitor is a standalone Influx product for this use case -- there is no InfluxDB instance for this data, no retention policies, etc... Data is transmitted to Kapacitor via the UDP listener.

The format of the message is the following:

combined,uri=/path/to/some/product,id=123456789,ip=127.0.0.1,role=ingresstype count=1

As shown in singled-out stream stats later detailed in this issue, the cardinality of each of the aforementioned fields is roughly:

uri - unknown, medium
id - 20-30k within a minute window, avg
ip - 40-60k within a minute window, avg
role - at most 3

The count parameter exists so that we can run a mathematical sum operation on the data in the pipe. However, this is redundant because each request generates its own message, its own point, always.

We found Kapacitor struggling with unbounded memory growth in our Production systems, something we did not observe in other (non live traffic) environments. Our initial response to these uncontrollable runaway memory situations was to examine and reduce the cardinality of sets, particularly group by operations on streams. We initially tried reporting on the IP address, the Store ID, and the URI. These are all relatively high cardinality fields, but putting them together in an ordered group by wasn't helping with efficient and unbounded use of memory. So, we paired things back to the following tick script where the uri is dropped from the equation:

dbrp "toptraffic"."autogen"

var streamCounts = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('combined')
    |barrier()
        .period(1m)
        .delete(TRUE)
    |window()
        .period(1m)
        .every(5s)
        .align()
    |sum('count')
        .as('totalCount')

streamCounts
    |alert()
        .flapping(0.25, 0.5)
        .history(21)
        .warn(lambda: "totalCount" > 17500)
        .crit(lambda: "totalCount" > 22500)
        .message('''Observed  {{ index .Fields "totalCount" }} requests to Production Store ID {{ index .Tags "id" }} for IP {{ index .Tags "ip" }} within the last minute.''')
        .noRecoveries()
        .stateChangesOnly(5m)
        .slack()
        .channel('#ops-noise')

streamCounts
    |alert()
        .flapping(0.25, 0.5)
        .history(21)
        .warn(lambda: "totalCount" > 17500)
        .crit(lambda: "totalCount" > 22500)
        .message('''Observed  {{ index .Fields "totalCount" }} requests to Production Store ID {{ index .Tags "id" }} for IP {{ index .Tags "ip" }} within the last minute.''')
        .stateChangesOnly(5m)
        .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')
        .log('/var/log/kapacitor/alerts.log')

The script is straight forward enough; we group on the stream by ip, then id, from the combined measurement. A barrier exists to delete data after one minute. These operations are assigned to a stream variable which is used in alerting to do different things (at the same threshold).

The dot graph and sample output of that tick script while running renders:

DOT:
digraph top_combined {
graph [throughput="14278.31 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="518885257"];

from1 [avg_exec_time_ns="43.931µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="518885257"];

barrier2 [avg_exec_time_ns="65.018µs" errors="0" working_cardinality="93843" ];
barrier2 -> window3 [processed="518865870"];

window3 [avg_exec_time_ns="107.889µs" errors="0" working_cardinality="93843" ];
window3 -> sum4 [processed="145863177"];

sum4 [avg_exec_time_ns="148.928µs" errors="0" working_cardinality="33327" ];
sum4 -> alert6 [processed="145863177"];
sum4 -> alert5 [processed="145863177"];

alert6 [alerts_inhibited="0" alerts_triggered="0" avg_exec_time_ns="53.506µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="33327" ];

alert5 [alerts_inhibited="0" alerts_triggered="0" avg_exec_time_ns="63.915µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="33327" ];
}

As previously mentioned, what we saw with this is that over time (pretty quickly) we ran out of memory. The following graph shows various tweaks to the aforementioned script changing things like the window and barrier periods didn't seem to make any difference to how fast the script/pipeline/Kapacitor consumed memory.

The various spikes in memory show me altering the tick script, removing the window, removing the barrier, changing the barrier from idle to period, changing the time of the barrier tick / window, etc... During these iterations I collected data. The data below is from engagement of the aforementioned tick script, with only changes to the window and barrier periods.

Heap dumps show the following for in use objects:

go tool pprof -inuse_objects --text kapacitord top_combined/heap\?debug=1
File: kapacitord
Type: inuse_objects
Showing nodes accounting for 255642703, 97.39% of 262504092 total
Dropped 98 nodes (cum <= 1312520)
      flat  flat%   sum%        cum   cum%
  47093992 17.94% 17.94%   47094003 17.94%  time.NewTicker
  35101126 13.37% 31.31%   35101126 13.37%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.Tags.Map
  34180645 13.02% 44.33%   34180645 13.02%  github.com/influxdata/kapacitor/edge.(*pointMessage).GroupInfo
  31006029 11.81% 56.14%   78231108 29.80%  github.com/influxdata/kapacitor.newPeriodicBarrier
  17670196  6.73% 62.88%   17670196  6.73%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).unmarshalBinary
  15859954  6.04% 68.92%   94091062 35.84%  github.com/influxdata/kapacitor.(*BarrierNode).newBarrier
  15840535  6.03% 74.95%   15840535  6.03%  github.com/influxdata/kapacitor.(*periodicBarrier).emitBarrier
  15281270  5.82% 80.77%   22687063  8.64%  github.com/influxdata/kapacitor/models.ToGroupID
  11141290  4.24% 85.02%   11141290  4.24%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Name
   7405793  2.82% 87.84%    7405793  2.82%  strings.(*Builder).WriteRune
   5028186  1.92% 89.75%    5028186  1.92%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/influxql.encodeTags
   3452307  1.32% 91.07%    8480493  3.23%  github.com/influxdata/kapacitor.convertFloatPoint
   1977093  0.75% 91.82%    3763058  1.43%  net.(*UDPConn).readFrom
   1835036   0.7% 92.52%    4403938  1.68%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePointsWithPrecision
   1818726  0.69% 93.21%    5581784  2.13%  github.com/influxdata/kapacitor/services/udp.(*Service).serve
   1785965  0.68% 93.89%    1785965  0.68%  syscall.anyToSockaddr
   1766316  0.67% 94.57%    2568902  0.98%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parsePoint
   1758629  0.67% 95.24%    1758629  0.67%  github.com/influxdata/kapacitor/edge.BatchPointFromPoint
   1729659  0.66% 95.90%    1942655  0.74%  github.com/influxdata/kapacitor/edge.NewPointMessage
   1682284  0.64% 96.54%    1682284  0.64%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parseTags
    769871  0.29% 96.83%    1687514  0.64%  github.com/influxdata/kapacitor/edge.(*statsEdge).incCollected
    766201  0.29% 97.12%    2120657  0.81%  github.com/influxdata/kapacitor.(*AlertNode).renderID
    163847 0.062% 97.19%    1429628  0.54%  github.com/influxdata/kapacitor.(*AlertNode).NewGroup
    158393  0.06% 97.25%   15998928  6.09%  github.com/influxdata/kapacitor.(*periodicBarrier).periodicEmitter
    103541 0.039% 97.28%    1862170  0.71%  github.com/influxdata/kapacitor.(*windowTimeBuffer).points
     98309 0.037% 97.32%    2690768  1.03%  github.com/influxdata/kapacitor.(*windowByTime).batch
     85587 0.033% 97.35%   97154573 37.01%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).getOrCreateGroup
     81923 0.031% 97.39%   94522520 36.01%  github.com/influxdata/kapacitor.(*BarrierNode).NewGroup
         0     0% 97.39%    3412833  1.30%  github.com/influxdata/kapacitor.(*AlertNode).runAlert
         0     0% 97.39%  126742493 48.28%  github.com/influxdata/kapacitor.(*BarrierNode).runBarrierEmitter
         0     0% 97.39%   23164149  8.82%  github.com/influxdata/kapacitor.(*FromNode).Point
         0     0% 97.39%   23619547  9.00%  github.com/influxdata/kapacitor.(*FromNode).runStream
         0     0% 97.39%   10632284  4.05%  github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQL
         0     0% 97.39%   67208031 25.60%  github.com/influxdata/kapacitor.(*TaskMaster).WritePoints
         0     0% 97.39%    4753986  1.81%  github.com/influxdata/kapacitor.(*WindowNode).runWindow
         0     0% 97.39%    1420015  0.54%  github.com/influxdata/kapacitor.(*alertState).Point
         0     0% 97.39%    8480493  3.23%  github.com/influxdata/kapacitor.(*floatPointAggregator).AggregatePoint
         0     0% 97.39%    8824603  3.36%  github.com/influxdata/kapacitor.(*influxqlGroup).BatchPoint
         0     0% 97.39%  169161143 64.44%  github.com/influxdata/kapacitor.(*node).start.func1
         0     0% 97.39%    2230930  0.85%  github.com/influxdata/kapacitor.(*windowByTime).Point
         0     0% 97.39%  169160555 64.44%  github.com/influxdata/kapacitor/edge.(*consumer).Consume
         0     0% 97.39%    8824603  3.36%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).BatchPoint
         0     0% 97.39%    1360800  0.52%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).EndBatch
         0     0% 97.39%   27898115 10.63%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).Point
         0     0% 97.39%    1687514  0.64%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).forward
         0     0% 97.39%   10632284  4.05%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).BufferedBatch
         0     0% 97.39%  145541008 55.44%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Consume
         0     0% 97.39%  134249262 51.14%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Point
         0     0% 97.39%   22129995  8.43%  github.com/influxdata/kapacitor/edge.(*pointMessage).SetDimensions
         0     0% 97.39%    1623485  0.62%  github.com/influxdata/kapacitor/edge.(*streamStatsEdge).Collect
         0     0% 97.39%    8824603  3.36%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).BatchPoint
         0     0% 97.39%   26815094 10.22%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).Point
         0     0% 97.39%   10632284  4.05%  github.com/influxdata/kapacitor/edge.receiveBufferedBatch
         0     0% 97.39%   71611969 27.28%  github.com/influxdata/kapacitor/services/udp.(*Service).processPackets
         0     0% 97.39%   17670196  6.73%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Fields
         0     0% 97.39%    1682284  0.64%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Tags
         0     0% 97.39%    4403938  1.68%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePoints
         0     0% 97.39%    1785965  0.68%  internal/poll.(*FD).ReadFrom
         0     0% 97.39%    3763058  1.43%  net.(*UDPConn).ReadFromUDP
         0     0% 97.39%    1785965  0.68%  net.(*netFD).readFrom
         0     0% 97.39%  262393459   100%  runtime.goexit
         0     0% 97.39%    1785965  0.68%  syscall.Recvfrom

and for in use space:

go tool pprof --text kapacitord top_combined/heap\?debug=1 
File: kapacitord
Type: inuse_space
Showing nodes accounting for 19557.32MB, 97.33% of 20093.60MB total
Dropped 99 nodes (cum <= 100.47MB)
      flat  flat%   sum%        cum   cum%
 5447.82MB 27.11% 27.11%  5447.82MB 27.11%  github.com/influxdata/kapacitor/edge.(*pointMessage).GroupInfo
 4027.07MB 20.04% 47.15%  7133.04MB 35.50%  github.com/influxdata/kapacitor.newPeriodicBarrier
 3100.24MB 15.43% 62.58%  3101.97MB 15.44%  time.NewTicker
 1270.70MB  6.32% 68.91%  1270.70MB  6.32%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.Tags.Map
  964.19MB  4.80% 73.71%   964.19MB  4.80%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).unmarshalBinary
  886.50MB  4.41% 78.12%  8272.54MB 41.17%  github.com/influxdata/kapacitor.(*BarrierNode).NewGroup
  485.13MB  2.41% 80.53%   546.14MB  2.72%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parsePoint
  485.01MB  2.41% 82.95%   485.01MB  2.41%  github.com/influxdata/kapacitor.(*periodicBarrier).emitBarrier
  460.51MB  2.29% 85.24%   686.52MB  3.42%  github.com/influxdata/kapacitor/models.ToGroupID
  305.03MB  1.52% 86.75%   602.55MB  3.00%  github.com/influxdata/kapacitor.convertFloatPoint
  297.52MB  1.48% 88.24%   297.52MB  1.48%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/influxql.encodeTags
     242MB  1.20% 89.44%  7375.04MB 36.70%  github.com/influxdata/kapacitor.(*BarrierNode).newBarrier
  237.53MB  1.18% 90.62%   242.53MB  1.21%  github.com/influxdata/kapacitor/edge.NewPointMessage
  226.01MB  1.12% 91.75%   226.01MB  1.12%  strings.(*Builder).WriteRune
  194.03MB  0.97% 92.71%   194.03MB  0.97%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parseTags
     170MB  0.85% 93.56%      170MB  0.85%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Name
  142.02MB  0.71% 94.27%   142.02MB  0.71%  github.com/influxdata/kapacitor/edge.(*pointMessage).ShallowCopy
  119.01MB  0.59% 94.86%   318.52MB  1.59%  github.com/influxdata/kapacitor/services/udp.(*Service).serve
  109.01MB  0.54% 95.40%   109.01MB  0.54%  syscall.anyToSockaddr
   97.69MB  0.49% 95.89%   239.72MB  1.19%  github.com/influxdata/kapacitor/edge.(*statsEdge).incCollected
   90.50MB  0.45% 96.34%   199.51MB  0.99%  net.(*UDPConn).readFrom
   56.50MB  0.28% 96.62%   106.51MB  0.53%  github.com/influxdata/kapacitor.(*AlertNode).renderID
   55.12MB  0.27% 96.89%  8495.18MB 42.28%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).getOrCreateGroup
   32.18MB  0.16% 97.05%   112.69MB  0.56%  github.com/influxdata/kapacitor.(*windowTimeBuffer).points
      28MB  0.14% 97.19%   574.14MB  2.86%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePointsWithPrecision
   14.50MB 0.072% 97.26%   499.52MB  2.49%  github.com/influxdata/kapacitor.(*periodicBarrier).periodicEmitter
    7.50MB 0.037% 97.30%   101.01MB   0.5%  github.com/influxdata/kapacitor.(*AlertNode).NewGroup
       6MB  0.03% 97.33%   152.19MB  0.76%  github.com/influxdata/kapacitor.(*windowByTime).batch
         0     0% 97.33%   271.99MB  1.35%  github.com/influxdata/kapacitor.(*AlertNode).runAlert
         0     0% 97.33% 13426.72MB 66.82%  github.com/influxdata/kapacitor.(*BarrierNode).runBarrierEmitter
         0     0% 97.33%   815.04MB  4.06%  github.com/influxdata/kapacitor.(*FromNode).Point
         0     0% 97.33%   884.96MB  4.40%  github.com/influxdata/kapacitor.(*FromNode).runStream
         0     0% 97.33%   840.02MB  4.18%  github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQL
         0     0% 97.33%  2820.44MB 14.04%  github.com/influxdata/kapacitor.(*TaskMaster).WritePoints
         0     0% 97.33%   411.47MB  2.05%  github.com/influxdata/kapacitor.(*WindowNode).runWindow
         0     0% 97.33%   602.55MB  3.00%  github.com/influxdata/kapacitor.(*floatPointAggregator).AggregatePoint
         0     0% 97.33%   649.56MB  3.23%  github.com/influxdata/kapacitor.(*influxqlGroup).BatchPoint
         0     0% 97.33% 15835.15MB 78.81%  github.com/influxdata/kapacitor.(*node).start.func1
         0     0% 97.33%   150.93MB  0.75%  github.com/influxdata/kapacitor.(*windowByTime).Point
         0     0% 97.33% 15829.73MB 78.78%  github.com/influxdata/kapacitor/edge.(*consumer).Consume
         0     0% 97.33%   649.56MB  3.23%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).BatchPoint
         0     0% 97.33%   153.94MB  0.77%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).EndBatch
         0     0% 97.33%  1192.77MB  5.94%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).Point
         0     0% 97.33%   239.72MB  1.19%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).forward
         0     0% 97.33%   840.02MB  4.18%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).BufferedBatch
         0     0% 97.33% 14944.77MB 74.38%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Consume
         0     0% 97.33% 14072.25MB 70.03%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Point
         0     0% 97.33%   673.02MB  3.35%  github.com/influxdata/kapacitor/edge.(*pointMessage).SetDimensions
         0     0% 97.33%   230.77MB  1.15%  github.com/influxdata/kapacitor/edge.(*streamStatsEdge).Collect
         0     0% 97.33%   649.56MB  3.23%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).BatchPoint
         0     0% 97.33%  1037.97MB  5.17%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).Point
         0     0% 97.33%   840.02MB  4.18%  github.com/influxdata/kapacitor/edge.receiveBufferedBatch
         0     0% 97.33%  3394.58MB 16.89%  github.com/influxdata/kapacitor/services/udp.(*Service).processPackets
         0     0% 97.33%   964.19MB  4.80%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Fields
         0     0% 97.33%   194.03MB  0.97%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Tags
         0     0% 97.33%   574.14MB  2.86%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePoints
         0     0% 97.33%   109.01MB  0.54%  internal/poll.(*FD).ReadFrom
         0     0% 97.33%   199.51MB  0.99%  net.(*UDPConn).ReadFromUDP
         0     0% 97.33%   109.01MB  0.54%  net.(*netFD).readFrom
         0     0% 97.33% 20051.80MB 99.79%  runtime.goexit
         0     0% 97.33%   109.01MB  0.54%  syscall.Recvfrom

A profile dump at roughly the same time shows:

go tool pprof --text kapacitord top_combined/profile      
File: kapacitord
Type: cpu
Time: Jan 25, 2021 at 7:32pm (PST)
Duration: 30.17s, Total samples = 43.72s (144.91%)
Showing nodes accounting for 34.25s, 78.34% of 43.72s total
Dropped 359 nodes (cum <= 0.22s)
      flat  flat%   sum%        cum   cum%
     3.20s  7.32%  7.32%      3.60s  8.23%  syscall.Syscall6
     2.96s  6.77% 14.09%      2.96s  6.77%  runtime.futex
     2.45s  5.60% 19.69%      2.45s  5.60%  runtime.epollwait
     1.87s  4.28% 23.97%      1.87s  4.28%  runtime.usleep
     1.54s  3.52% 27.49%      2.13s  4.87%  runtime.mapaccess2_faststr
     1.30s  2.97% 30.47%      6.20s 14.18%  runtime.mallocgc
     1.28s  2.93% 33.39%      1.28s  2.93%  runtime.nextFreeFast
     0.94s  2.15% 35.54%      1.18s  2.70%  runtime.heapBitsSetType
     0.94s  2.15% 37.69%      0.94s  2.15%  runtime.memclrNoHeapPointers
     0.88s  2.01% 39.71%      0.88s  2.01%  runtime.memmove
     0.86s  1.97% 41.67%      0.91s  2.08%  runtime.lock
     0.80s  1.83% 43.50%      3.64s  8.33%  runtime.selectgo
     0.69s  1.58% 45.08%      0.69s  1.58%  memeqbody
     0.64s  1.46% 46.55%      0.67s  1.53%  runtime.unlock
     0.60s  1.37% 47.92%      0.61s  1.40%  runtime.(*itabTableType).find
     0.50s  1.14% 49.06%      6.86s 15.69%  runtime.findrunnable
     0.46s  1.05% 50.11%      0.46s  1.05%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.scanLine
     0.45s  1.03% 51.14%      0.46s  1.05%  time.now
     0.42s  0.96% 52.10%      1.62s  3.71%  runtime.mapassign_faststr
     0.37s  0.85% 52.95%      0.37s  0.85%  aeshashbody
     0.33s  0.75% 53.71%      0.33s  0.75%  github.com/influxdata/kapacitor/edge.(*pointMessage).Fields
     0.33s  0.75% 54.46%      0.33s  0.75%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.scanTo
     0.32s  0.73% 55.19%      0.38s  0.87%  runtime.mapiternext
     0.29s  0.66% 55.86%      0.29s  0.66%  runtime.(*waitq).dequeue
     0.26s  0.59% 56.45%      3.93s  8.99%  runtime.newobject
     0.24s  0.55% 57.00%      3.44s  7.87%  github.com/influxdata/kapacitor/edge.(*streamStatsEdge).Collect
     0.24s  0.55% 57.55%      1.06s  2.42%  runtime.(*mcentral).cacheSpan
     0.24s  0.55% 58.10%      0.24s  0.55%  runtime.casgstatus
     0.24s  0.55% 58.65%      0.24s  0.55%  sync.(*RWMutex).RLock
     0.23s  0.53% 59.17%      0.23s  0.53%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.scanTagsValue
     0.23s  0.53% 59.70%      0.84s  1.92%  runtime.getitab
     0.23s  0.53% 60.22%      1.92s  4.39%  runtime.runqgrab
     0.22s   0.5% 60.73%      5.38s 12.31%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).Point
     0.21s  0.48% 61.21%      1.80s  4.12%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).forward
     0.21s  0.48% 61.69%      0.52s  1.19%  runtime.mapaccess1
     0.21s  0.48% 62.17%      2.73s  6.24%  runtime.netpoll
     0.20s  0.46% 62.63%      3.81s  8.71%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).Point
     0.20s  0.46% 63.08%      1.56s  3.57%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/influxql.encodeTags
     0.20s  0.46% 63.54%      0.27s  0.62%  sync.(*RWMutex).Unlock
     0.19s  0.43% 63.98%      0.27s  0.62%  runtime.mapaccess1_faststr
     0.18s  0.41% 64.39%      1.44s  3.29%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.scanKey
     0.18s  0.41% 64.80%      0.97s  2.22%  runtime.sellock
     0.15s  0.34% 65.14%         3s  6.86%  github.com/influxdata/kapacitor.convertFloatPoint
     0.15s  0.34% 65.48%      1.17s  2.68%  github.com/influxdata/kapacitor/edge.(*statsEdge).incCollected
     0.15s  0.34% 65.83%      0.77s  1.76%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parseTags
     0.14s  0.32% 66.15%      0.37s  0.85%  syscall.anyToSockaddr
     0.13s   0.3% 66.45%      0.72s  1.65%  github.com/influxdata/kapacitor/edge.(*pointMessage).GroupInfo
     0.13s   0.3% 66.74%      3.07s  7.02%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePointsWithPrecision
     0.13s   0.3% 67.04%      0.71s  1.62%  runtime.assertI2I2
     0.13s   0.3% 67.34%      0.92s  2.10%  runtime.slicebytetostring
     0.12s  0.27% 67.61%      0.54s  1.24%  github.com/influxdata/kapacitor/edge.NewPointMessage
     0.12s  0.27% 67.89%      1.54s  3.52%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.Tags.Map
     0.12s  0.27% 68.16%      0.42s  0.96%  runtime.makemap
     0.12s  0.27% 68.44%      0.40s  0.91%  runtime.mapiterinit
     0.12s  0.27% 68.71%      0.33s  0.75%  runtime.memhash
     0.11s  0.25% 68.96%      5.96s 13.63%  github.com/influxdata/kapacitor.(*TaskMaster).WritePoints
     0.11s  0.25% 69.21%      1.52s  3.48%  github.com/influxdata/kapacitor/edge.(*channelEdge).Emit
     0.10s  0.23% 69.44%      2.22s  5.08%  github.com/influxdata/kapacitor/edge.(*streamStatsEdge).Emit
     0.10s  0.23% 69.67%      0.33s  0.75%  github.com/influxdata/kapacitor/expvar.(*Map).Add
     0.10s  0.23% 69.90%      9.62s 22.00%  github.com/influxdata/kapacitor/services/udp.(*Service).processPackets
     0.10s  0.23% 70.13%      0.42s  0.96%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.scanTags
     0.09s  0.21% 70.33%      6.89s 15.76%  github.com/influxdata/kapacitor/services/udp.(*Service).serve
     0.09s  0.21% 70.54%      5.10s 11.67%  net.(*UDPConn).readFrom
     0.09s  0.21% 70.75%      7.64s 17.47%  runtime.schedule
     0.08s  0.18% 70.93%      1.28s  2.93%  github.com/influxdata/kapacitor.(*streamEdge).CollectPoint
     0.08s  0.18% 71.11%      6.38s 14.59%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Point
     0.08s  0.18% 71.29%      0.34s  0.78%  github.com/influxdata/kapacitor/edge.(*statsEdge).incEmitted
     0.08s  0.18% 71.48%      0.45s  1.03%  github.com/influxdata/kapacitor/models.ToGroupID
     0.08s  0.18% 71.66%      0.25s  0.57%  runtime.chanrecv
     0.08s  0.18% 71.84%      1.05s  2.40%  runtime.chansend
     0.08s  0.18% 72.03%      1.56s  3.57%  runtime.makeslice
     0.07s  0.16% 72.19%      1.13s  2.58%  github.com/influxdata/kapacitor.(*windowByTime).Point
     0.07s  0.16% 72.35%     14.28s 32.66%  github.com/influxdata/kapacitor/edge.(*consumer).Consume
     0.07s  0.16% 72.51%      0.24s  0.55%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Next
     0.07s  0.16% 72.67%      0.84s  1.92%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Tags
     0.07s  0.16% 72.83%      0.24s  0.55%  math/rand.(*Rand).Int63
     0.06s  0.14% 72.96%      0.70s  1.60%  github.com/influxdata/kapacitor.(*windowTimeBuffer).points
     0.06s  0.14% 73.10%      0.33s  0.75%  github.com/influxdata/kapacitor/timer.(*timer).Start
     0.06s  0.14% 73.24%      4.63s 10.59%  internal/poll.(*FD).ReadFrom
     0.06s  0.14% 73.38%      0.25s  0.57%  runtime.mapaccess2
     0.06s  0.14% 73.51%      0.95s  2.17%  runtime.notesleep
     0.05s  0.11% 73.63%      0.25s  0.57%  github.com/influxdata/kapacitor.(*AlertNode).serverInfo
     0.05s  0.11% 73.74%      0.55s  1.26%  github.com/influxdata/kapacitor.(*StreamNode).runSourceStream
     0.05s  0.11% 73.86%      1.29s  2.95%  github.com/influxdata/kapacitor.(*TaskMaster).forkPoint
     0.05s  0.11% 73.97%      0.88s  2.01%  github.com/influxdata/kapacitor.(*streamEdge).EmitPoint
     0.05s  0.11% 74.09%      2.12s  4.85%  github.com/influxdata/kapacitor/edge.(*channelEdge).Collect
     0.05s  0.11% 74.20%      5.15s 11.78%  net.(*UDPConn).ReadFromUDP
     0.05s  0.11% 74.31%      0.31s  0.71%  runtime.convI2I
     0.05s  0.11% 74.43%      0.60s  1.37%  runtime.resetspinning
     0.05s  0.11% 74.54%      1.97s  4.51%  runtime.runqsteal
     0.05s  0.11% 74.66%      1.52s  3.48%  runtime.send
     0.05s  0.11% 74.77%      0.37s  0.85%  runtime.strhash
     0.05s  0.11% 74.89%      0.24s  0.55%  runtime.typedmemmove
     0.05s  0.11% 75.00%      3.65s  8.35%  syscall.recvfrom
     0.05s  0.11% 75.11%      0.36s  0.82%  text/template.(*state).evalField
     0.04s 0.091% 75.21%      1.59s  3.64%  github.com/influxdata/kapacitor.(*alertState).Point
     0.04s 0.091% 75.30%      0.53s  1.21%  github.com/influxdata/kapacitor.EvalPredicate
     0.04s 0.091% 75.39%      0.45s  1.03%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).Barrier
     0.04s 0.091% 75.48%      1.42s  3.25%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).unmarshalBinary
     0.04s 0.091% 75.57%      2.16s  4.94%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parsePoint
     0.04s 0.091% 75.66%      4.67s 10.68%  net.(*netFD).readFrom
     0.04s 0.091% 75.75%      0.42s  0.96%  runtime.selunlock
     0.04s 0.091% 75.85%      0.36s  0.82%  runtime.sysmon
     0.04s 0.091% 75.94%      0.32s  0.73%  sort.Strings
     0.04s 0.091% 76.03%      4.09s  9.35%  syscall.Recvfrom
     0.04s 0.091% 76.12%      0.77s  1.76%  text/template.(*state).walk
     0.03s 0.069% 76.19%      0.47s  1.08%  github.com/influxdata/kapacitor.(*FromNode).Point
     0.03s 0.069% 76.26%      1.09s  2.49%  github.com/influxdata/kapacitor.(*windowByTime).batch
     0.03s 0.069% 76.33%      1.57s  3.59%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).getOrCreateGroup
     0.03s 0.069% 76.40%      0.59s  1.35%  github.com/influxdata/kapacitor/edge.BatchPointFromPoint
     0.03s 0.069% 76.46%      4.16s  9.52%  github.com/influxdata/kapacitor/edge.receiveBufferedBatch
     0.03s 0.069% 76.53%      1.09s  2.49%  runtime.(*mcache).refill
     0.03s 0.069% 76.60%      0.22s   0.5%  runtime.entersyscall
     0.03s 0.069% 76.67%      0.29s  0.66%  runtime.gentraceback
     0.03s 0.069% 76.74%      7.87s 18.00%  runtime.mcall
     0.03s 0.069% 76.81%      0.27s  0.62%  runtime.notetsleep_internal
     0.03s 0.069% 76.88%      1.88s  4.30%  runtime.startm
     0.03s 0.069% 76.94%         2s  4.57%  runtime.systemstack
     0.03s 0.069% 77.01%      0.26s  0.59%  strconv.ParseFloat
     0.03s 0.069% 77.08%      0.42s  0.96%  text/template.(*state).evalCommand
     0.03s 0.069% 77.15%      0.39s  0.89%  text/template.(*state).evalFieldChain
     0.02s 0.046% 77.20%      0.56s  1.28%  github.com/influxdata/kapacitor.(*AlertNode).determineLevel
     0.02s 0.046% 77.24%      2.19s  5.01%  github.com/influxdata/kapacitor.(*TaskMaster).runForking
     0.02s 0.046% 77.29%      3.06s  7.00%  github.com/influxdata/kapacitor.(*floatPointAggregator).AggregatePoint
     0.02s 0.046% 77.33%      0.51s  1.17%  github.com/influxdata/kapacitor.(*periodicBarrier).emitBarrier
     0.02s 0.046% 77.38%      0.72s  1.65%  github.com/influxdata/kapacitor.(*periodicBarrier).periodicEmitter
     0.02s 0.046% 77.42%      3.36s  7.69%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).BatchPoint
     0.02s 0.046% 77.47%      4.16s  9.52%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).BufferedBatch
     0.02s 0.046% 77.52%      1.44s  3.29%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Fields
     0.02s 0.046% 77.56%      0.29s  0.66%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parseFloatBytes
     0.02s 0.046% 77.61%      0.26s  0.59%  math/rand.(*Rand).Float64
     0.02s 0.046% 77.65%      1.24s  2.84%  runtime.(*mcache).nextFree
     0.02s 0.046% 77.70%      0.28s  0.64%  runtime.(*mheap).alloc_m
     0.02s 0.046% 77.74%      1.01s  2.31%  runtime.chansend1
     0.02s 0.046% 77.79%      1.30s  2.97%  runtime.goready
     0.02s 0.046% 77.84%      7.83s 17.91%  runtime.park_m
     0.02s 0.046% 77.88%      1.01s  2.31%  runtime.stopm
     0.02s 0.046% 77.93%      0.85s  1.94%  text/template.(*Template).execute
     0.01s 0.023% 77.95%      0.54s  1.24%  github.com/influxdata/kapacitor.(*AlertNode).findFirstMatchLevel
     0.01s 0.023% 77.97%      0.22s   0.5%  github.com/influxdata/kapacitor/edge.(*batchStatsEdge).Collect
     0.01s 0.023% 78.00%      0.29s  0.66%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).DeleteGroup
     0.01s 0.023% 78.02%      0.27s  0.62%  github.com/influxdata/kapacitor/edge.(*pointMessage).SetDimensions
     0.01s 0.023% 78.04%      0.32s  0.73%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).Barrier
     0.01s 0.023% 78.06%      0.38s  0.87%  github.com/influxdata/kapacitor/edge.Forward
     0.01s 0.023% 78.09%      0.22s   0.5%  github.com/influxdata/kapacitor/tick/stateful.(*expression).EvalBool
     0.01s 0.023% 78.11%      3.14s  7.18%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePoints
     0.01s 0.023% 78.13%      0.43s  0.98%  runtime.(*mheap).alloc
     0.01s 0.023% 78.16%      0.24s  0.55%  runtime.(*mheap).allocSpanLocked
     0.01s 0.023% 78.18%      1.84s  4.21%  runtime.futexwakeup
     0.01s 0.023% 78.20%      1.28s  2.93%  runtime.goready.func1
     0.01s 0.023% 78.23%      0.23s  0.53%  runtime.makemap_small
     0.01s 0.023% 78.25%      1.82s  4.16%  runtime.notewakeup
     0.01s 0.023% 78.27%      1.27s  2.90%  runtime.ready
     0.01s 0.023% 78.29%      1.66s  3.80%  runtime.wakep
     0.01s 0.023% 78.32%      0.22s   0.5%  sort.Sort
     0.01s 0.023% 78.34%      0.30s  0.69%  text/template.(*state).printValue
         0     0% 78.34%      0.65s  1.49%  github.com/influxdata/kapacitor.(*AlertNode).NewGroup
         0     0% 78.34%      1.26s  2.88%  github.com/influxdata/kapacitor.(*AlertNode).renderID
         0     0% 78.34%      3.07s  7.02%  github.com/influxdata/kapacitor.(*AlertNode).runAlert
         0     0% 78.34%      0.39s  0.89%  github.com/influxdata/kapacitor.(*BarrierNode).NewGroup
         0     0% 78.34%      0.30s  0.69%  github.com/influxdata/kapacitor.(*BarrierNode).newBarrier
         0     0% 78.34%      2.41s  5.51%  github.com/influxdata/kapacitor.(*BarrierNode).runBarrierEmitter
         0     0% 78.34%      1.50s  3.43%  github.com/influxdata/kapacitor.(*FromNode).runStream
         0     0% 78.34%      4.33s  9.90%  github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQL
         0     0% 78.34%      2.19s  5.01%  github.com/influxdata/kapacitor.(*TaskMaster).stream.func1
         0     0% 78.34%      2.97s  6.79%  github.com/influxdata/kapacitor.(*WindowNode).runWindow
         0     0% 78.34%      3.29s  7.53%  github.com/influxdata/kapacitor.(*influxqlGroup).BatchPoint
         0     0% 78.34%      0.22s   0.5%  github.com/influxdata/kapacitor.(*influxqlGroup).realizeReduceContextFromFields
         0     0% 78.34%     14.83s 33.92%  github.com/influxdata/kapacitor.(*node).start.func1
         0     0% 78.34%      0.30s  0.69%  github.com/influxdata/kapacitor.(*windowByTime).Barrier
         0     0% 78.34%      0.29s  0.66%  github.com/influxdata/kapacitor.newPeriodicBarrier
         0     0% 78.34%      0.51s  1.17%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).EndBatch
         0     0% 78.34%      0.46s  1.05%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Barrier
         0     0% 78.34%     12.78s 29.23%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Consume
         0     0% 78.34%      3.33s  7.62%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).BatchPoint
         0     0% 78.34%      0.32s  0.73%  github.com/influxdata/kapacitor/edge.NewBeginBatchMessage
         0     0% 78.34%      1.71s  3.91%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/influxql.NewTags
         0     0% 78.34%      0.29s  0.66%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).FloatValue
         0     0% 78.34%      0.46s  1.05%  runtime.(*mcentral).grow
         0     0% 78.34%      0.28s  0.64%  runtime.(*mheap).alloc.func1
         0     0% 78.34%      0.22s   0.5%  runtime.chanrecv2
         0     0% 78.34%      0.28s  0.64%  runtime.copystack
         0     0% 78.34%      0.24s  0.55%  runtime.entersyscallblock
         0     0% 78.34%      0.24s  0.55%  runtime.entersyscallblock_handoff
         0     0% 78.34%      1.13s  2.58%  runtime.futexsleep
         0     0% 78.34%      0.24s  0.55%  runtime.handoffp
         0     0% 78.34%      0.34s  0.78%  runtime.mstart
         0     0% 78.34%      0.34s  0.78%  runtime.mstart1
         0     0% 78.34%      0.29s  0.66%  runtime.newstack
         0     0% 78.34%      0.48s  1.10%  runtime.notetsleepg
         0     0% 78.34%      0.66s  1.51%  runtime.timerproc
         0     0% 78.34%      0.85s  1.94%  text/template.(*Template).Execute
         0     0% 78.34%      0.39s  0.89%  text/template.(*state).evalFieldNode
         0     0% 78.34%      0.42s  0.96%  text/template.(*state).evalPipeline

Perplexed I decided to chop things up and create two tick scripts instead that monitor each of those metrics independently. The first, top_ips does no variable assignment in the tick script and things are piped together in a single flow. The second, top_stores has assignment and piping such that data streams to two alerts that do slightly different things with those triggers, like the aforementioned combined script.

Data to the measurement ips looks like:

ips,ip=127.0.0.1,role= ingresstype

Here's the show output for top_ips:

ID: top_ips
Error: 
Template: 
Type: stream
Status: enabled
Executing: true
Created: 22 Jan 21 22:25 UTC
Modified: 26 Jan 21 06:52 UTC
LastEnabled: 26 Jan 21 06:52 UTC
Databases Retention Policies: ["toptraffic"."autogen"]
TICKscript:
dbrp "toptraffic"."autogen"

stream
    |from()
        .groupBy('ip')
        .measurement('ips')
    |barrier()
        .period(1m)
        .delete(TRUE)
    |window()
        .period(1m)
        .every(5s)
        .align()
    |sum('count')
        .as('totalCount')
    |alert()
        .flapping(0.25, 0.5)
        .history(21)
        .warn(lambda: "totalCount" > 17500)
        .crit(lambda: "totalCount" > 22500)
        .message('''Observed  {{ index .Fields "totalCount" }} requests to Production IP {{ index .Tags "ip" }} within the last 1 minute.''')
        .stateChangesOnly(5m)
        .slack()
        .channel('#ops-noise')
        .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')
        .log('/var/log/kapacitor/alerts.log')

DOT:
digraph top_ips {
graph [throughput="18076.66 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="1891796886"];

from1 [avg_exec_time_ns="45.655µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="1891796886"];

barrier2 [avg_exec_time_ns="22.507µs" errors="0" working_cardinality="58218" ];
barrier2 -> window3 [processed="1891721166"];

window3 [avg_exec_time_ns="251.156µs" errors="0" working_cardinality="58218" ];
window3 -> sum4 [processed="376239068"];

sum4 [avg_exec_time_ns="101.993µs" errors="0" working_cardinality="34455" ];
sum4 -> alert5 [processed="376239068"];

alert5 [alerts_inhibited="0" alerts_triggered="835" avg_exec_time_ns="58.686µs" crits_triggered="101" errors="0" infos_triggered="0" oks_triggered="367" warns_triggered="367" working_cardinality="34455" ];
}

... and for top stores the data looks like:

stores,id=123456789,role=ingresstype

with a evaluated script like:

ID: top_stores
Error: 
Template: 
Type: stream
Status: enabled
Executing: true
Created: 22 Jan 21 22:30 UTC
Modified: 26 Jan 21 06:05 UTC
LastEnabled: 26 Jan 21 06:05 UTC
Databases Retention Policies: ["toptraffic"."autogen"]
TICKscript:
dbrp "toptraffic"."autogen"

var stores = stream
    |from()
        .groupBy('id')
        .measurement('stores')
    |barrier()
        .period(1m)
        .delete(TRUE)
    |window()
        .period(1m)
        .every(5s)
        .align()
    |sum('count')
        .as('totalCount')

stores
    |alert()
        .flapping(0.25, 0.5)
        .history(21)
        .warn(lambda: "totalCount" > 17500)
        .crit(lambda: "totalCount" > 22500)
        .message('''Observed  {{ index .Fields "totalCount" }} requests to Production Store ID {{ index .Tags "id" }} within the last minute.''')
        .noRecoveries()
        .stateChangesOnly(5m)
        .slack()
        .channel('#ops-noise')

stores
    |alert()
        .flapping(0.25, 0.5)
        .history(21)
        .warn(lambda: "totalCount" > 17500)
        .crit(lambda: "totalCount" > 22500)
        .message('''Observed  {{ index .Fields "totalCount" }} requests to Production Store ID {{ index .Tags "id" }} within the last minute.''')
        .stateChangesOnly(5m)
        .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')
        .log('/var/log/kapacitor/alerts.log')

DOT:
digraph top_stores {
graph [throughput="15742.66 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="1816122105"];

from1 [avg_exec_time_ns="25.922µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="1816122105"];

barrier2 [avg_exec_time_ns="16.675µs" errors="0" working_cardinality="19560" ];
barrier2 -> window3 [processed="1816013270"];

window3 [avg_exec_time_ns="164.84µs" errors="0" working_cardinality="19560" ];
window3 -> sum4 [processed="191193880"];

sum4 [avg_exec_time_ns="592.492µs" errors="0" working_cardinality="12185" ];
sum4 -> alert6 [processed="191193879"];
sum4 -> alert5 [processed="191193879"];

alert6 [alerts_inhibited="0" alerts_triggered="1375" avg_exec_time_ns="93.317µs" crits_triggered="206" errors="0" infos_triggered="0" oks_triggered="586" warns_triggered="583" working_cardinality="12185" ];

alert5 [alerts_inhibited="0" alerts_triggered="789" avg_exec_time_ns="229.573µs" crits_triggered="206" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="583" working_cardinality="12185" ];
}

Note the cardinality of these, at least at the time sampled, was ~12K for store IDs and ~34k for IPs. These on their own seem small potatoes, and, even in the combined script where a group by splits first by the IP, then the Store, shouldn't be too much data for a one or two minute window.

At first this seemed to be a more stable approach, memory didn't seem to grow as fast and I thought we'd level off. Unfortunately, as the graph shows below, we did not.

Heap dumps show the following for in use objects:

go tool pprof -inuse_objects --text kapacitord top_ip_and_store_id_last/heap\?debug=1 
File: kapacitord
Type: inuse_objects
Showing nodes accounting for 247823085, 97.68% of 253717036 total
Dropped 130 nodes (cum <= 1268585)
      flat  flat%   sum%        cum   cum%
  43042527 16.96% 16.96%   43042532 16.96%  time.NewTicker
  33785532 13.32% 30.28%   33785532 13.32%  github.com/influxdata/kapacitor/edge.(*pointMessage).GroupInfo
  29265618 11.53% 41.82%   72471995 28.56%  github.com/influxdata/kapacitor.newPeriodicBarrier
  28924029 11.40% 53.22%   28924029 11.40%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.Tags.Map
  26395161 10.40% 63.62%   26395161 10.40%  github.com/influxdata/kapacitor/models.ToGroupID
  14090455  5.55% 69.17%   86562450 34.12%  github.com/influxdata/kapacitor.(*BarrierNode).newBarrier
  13910441  5.48% 74.66%   13910441  5.48%  github.com/influxdata/kapacitor.(*periodicBarrier).emitBarrier
  13590942  5.36% 80.01%   13590942  5.36%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/influxql.encodeTags
   9683888  3.82% 83.83%    9683888  3.82%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).unmarshalBinary
   9202897  3.63% 87.46%   22793839  8.98%  github.com/influxdata/kapacitor.convertFloatPoint
   7045227  2.78% 90.23%    7045227  2.78%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Name
   4871732  1.92% 92.15%    4871732  1.92%  github.com/influxdata/kapacitor/edge.BatchPointFromPoint
   1784280   0.7% 92.86%    1784280   0.7%  github.com/influxdata/kapacitor/edge.(*pointMessage).ShallowCopy
   1693245  0.67% 93.52%    2004546  0.79%  github.com/influxdata/kapacitor/edge.NewPointMessage
   1658880  0.65% 94.18%    2491189  0.98%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parsePoint
   1656757  0.65% 94.83%    4679768  1.84%  github.com/influxdata/kapacitor/services/udp.(*Service).serve
   1646692  0.65% 95.48%    1646692  0.65%  syscall.anyToSockaddr
   1627118  0.64% 96.12%    1640457  0.65%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parseTags
   1376319  0.54% 96.66%    3023011  1.19%  net.(*UDPConn).readFrom
   1376277  0.54% 97.21%    3867466  1.52%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePointsWithPrecision
    617947  0.24% 97.45%    2016082  0.79%  github.com/influxdata/kapacitor.(*AlertNode).renderID
    163847 0.065% 97.51%   86966602 34.28%  github.com/influxdata/kapacitor.(*BarrierNode).NewGroup
    152931  0.06% 97.57%   14063372  5.54%  github.com/influxdata/kapacitor.(*periodicBarrier).periodicEmitter
    108311 0.043% 97.62%    4980043  1.96%  github.com/influxdata/kapacitor.(*windowTimeBuffer).points
    106502 0.042% 97.66%    5704480  2.25%  github.com/influxdata/kapacitor.(*windowByTime).batch
     45530 0.018% 97.68%   88450861 34.86%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).getOrCreateGroup
         0     0% 97.68%    2851193  1.12%  github.com/influxdata/kapacitor.(*AlertNode).runAlert
         0     0% 97.68%  118282743 46.62%  github.com/influxdata/kapacitor.(*BarrierNode).runBarrierEmitter
         0     0% 97.68%   27556840 10.86%  github.com/influxdata/kapacitor.(*FromNode).Point
         0     0% 97.68%   27968454 11.02%  github.com/influxdata/kapacitor.(*FromNode).runStream
         0     0% 97.68%   24799627  9.77%  github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQL
         0     0% 97.68%   48848474 19.25%  github.com/influxdata/kapacitor.(*TaskMaster).WritePoints
         0     0% 97.68%    8255397  3.25%  github.com/influxdata/kapacitor.(*WindowNode).runWindow
         0     0% 97.68%    1802300  0.71%  github.com/influxdata/kapacitor.(*alertState).Point
         0     0% 97.68%   22793839  8.98%  github.com/influxdata/kapacitor.(*floatPointAggregator).AggregatePoint
         0     0% 97.68%   23395427  9.22%  github.com/influxdata/kapacitor.(*influxqlGroup).BatchPoint
         0     0% 97.68%  182157414 71.80%  github.com/influxdata/kapacitor.(*node).start.func1
         0     0% 97.68%    5456825  2.15%  github.com/influxdata/kapacitor.(*windowByTime).Point
         0     0% 97.68%  182157408 71.80%  github.com/influxdata/kapacitor/edge.(*consumer).Consume
         0     0% 97.68%   23395427  9.22%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).BatchPoint
         0     0% 97.68%   35505282 13.99%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).Point
         0     0% 97.68%   24799627  9.77%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).BufferedBatch
         0     0% 97.68%  154188954 60.77%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Consume
         0     0% 97.68%  128939059 50.82%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Point
         0     0% 97.68%   25772560 10.16%  github.com/influxdata/kapacitor/edge.(*pointMessage).SetDimensions
         0     0% 97.68%   23395427  9.22%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).BatchPoint
         0     0% 97.68%   34815965 13.72%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).Point
         0     0% 97.68%   24799627  9.77%  github.com/influxdata/kapacitor/edge.receiveBufferedBatch
         0     0% 97.68%   52715940 20.78%  github.com/influxdata/kapacitor/services/udp.(*Service).processPackets
         0     0% 97.68%    9683888  3.82%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Fields
         0     0% 97.68%    1640457  0.65%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Tags
         0     0% 97.68%    3867466  1.52%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePoints
         0     0% 97.68%    1646692  0.65%  internal/poll.(*FD).ReadFrom
         0     0% 97.68%    3023011  1.19%  net.(*UDPConn).ReadFromUDP
         0     0% 97.68%    1646692  0.65%  net.(*netFD).readFrom
         0     0% 97.68%  253634506   100%  runtime.goexit
         0     0% 97.68%    1646692  0.65%  syscall.Recvfrom

and for in use space:

go tool pprof --text kapacitord top_ip_and_store_id_last/heap\?debug=1 
File: kapacitord
Type: inuse_space
Showing nodes accounting for 19161.22MB, 97.53% of 19646.18MB total
Dropped 128 nodes (cum <= 98.23MB)
      flat  flat%   sum%        cum   cum%
 5369.30MB 27.33% 27.33%  5369.30MB 27.33%  github.com/influxdata/kapacitor/edge.(*pointMessage).GroupInfo
 3794.53MB 19.31% 46.64%  6652.30MB 33.86%  github.com/influxdata/kapacitor.newPeriodicBarrier
 2852.22MB 14.52% 61.16%  2852.77MB 14.52%  time.NewTicker
 1230.21MB  6.26% 67.42%  1230.21MB  6.26%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.Tags.Map
  945.72MB  4.81% 72.24%   945.72MB  4.81%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).unmarshalBinary
  892.46MB  4.54% 76.78%  7766.76MB 39.53%  github.com/influxdata/kapacitor.(*BarrierNode).NewGroup
  551.54MB  2.81% 79.59%   966.55MB  4.92%  github.com/influxdata/kapacitor.convertFloatPoint
  541.01MB  2.75% 82.34%   541.01MB  2.75%  github.com/influxdata/kapacitor/models.ToGroupID
  455.63MB  2.32% 84.66%   521.63MB  2.66%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parsePoint
  426.01MB  2.17% 86.83%   426.01MB  2.17%  github.com/influxdata/kapacitor.(*periodicBarrier).emitBarrier
  415.01MB  2.11% 88.94%   415.01MB  2.11%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/influxql.encodeTags
  245.03MB  1.25% 90.19%   245.03MB  1.25%  github.com/influxdata/kapacitor/edge.(*pointMessage).ShallowCopy
  232.53MB  1.18% 91.37%   238.03MB  1.21%  github.com/influxdata/kapacitor/edge.NewPointMessage
  223.01MB  1.14% 92.51%   223.01MB  1.14%  github.com/influxdata/kapacitor/edge.BatchPointFromPoint
     215MB  1.09% 93.60%  6867.30MB 34.95%  github.com/influxdata/kapacitor.(*BarrierNode).newBarrier
  188.02MB  0.96% 94.56%   190.53MB  0.97%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parseTags
  108.51MB  0.55% 95.11%   272.02MB  1.38%  github.com/influxdata/kapacitor/services/udp.(*Service).serve
  107.50MB  0.55% 95.66%   107.50MB  0.55%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Name
  100.51MB  0.51% 96.17%   100.51MB  0.51%  syscall.anyToSockaddr
   83.84MB  0.43% 96.60%   306.85MB  1.56%  github.com/influxdata/kapacitor.(*windowTimeBuffer).points
      63MB  0.32% 96.92%   163.51MB  0.83%  net.(*UDPConn).readFrom
   46.70MB  0.24% 97.16%   146.72MB  0.75%  github.com/influxdata/kapacitor/edge.(*statsEdge).incCollected
   32.42MB  0.17% 97.32%  7904.19MB 40.23%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).getOrCreateGroup
      21MB  0.11% 97.43%   542.63MB  2.76%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePointsWithPrecision
      14MB 0.071% 97.50%   440.01MB  2.24%  github.com/influxdata/kapacitor.(*periodicBarrier).periodicEmitter
    6.50MB 0.033% 97.53%   340.85MB  1.73%  github.com/influxdata/kapacitor.(*windowByTime).batch
         0     0% 97.53%   192.18MB  0.98%  github.com/influxdata/kapacitor.(*AlertNode).runAlert
         0     0% 97.53% 12749.75MB 64.90%  github.com/influxdata/kapacitor.(*BarrierNode).runBarrierEmitter
         0     0% 97.53%   774.55MB  3.94%  github.com/influxdata/kapacitor.(*FromNode).Point
         0     0% 97.53%   832.51MB  4.24%  github.com/influxdata/kapacitor.(*FromNode).runStream
         0     0% 97.53%  1171.02MB  5.96%  github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQL
         0     0% 97.53%  2687.48MB 13.68%  github.com/influxdata/kapacitor.(*TaskMaster).WritePoints
         0     0% 97.53%      724MB  3.69%  github.com/influxdata/kapacitor.(*WindowNode).runWindow
         0     0% 97.53%   966.55MB  4.92%  github.com/influxdata/kapacitor.(*floatPointAggregator).AggregatePoint
         0     0% 97.53%  1028.06MB  5.23%  github.com/influxdata/kapacitor.(*influxqlGroup).BatchPoint
         0     0% 97.53% 15669.47MB 79.76%  github.com/influxdata/kapacitor.(*node).start.func1
         0     0% 97.53%   370.80MB  1.89%  github.com/influxdata/kapacitor.(*windowByTime).Point
         0     0% 97.53% 15659.62MB 79.71%  github.com/influxdata/kapacitor/edge.(*consumer).Consume
         0     0% 97.53%  1028.06MB  5.23%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).BatchPoint
         0     0% 97.53%   118.41MB   0.6%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).EndBatch
         0     0% 97.53%  1316.67MB  6.70%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).Point
         0     0% 97.53%   147.22MB  0.75%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).forward
         0     0% 97.53%  1171.02MB  5.96%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).BufferedBatch
         0     0% 97.53% 14827.11MB 75.47%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Consume
         0     0% 97.53% 13633.07MB 69.39%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Point
         0     0% 97.53%   529.51MB  2.70%  github.com/influxdata/kapacitor/edge.(*pointMessage).SetDimensions
         0     0% 97.53%   140.38MB  0.71%  github.com/influxdata/kapacitor/edge.(*streamStatsEdge).Collect
         0     0% 97.53%  1028.06MB  5.23%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).BatchPoint
         0     0% 97.53%  1208.85MB  6.15%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).Point
         0     0% 97.53%  1171.02MB  5.96%  github.com/influxdata/kapacitor/edge.receiveBufferedBatch
         0     0% 97.53%  3230.11MB 16.44%  github.com/influxdata/kapacitor/services/udp.(*Service).processPackets
         0     0% 97.53%   945.72MB  4.81%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Fields
         0     0% 97.53%   190.53MB  0.97%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Tags
         0     0% 97.53%   542.63MB  2.76%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePoints
         0     0% 97.53%   100.51MB  0.51%  internal/poll.(*FD).ReadFrom
         0     0% 97.53%   163.51MB  0.83%  net.(*UDPConn).ReadFromUDP
         0     0% 97.53%   100.51MB  0.51%  net.(*netFD).readFrom
         0     0% 97.53% 19615.16MB 99.84%  runtime.goexit
         0     0% 97.53%   100.51MB  0.51%  syscall.Recvfrom

A profile dump at roughly the same time shows:

go tool pprof --text kapacitord top_ip_and_store_id_last/profile 
File: kapacitord
Type: cpu
Time: Jan 27, 2021 at 2:20pm (PST)
Duration: 30.17s, Total samples = 66.49s (220.40%)
Showing nodes accounting for 56.07s, 84.33% of 66.49s total
Dropped 417 nodes (cum <= 0.33s)
      flat  flat%   sum%        cum   cum%
    11.90s 17.90% 17.90%     12.39s 18.63%  runtime.findObject
     9.26s 13.93% 31.82%     26.13s 39.30%  runtime.scanobject
     5.15s  7.75% 39.57%      5.15s  7.75%  runtime.markBits.isMarked
     2.63s  3.96% 43.53%      2.78s  4.18%  syscall.Syscall6
     2.17s  3.26% 46.79%      2.88s  4.33%  runtime.mapaccess2_faststr
     1.32s  1.99% 48.77%      1.32s  1.99%  runtime.epollwait
     1.26s  1.90% 50.67%      9.32s 14.02%  runtime.mallocgc
     1.25s  1.88% 52.55%      1.25s  1.88%  runtime.futex
     1.17s  1.76% 54.31%      1.19s  1.79%  runtime.pageIndexOf
     0.96s  1.44% 55.75%      1.13s  1.70%  runtime.heapBitsSetType
     0.90s  1.35% 57.11%      0.90s  1.35%  runtime.usleep
     0.84s  1.26% 58.37%      0.84s  1.26%  runtime.nextFreeFast
     0.80s  1.20% 59.57%      0.83s  1.25%  runtime.(*itabTableType).find
     0.74s  1.11% 60.69%      0.74s  1.11%  runtime.memclrNoHeapPointers
     0.73s  1.10% 61.78%      4.10s  6.17%  runtime.selectgo
     0.71s  1.07% 62.85%      3.78s  5.69%  runtime.gcWriteBarrier
     0.70s  1.05% 63.90%      0.70s  1.05%  memeqbody
     0.67s  1.01% 64.91%      0.77s  1.16%  runtime.lock
     0.56s  0.84% 65.75%      0.60s   0.9%  runtime.spanOf
     0.55s  0.83% 66.58%      0.62s  0.93%  runtime.unlock
     0.52s  0.78% 67.36%      0.52s  0.78%  runtime.memmove
     0.48s  0.72% 68.09%      0.48s  0.72%  github.com/influxdata/kapacitor/edge.(*pointMessage).Fields
     0.48s  0.72% 68.81%      1.09s  1.64%  runtime.gcmarknewobject
     0.44s  0.66% 69.47%      0.44s  0.66%  runtime.spanOfUnchecked
     0.42s  0.63% 70.10%      0.42s  0.63%  aeshashbody
     0.39s  0.59% 70.69%      3.51s  5.28%  runtime.wbBufFlush1
     0.36s  0.54% 71.23%      5.38s  8.09%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).Point
     0.35s  0.53% 71.76%         1s  1.50%  runtime.mapiternext
     0.33s   0.5% 72.25%      7.40s 11.13%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).Point
     0.33s   0.5% 72.75%      6.15s  9.25%  runtime.greyobject
     0.32s  0.48% 73.23%      1.15s  1.73%  runtime.getitab
     0.31s  0.47% 73.70%      3.36s  5.05%  runtime.findrunnable
     0.30s  0.45% 74.15%      2.87s  4.32%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/influxql.encodeTags
     0.29s  0.44% 74.58%      2.07s  3.11%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).forward
     0.29s  0.44% 75.02%      0.58s  0.87%  runtime.mapaccess1
     0.29s  0.44% 75.45%      1.96s  2.95%  runtime.mapassign_faststr
     0.25s  0.38% 75.83%      0.61s  0.92%  runtime.markBitsForAddr
     0.20s   0.3% 76.13%      5.48s  8.24%  runtime.newobject
     0.20s   0.3% 76.43%      1.06s  1.59%  runtime.runqgrab
     0.19s  0.29% 76.72%      1.21s  1.82%  github.com/influxdata/kapacitor/edge.(*statsEdge).incCollected
     0.19s  0.29% 77.00%      0.84s  1.26%  runtime.bulkBarrierPreWrite
     0.18s  0.27% 77.27%      3.34s  5.02%  github.com/influxdata/kapacitor/edge.(*streamStatsEdge).Collect
     0.15s  0.23% 77.50%     19.80s 29.78%  github.com/influxdata/kapacitor/edge.(*consumer).Consume
     0.15s  0.23% 77.73%      3.20s  4.81%  github.com/influxdata/kapacitor/edge.(*streamStatsEdge).Emit
     0.15s  0.23% 77.95%      0.96s  1.44%  runtime.sellock
     0.14s  0.21% 78.16%      0.35s  0.53%  github.com/influxdata/kapacitor.(*windowTimeBuffer).insert
     0.14s  0.21% 78.37%      0.52s  0.78%  runtime.selunlock
     0.13s   0.2% 78.57%      0.95s  1.43%  runtime.slicebytetostring
     0.13s   0.2% 78.76%      0.96s  1.44%  runtime.typedmemmove
     0.12s  0.18% 78.94%     23.22s 34.92%  runtime.gcDrain
     0.12s  0.18% 79.12%      3.90s  5.87%  runtime.schedule
     0.11s  0.17% 79.29%      2.24s  3.37%  github.com/influxdata/kapacitor.(*windowByTime).Point
     0.11s  0.17% 79.46%      1.44s  2.17%  github.com/influxdata/kapacitor/edge.(*pointMessage).GroupInfo
     0.10s  0.15% 79.61%      1.80s  2.71%  github.com/influxdata/kapacitor.(*TaskMaster).forkPoint
     0.10s  0.15% 79.76%      0.80s  1.20%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.scanKey
     0.10s  0.15% 79.91%      4.20s  6.32%  net.(*UDPConn).readFrom
     0.10s  0.15% 80.06%      0.97s  1.46%  runtime.assertI2I2
     0.10s  0.15% 80.21%      1.70s  2.56%  runtime.makeslice
     0.09s  0.14% 80.34%      0.83s  1.25%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parseTags
     0.09s  0.14% 80.48%      0.34s  0.51%  runtime.memhash
     0.09s  0.14% 80.61%      1.45s  2.18%  runtime.netpoll
     0.08s  0.12% 80.73%      5.77s  8.68%  github.com/influxdata/kapacitor.(*influxqlGroup).BatchPoint
     0.08s  0.12% 80.85%      1.32s  1.99%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).getOrCreateGroup
     0.08s  0.12% 80.97%      3.42s  5.14%  runtime.gcDrainN
     0.08s  0.12% 81.09%      0.90s  1.35%  runtime.mapiterinit
     0.07s  0.11% 81.20%      1.48s  2.23%  github.com/influxdata/kapacitor.(*alertState).Point
     0.07s  0.11% 81.31%      2.06s  3.10%  github.com/influxdata/kapacitor/edge.(*channelEdge).Collect
     0.07s  0.11% 81.41%      2.02s  3.04%  github.com/influxdata/kapacitor/edge.(*channelEdge).Emit
     0.06s  0.09% 81.50%      1.09s  1.64%  github.com/influxdata/kapacitor.(*StreamNode).runSourceStream
     0.06s  0.09% 81.59%      0.61s  0.92%  github.com/influxdata/kapacitor.(*streamEdge).CollectPoint
     0.06s  0.09% 81.68%      5.32s  8.00%  github.com/influxdata/kapacitor.convertFloatPoint
     0.06s  0.09% 81.77%      0.67s  1.01%  github.com/influxdata/kapacitor/edge.(*statsEdge).incEmitted
     0.06s  0.09% 81.86%      5.14s  7.73%  github.com/influxdata/kapacitor/services/udp.(*Service).serve
     0.06s  0.09% 81.95%      3.62s  5.44%  internal/poll.(*FD).ReadFrom
     0.06s  0.09% 82.04%      4.26s  6.41%  net.(*UDPConn).ReadFromUDP
     0.06s  0.09% 82.13%      0.53s   0.8%  runtime.makemap
     0.05s 0.075% 82.21%      2.83s  4.26%  github.com/influxdata/kapacitor.(*TaskMaster).runForking
     0.05s 0.075% 82.28%      0.90s  1.35%  github.com/influxdata/kapacitor/edge.NewBatchPointMessage
     0.05s 0.075% 82.36%      0.38s  0.57%  github.com/influxdata/kapacitor/models.ToGroupID
     0.05s 0.075% 82.43%      2.11s  3.17%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePointsWithPrecision
     0.05s 0.075% 82.51%      0.64s  0.96%  runtime.(*mcentral).cacheSpan
     0.05s 0.075% 82.58%      0.39s  0.59%  runtime.strhash
     0.05s 0.075% 82.66%      0.63s  0.95%  sort.Strings
     0.04s  0.06% 82.72%      4.62s  6.95%  github.com/influxdata/kapacitor.(*TaskMaster).WritePoints
     0.04s  0.06% 82.78%      1.66s  2.50%  github.com/influxdata/kapacitor.(*windowTimeBuffer).points
     0.04s  0.06% 82.84%      7.16s 10.77%  github.com/influxdata/kapacitor/services/udp.(*Service).processPackets
     0.04s  0.06% 82.90%      1.24s  1.86%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.Tags.Map
     0.04s  0.06% 82.96%      1.60s  2.41%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.parsePoint
     0.04s  0.06% 83.02%      0.44s  0.66%  runtime.chansend1
     0.04s  0.06% 83.08%     30.96s 46.56%  runtime.systemstack
     0.04s  0.06% 83.14%      0.34s  0.51%  syscall.anyToSockaddr
     0.04s  0.06% 83.20%      2.82s  4.24%  syscall.recvfrom
     0.03s 0.045% 83.25%      0.80s  1.20%  github.com/influxdata/kapacitor.(*FromNode).Point
     0.03s 0.045% 83.29%      0.44s  0.66%  github.com/influxdata/kapacitor.EvalPredicate
     0.03s 0.045% 83.34%      8.25s 12.41%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Point
     0.03s 0.045% 83.38%      0.38s  0.57%  github.com/influxdata/kapacitor/edge.Forward
     0.03s 0.045% 83.43%      0.67s  1.01%  runtime.(*mcache).refill
     0.03s 0.045% 83.47%      1.09s  1.64%  runtime.runqsteal
     0.03s 0.045% 83.52%      0.64s  0.96%  runtime.send
     0.03s 0.045% 83.56%      0.47s  0.71%  runtime.timerproc
     0.03s 0.045% 83.61%      3.19s  4.80%  syscall.Recvfrom
     0.03s 0.045% 83.65%      0.54s  0.81%  text/template.(*state).walk
     0.02s  0.03% 83.68%      0.98s  1.47%  github.com/influxdata/kapacitor.(*streamEdge).EmitPoint
     0.02s  0.03% 83.71%      1.97s  2.96%  github.com/influxdata/kapacitor.(*windowByTime).batch
     0.02s  0.03% 83.74%      0.36s  0.54%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).Barrier
     0.02s  0.03% 83.77%      5.86s  8.81%  github.com/influxdata/kapacitor/edge.(*timedForwardReceiver).BatchPoint
     0.02s  0.03% 83.80%      1.40s  2.11%  github.com/influxdata/kapacitor/edge.BatchPointFromPoint
     0.02s  0.03% 83.83%      3.10s  4.66%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/influxql.NewTags
     0.02s  0.03% 83.86%      0.87s  1.31%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Tags
     0.02s  0.03% 83.89%      2.16s  3.25%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.ParsePoints
     0.02s  0.03% 83.92%      3.65s  5.49%  net.(*netFD).readFrom
     0.02s  0.03% 83.95%      0.42s  0.63%  runtime.chansend
     0.02s  0.03% 83.98%      4.20s  6.32%  runtime.mcall
     0.02s  0.03% 84.01%         4s  6.02%  runtime.park_m
     0.02s  0.03% 84.04%      0.41s  0.62%  runtime.ready
     0.02s  0.03% 84.07%      0.38s  0.57%  runtime.stopm
     0.02s  0.03% 84.10%      3.54s  5.32%  runtime.wbBufFlush
     0.01s 0.015% 84.12%      0.45s  0.68%  github.com/influxdata/kapacitor.(*AlertNode).findFirstMatchLevel
     0.01s 0.015% 84.13%      5.40s  8.12%  github.com/influxdata/kapacitor.(*floatPointAggregator).AggregatePoint
     0.01s 0.015% 84.15%      1.78s  2.68%  github.com/influxdata/kapacitor.(*periodicBarrier).emitBarrier
     0.01s 0.015% 84.16%         2s  3.01%  github.com/influxdata/kapacitor.(*periodicBarrier).periodicEmitter
     0.01s 0.015% 84.18%      6.59s  9.91%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).BufferedBatch
     0.01s 0.015% 84.19%      0.50s  0.75%  github.com/influxdata/kapacitor/edge.NewPointMessage
     0.01s 0.015% 84.21%      6.62s  9.96%  github.com/influxdata/kapacitor/edge.receiveBufferedBatch
     0.01s 0.015% 84.22%      1.11s  1.67%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).unmarshalBinary
     0.01s 0.015% 84.24%      0.75s  1.13%  runtime.(*mcache).nextFree
     0.01s 0.015% 84.25%      0.36s  0.54%  runtime.convTslice
     0.01s 0.015% 84.27%      0.53s   0.8%  runtime.futexsleep
     0.01s 0.015% 84.28%      0.74s  1.11%  runtime.futexwakeup
     0.01s 0.015% 84.30%      0.34s  0.51%  runtime.notesleep
     0.01s 0.015% 84.31%      0.68s  1.02%  runtime.notewakeup
     0.01s 0.015% 84.33%      3.52s  5.29%  runtime.wbBufFlush.func1
         0     0% 84.33%      0.45s  0.68%  github.com/influxdata/kapacitor.(*AlertNode).determineLevel
         0     0% 84.33%      1.02s  1.53%  github.com/influxdata/kapacitor.(*AlertNode).renderID
         0     0% 84.33%      2.42s  3.64%  github.com/influxdata/kapacitor.(*AlertNode).runAlert
         0     0% 84.33%      3.25s  4.89%  github.com/influxdata/kapacitor.(*BarrierNode).runBarrierEmitter
         0     0% 84.33%      2.35s  3.53%  github.com/influxdata/kapacitor.(*FromNode).runStream
         0     0% 84.33%      6.89s 10.36%  github.com/influxdata/kapacitor.(*InfluxQLNode).runInfluxQL
         0     0% 84.33%      2.83s  4.26%  github.com/influxdata/kapacitor.(*TaskMaster).stream.func1
         0     0% 84.33%      4.89s  7.35%  github.com/influxdata/kapacitor.(*WindowNode).runWindow
         0     0% 84.33%     20.89s 31.42%  github.com/influxdata/kapacitor.(*node).start.func1
         0     0% 84.33%      5.86s  8.81%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).BatchPoint
         0     0% 84.33%      0.50s  0.75%  github.com/influxdata/kapacitor/edge.(*forwardingReceiver).EndBatch
         0     0% 84.33%      0.37s  0.56%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Barrier
         0     0% 84.33%     17.45s 26.24%  github.com/influxdata/kapacitor/edge.(*groupedConsumer).Consume
         0     0% 84.33%      0.36s  0.54%  github.com/influxdata/kapacitor/edge.(*pointMessage).ShallowCopy
         0     0% 84.33%      1.12s  1.68%  github.com/influxdata/kapacitor/vendor/github.com/influxdata/influxdb/models.(*point).Fields
         0     0% 84.33%      1.31s  1.97%  runtime.convT2E
         0     0% 84.33%      3.42s  5.14%  runtime.gcAssistAlloc
         0     0% 84.33%      3.42s  5.14%  runtime.gcAssistAlloc.func1
         0     0% 84.33%      3.42s  5.14%  runtime.gcAssistAlloc1
         0     0% 84.33%     23.23s 34.94%  runtime.gcBgMarkWorker
         0     0% 84.33%     23.22s 34.92%  runtime.gcBgMarkWorker.func2
         0     0% 84.33%      0.41s  0.62%  runtime.goready
         0     0% 84.33%      0.41s  0.62%  runtime.goready.func1
         0     0% 84.33%      0.67s  1.01%  runtime.startm
         0     0% 84.33%      0.51s  0.77%  runtime.wakep
         0     0% 84.33%      0.64s  0.96%  text/template.(*Template).Execute
         0     0% 84.33%      0.64s  0.96%  text/template.(*Template).execute

So I'm left head scratching of running through many iterations of changes to the tick scripts. The data, while constantly flowing the Kapacitor doesn't cause memory budge from about a gigabyte if all the tick scripts are disabled / inhibited from processing.

A few questions:

Am I correctly configuring Kapacitor to do what I want here for data directly sent to it, that we don't care about at all past the window (again, 1-2 minutes)? That is, the only way to purge data once you open a stream and consume from it is to have barrier events remove them -- this seems to be adding high CPU load and potentially accounting for memory when you get over a size of a certain number, even if cardinality is in check.
Configuring the barrier to emit delete on idle causes quicker growth, a steeper graph, but that's probably because it's not the option for that node we want ... there will essentially never be an idle period in the data Kapacitor receives for these measurements in Production. That said, a lot of time and overhead seem to be spent with the tickers responsible for purging the data.
The carnality does fluctuate and stay within a reasonable size when a tick script is example, as the various aforementioned outputs show. So why the growth till OOM of Kapacitor?
Though sum doesn't really seem to be anywhere near the problem with this configuration and the growth issue, is there a cleaner way of identifying the count of points in the stream? Is it possible for me to drop the value count=1 given that's the only purpose it serves?
Retention in this situation -- where data never interacts with InfluxDB in any way -- is moot, no where in Kapacitor is retention honored, exhibited by the fact that Kapacitor allows you to place already-existing-in-Influx retention periods on data in Kapacitor. Is that right?

I've attached (top_combined_and_top_ip_and_store_kapacitor_1.5.7_growth.tar.gz) the various dumps / profile captures, etc... from the time each of these steps were in place, an output of the structure is:

tree -D    
.
├── top_combined
│   ├── [Jan 25 19:31]  allocs?debug=1
│   ├── [Jan 25 19:31]  goroutine?debug=1
│   ├── [Jan 25 19:33]  goroutine?debug=2
│   ├── [Jan 25 19:31]  heap?debug=1
│   ├── [Jan 25 22:52]  index.html
│   ├── [Jan 25 19:32]  mutex?debug=1
│   ├── [Jan 25 19:32]  profile
│   ├── [Jan 25 19:32]  show_top_combined
│   ├── [Jan 25 19:32]  threadcreate?debug=1
│   └── [Jan 25 19:32]  trace
├── top_ip_and_store_id
│   ├── [Jan 27 11:26]  allocs?debug=1
│   ├── [Jan 27 11:23]  goroutine?debug=1
│   ├── [Jan 27 11:23]  goroutine?debug=2
│   ├── [Jan 27 11:25]  heap?debug=1
│   ├── [Jan 27 11:26]  index.html
│   ├── [Jan 27 11:24]  mutex?debug=1
│   ├── [Jan 27 11:25]  profile
│   ├── [Jan 27 11:27]  show_top_ips
│   ├── [Jan 27 11:27]  show_top_stores
│   ├── [Jan 27 11:26]  threadcreate?debug=1
│   └── [Jan 27 11:24]  trace
└── top_ip_and_store_id_last
    ├── [Jan 27 14:18]  allocs?debug=1
    ├── [Jan 27 14:19]  goroutine?debug=1
    ├── [Jan 27 14:19]  goroutine?debug=2
    ├── [Jan 27 14:19]  heap?debug=1
    ├── [Jan 27 14:24]  index.html
    ├── [Jan 27 14:19]  mutex?debug=1
    ├── [Jan 27 14:20]  profile
    ├── [Jan 27 14:20]  threadcreate?debug=1
    └── [Jan 27 14:20]  trace

3 directories, 30 files

With the files in top_ip_and_store_id and top_ip_and_store_id_last taken a few hours apart as the ^^^ shows.

docmerlin commented 3 years ago

If you don't care about data outside the window then your alert should stop keeping data so old, right now it keeps quite a bit of data in its history to prevent flapping etc.

You don't need a barrier, unless new data for a tagset stops coming in.

Are you using docker?

The carnality does fluctuate and stay within a reasonable size when a tick script is example, as the various aforementioned outputs show. So why the growth till OOM of Kapacitor?

I don't know for your example.

Since count is always 1, you should use a |count() instead of your |sum("count")

Your cardinality isn't insane, so you shouldn't be using anywhere near that much memory. We did slightly reduce memory pressure in 1.5.8, an upgrade may mitigate your problem signicantly. Also, if that doesn't work try adding the environmental variable GODEBUG=madvdontneed=1

Good luck, if this doesn't work, come back to me and we can try deeper debugging

The change in 1.5.8 significantly reduced memory pressure from github.com/influxdata/kapacitor/edge.(*pointMessage).GroupInfo

Good luck, if this doesn't work, feel free to poke me again.

joshdurbin commented 3 years ago

Will do. Will try and make some changes in the next few days and report back. Cheers!

joshdurbin commented 3 years ago

Since posting this we dialed back our monitoring to three tick scripts with a single measurement each:

ip
store ID
user agent

with scripts that look like:

dbrp "toptraffic"."autogen"

var windowedStream = stream
  |from()
    .groupBy('ip')
    .measurement('ips')
  |barrier()
    .period(2m)
    .delete(TRUE)
  |window()
    .period(2m)
    .every(5s)
    .align()

var storelbStream = windowedStream
  |where(lambda: "role" == 'storelb')
  |sum('count')
    .as('totalCount')

var sharedlbStream = windowedStream
  |where(lambda: "role" == 'sharedlb')
  |sum('count')
    .as('totalCount')

var joinStream = storelbStream
  |join(sharedlbStream)
    .as('storelb','sharedlb')
    .tolerance(1s)

joinStream
  |alert()
    .flapping(0.25, 0.5)
    .history(21)
    .warn(lambda: "storelb.totalCount" > 75000 OR "sharedlb.totalCount" > 75000 OR "storelb.totalCount" + "sharedlb.totalCount" > 75000)
    .crit(lambda: "storelb.totalCount" > 125000 OR "sharedlb.totalCount" > 125000 OR "storelb.totalCount" + "sharedlb.totalCount" > 125000)
    .message('''Observed a high rate of request volume in Production for IP {{ index .Tags "ip" }}. Requests within a 2m window at the Store LB are: {{ index .Fields "storelb.totalCount" }} and Shared LB: {{ index .Fields "sharedlb.totalCount" }}.''')
    .stateChangesOnly(2m)
    .slack()
    .channel('#ops-noise')
    .noRecoveries()

joinStream
  |alert()
    .flapping(0.25, 0.5)
    .history(21)
    .warn(lambda: "storelb.totalCount" > 75000 OR "sharedlb.totalCount" > 75000 OR "storelb.totalCount" + "sharedlb.totalCount" > 75000)
    .crit(lambda: "storelb.totalCount" > 125000 OR "sharedlb.totalCount" > 125000 OR "storelb.totalCount" + "sharedlb.totalCount" > 125000)
    .message('''Observed a high rate of request volume in Production for IP {{ index .Tags "ip" }}. Requests within a 2m window at the Store LB are: {{ index .Fields "storelb.totalCount" }} and Shared LB: {{ index .Fields "sharedlb.totalCount" }}.''')
    .stateChangesOnly(2m)
    .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')

We thought we were holding steady at around 5GB memory usage, but over 8-9 days we crept up to 9.5 or so.

We deployed the changes you recommended (excluding the count change) altering the script to look like:

dbrp "toptraffic"."autogen"

var windowedStream = stream
  |from()
    .groupBy('ip')
    .measurement('ips')
  |window()
    .period(2m)
    .every(5s)
    .align()

var storelbStream = windowedStream
  |where(lambda: "role" == 'storelb')
  |sum('count')
    .as('totalCount')

var sharedlbStream = windowedStream
  |where(lambda: "role" == 'sharedlb')
  |sum('count')
    .as('totalCount')

var joinStream = storelbStream
  |join(sharedlbStream)
    .as('storelb','sharedlb')
    .tolerance(1s)

joinStream
  |alert()
    .warn(lambda: "storelb.totalCount" > 75000 OR "sharedlb.totalCount" > 75000 OR "storelb.totalCount" + "sharedlb.totalCount" > 75000)
    .crit(lambda: "storelb.totalCount" > 125000 OR "sharedlb.totalCount" > 125000 OR "storelb.totalCount" + "sharedlb.totalCount" > 125000)
    .message('''Observed a high rate of request volume in Production for IP {{ index .Tags "ip" }}. Requests within a 2m window at the Store LB are: {{ index .Fields "storelb.totalCount" }} and Shared LB: {{ index .Fields "sharedlb.totalCount" }}.''')
    .slack()
    .channel('#ops-noise')
    .noRecoveries()

joinStream
  |alert()
    .warn(lambda: "storelb.totalCount" > 75000 OR "sharedlb.totalCount" > 75000 OR "storelb.totalCount" + "sharedlb.totalCount" > 75000)
    .crit(lambda: "storelb.totalCount" > 125000 OR "sharedlb.totalCount" > 125000 OR "storelb.totalCount" + "sharedlb.totalCount" > 125000)
    .message('''Observed a high rate of request volume in Production for IP {{ index .Tags "ip" }}. Requests within a 2m window at the Store LB are: {{ index .Fields "storelb.totalCount" }} and Shared LB: {{ index .Fields "sharedlb.totalCount" }}.''')
    .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')

...which produced rapid memory utilization up to about 35.5GB over 5 1/2 hours:

kapacitor show top_ips
ID: top_ips
Error: 
Template: 
Type: stream
Status: enabled
Executing: true
Created: 22 Jan 21 22:25 UTC
Modified: 11 Feb 21 20:52 UTC
LastEnabled: 11 Feb 21 20:52 UTC
Databases Retention Policies: ["toptraffic"."autogen"]
TICKscript:
dbrp "toptraffic"."autogen"

var windowedStream = stream
    |from()
        .groupBy('ip')
        .measurement('ips')
    |window()
        .period(2m)
        .every(5s)
        .align()

var storelbStream = windowedStream
    |where(lambda: "role" == 'storelb')
    |sum('count')
        .as('totalCount')

var sharedlbStream = windowedStream
    |where(lambda: "role" == 'sharedlb')
    |sum('count')
        .as('totalCount')

var joinStream = storelbStream
    |join(sharedlbStream)
        .as('storelb', 'sharedlb')
        .tolerance(1s)

joinStream
    |alert()
        .warn(lambda: "storelb.totalCount" > 75000 OR "sharedlb.totalCount" > 75000 OR "storelb.totalCount" + "sharedlb.totalCount" > 75000)
        .crit(lambda: "storelb.totalCount" > 125000 OR "sharedlb.totalCount" > 125000 OR "storelb.totalCount" + "sharedlb.totalCount" > 125000)
        .message('''Observed a high rate of request volume in Production for IP {{ index .Tags "ip" }}. Requests within a 2m window at the Store LB are: {{ index .Fields "storelb.totalCount" }} and Shared LB: {{ index .Fields "sharedlb.totalCount" }}.''')
        .slack()
        .channel('#ops-noise')
        .noRecoveries()

joinStream
    |alert()
        .warn(lambda: "storelb.totalCount" > 75000 OR "sharedlb.totalCount" > 75000 OR "storelb.totalCount" + "sharedlb.totalCount" > 75000)
        .crit(lambda: "storelb.totalCount" > 125000 OR "sharedlb.totalCount" > 125000 OR "storelb.totalCount" + "sharedlb.totalCount" > 125000)
        .message('''Observed a high rate of request volume in Production for IP {{ index .Tags "ip" }}. Requests within a 2m window at the Store LB are: {{ index .Fields "storelb.totalCount" }} and Shared LB: {{ index .Fields "sharedlb.totalCount" }}.''')
        .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')

DOT:
digraph top_ips {
graph [throughput="5019.95 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="94807195"];

from1 [avg_exec_time_ns="47.005µs" errors="0" working_cardinality="0" ];
from1 -> window2 [processed="94807195"];

window2 [avg_exec_time_ns="606.918µs" errors="0" working_cardinality="60439" ];
window2 -> where5 [processed="5125353"];
window2 -> where3 [processed="5125353"];

where5 [avg_exec_time_ns="161.103µs" errors="0" working_cardinality="50115" ];
where5 -> sum6 [processed="5124931"];

sum6 [avg_exec_time_ns="242.923µs" errors="0" working_cardinality="50115" ];
sum6 -> join8 [processed="5124931"];

where3 [avg_exec_time_ns="47.721µs" errors="0" working_cardinality="50115" ];
where3 -> sum4 [processed="5125162"];

sum4 [avg_exec_time_ns="91.395µs" errors="0" working_cardinality="50115" ];
sum4 -> join8 [processed="5125159"];

join8 [avg_exec_time_ns="24.861µs" errors="0" working_cardinality="50115" ];
join8 -> alert10 [processed="5124931"];
join8 -> alert9 [processed="5124931"];

alert10 [alerts_inhibited="0" alerts_triggered="0" avg_exec_time_ns="52.841µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="50115" ];

alert9 [alerts_inhibited="0" alerts_triggered="0" avg_exec_time_ns="26.064µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="50115" ];
}

I'll try again with the memory tuning parameter.

joshdurbin commented 3 years ago

Now that we're a few hours out, here'a a memory graph with a wider timeline than the other two in the prior comment.

The graph starts with the tick scripts in place without the changes mentioned. The peaks after the yellow arrow indicate me making various changes to the alert node, dropping history/flapping/state change tracking and the barrier nodes. The red arrow is where we revert to the tick scripts that are in place prior to the yellow arrow.

All of this seems to further indicate that if you're processing a stream of data within a window that point node TTL, if you will, does not seem to honor / evaluate / whatever the max window size. Again, in this example, I don't care about any point data over 2m and it can be expunged. Even though there is growth, it does seem that the barrier node helps inhibit (rapid) memory consumption.

joshdurbin commented 3 years ago

@docmerlin any thoughts on those updates ^^^ ?

docmerlin commented 3 years ago

Yes, we found several memory leaks in JoinNode and UnionNode and are fixing them now, That being said it is still possible for memory to grow depending on your data cardinality.

docmerlin commented 3 years ago

@joshdurbin here is a PR to fix some of the leakage. https://github.com/influxdata/kapacitor/pull/2489

joshdurbin commented 3 years ago

As those merged PRs expected in the next release?

joshdurbin commented 3 years ago

And, yeah, we see that the cardinality controls aren't passing through he join node and the dependent nodes on our end too, in these use cases:

ID: top_ips
Error: 
Template: 
Type: stream
Status: enabled
Executing: true
Created: 22 Jan 21 22:25 UTC
Modified: 12 Feb 21 01:40 UTC
LastEnabled: 12 Feb 21 01:40 UTC
Databases Retention Policies: ["toptraffic"."autogen"]
TICKscript:
dbrp "toptraffic"."autogen"

var windowedStream = stream
    |from()
        .groupBy('ip')
        .measurement('ips')
    |barrier()
        .period(2m)
        .delete(TRUE)
    |window()
        .period(2m)
        .every(5s)
        .align()

var storelbStream = windowedStream
    |where(lambda: "role" == 'storelb')
    |sum('count')
        .as('totalCount')

var sharedlbStream = windowedStream
    |where(lambda: "role" == 'sharedlb')
    |sum('count')
        .as('totalCount')

var joinStream = storelbStream
    |join(sharedlbStream)
        .as('storelb', 'sharedlb')
        .tolerance(1s)

joinStream
    |alert()
        .flapping(0.25, 0.5)
        .history(21)
        .warn(lambda: "storelb.totalCount" > 75000 OR "sharedlb.totalCount" > 75000 OR "storelb.totalCount" + "sharedlb.totalCount" > 75000)
        .crit(lambda: "storelb.totalCount" > 125000 OR "sharedlb.totalCount" > 125000 OR "storelb.totalCount" + "sharedlb.totalCount" > 125000)
        .message('''Observed a high rate of request volume in Production for IP {{ index .Tags "ip" }}. Requests within a 2m window at the Store LB are: {{ index .Fields "storelb.totalCount" }} and Shared LB: {{ index .Fields "sharedlb.totalCount" }}.''')
        .stateChangesOnly(2m)
        .slack()
        .channel('#ops-noise')
        .noRecoveries()

joinStream
    |alert()
        .flapping(0.25, 0.5)
        .history(21)
        .warn(lambda: "storelb.totalCount" > 75000 OR "sharedlb.totalCount" > 75000 OR "storelb.totalCount" + "sharedlb.totalCount" > 75000)
        .crit(lambda: "storelb.totalCount" > 125000 OR "sharedlb.totalCount" > 125000 OR "storelb.totalCount" + "sharedlb.totalCount" > 125000)
        .message('''Observed a high rate of request volume in Production for IP {{ index .Tags "ip" }}. Requests within a 2m window at the Store LB are: {{ index .Fields "storelb.totalCount" }} and Shared LB: {{ index .Fields "sharedlb.totalCount" }}.''')
        .stateChangesOnly(2m)
        .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')

DOT:
digraph top_ips {
graph [throughput="4891.68 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="4940426264"];

from1 [avg_exec_time_ns="111.079µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="4940426264"];

barrier2 [avg_exec_time_ns="38.583µs" errors="0" working_cardinality="7172" ];
barrier2 -> window3 [processed="4940418579"];

window3 [avg_exec_time_ns="685.66µs" errors="0" working_cardinality="7172" ];
window3 -> where6 [processed="247104855"];
window3 -> where4 [processed="247104855"];

where6 [avg_exec_time_ns="177.822µs" errors="0" working_cardinality="2728" ];
where6 -> sum7 [processed="247104855"];

sum7 [avg_exec_time_ns="209.806µs" errors="0" working_cardinality="2728" ];
sum7 -> join9 [processed="247104855"];

where4 [avg_exec_time_ns="862.96µs" errors="0" working_cardinality="2728" ];
where4 -> sum5 [processed="247104855"];

sum5 [avg_exec_time_ns="218.497µs" errors="0" working_cardinality="2728" ];
sum5 -> join9 [processed="247104855"];

join9 [avg_exec_time_ns="919.745µs" errors="0" working_cardinality="749692" ];
join9 -> alert11 [processed="247104855"];
join9 -> alert10 [processed="247104855"];

alert11 [alerts_inhibited="0" alerts_triggered="73" avg_exec_time_ns="50.693µs" crits_triggered="1" errors="0" infos_triggered="0" oks_triggered="36" warns_triggered="36" working_cardinality="749692" ];

alert10 [alerts_inhibited="0" alerts_triggered="37" avg_exec_time_ns="261.104µs" crits_triggered="1" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="36" working_cardinality="749692" ];
}

joshdurbin commented 3 years ago

I don't believe this comment on an issue from 2016 is correct. From what I can tell there is no situation, once a stream has been opened, where memory doesn't build to an OOM scenario.

dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')

var sharedLBTraffic = traffic
    |where(lambda: "src" == 1)
    |sum('src')
        .as('total')

var storeLBTraffic = traffic
    |where(lambda: "src" == 2)
    |sum('src')
        .as('total')

##############

First is:

dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')

Second is:

dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')

var sharedLBTraffic = traffic
    |where(lambda: "src" == 1)
    |sum('src')
        .as('total')

Third is:

dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')

traffic
    |where(lambda: "src" == 1)
    |sum('src')
        .as('total')

Next up, test 4, is (ruling out discrepancy between sum and count:

dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')

traffic
    |where(lambda: "src" == 1)
    |count('src')
        .as('total')

Fifth test is reverting the count/sum thing and adding a 30 second barrier configured to delete:

dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(30s)
        .delete(TRUE)

traffic
    |where(lambda: "src" == 1)
    |sum('src')
        .as('total')

This does show a smoothing of memory usage.

Next up, sixth test/change is adding the counts to the other monitored system:

Tick script omitted here as its show in the exec outputs below.

We can confirm that the barrier node is reducing cardinality with two subsequent runs via kapacitor show ips_and_stores:

ID: ips_and_stores
Error: 
Template: 
Type: stream
Status: enabled
Executing: true
Created: 11 Mar 21 18:05 UTC
Modified: 11 Mar 21 20:06 UTC
LastEnabled: 11 Mar 21 20:06 UTC
Databases Retention Policies: ["toptraffic"."autogen"]
TICKscript:
dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(30s)
        .delete(TRUE)

var storeLBTraffic = traffic
    |where(lambda: "src" == 1)
    |sum('src')
        .as('total')

var sharedLBTraffic = traffic
    |where(lambda: "src" == 2)
    |sum('src')
        .as('total')

DOT:
digraph ips_and_stores {
graph [throughput="14576.14 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="522683"];

from1 [avg_exec_time_ns="48.437µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="522683"];

barrier2 [avg_exec_time_ns="2.28µs" errors="0" working_cardinality="84561" ];
barrier2 -> where5 [processed="522683"];
barrier2 -> where3 [processed="522683"];

where5 [avg_exec_time_ns="3.735µs" errors="0" working_cardinality="84561" ];
where5 -> sum6 [processed="167322"];

sum6 [avg_exec_time_ns="14.734µs" errors="0" working_cardinality="7337" ];

where3 [avg_exec_time_ns="3.354µs" errors="0" working_cardinality="84561" ];
where3 -> sum4 [processed="355361"];

sum4 [avg_exec_time_ns="12.121µs" errors="0" working_cardinality="77277" ];
}

ID: ips_and_stores
Error: 
Template: 
Type: stream
Status: enabled
Executing: true
Created: 11 Mar 21 18:05 UTC
Modified: 11 Mar 21 20:06 UTC
LastEnabled: 11 Mar 21 20:06 UTC
Databases Retention Policies: ["toptraffic"."autogen"]
TICKscript:
dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(30s)
        .delete(TRUE)

var storeLBTraffic = traffic
    |where(lambda: "src" == 1)
    |sum('src')
        .as('total')

var sharedLBTraffic = traffic
    |where(lambda: "src" == 2)
    |sum('src')
        .as('total')

DOT:
digraph ips_and_stores {
graph [throughput="16238.66 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="801608"];

from1 [avg_exec_time_ns="48.437µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="801608"];

barrier2 [avg_exec_time_ns="2.28µs" errors="0" working_cardinality="73055" ];
barrier2 -> where5 [processed="801608"];
barrier2 -> where3 [processed="801608"];

where5 [avg_exec_time_ns="3.735µs" errors="0" working_cardinality="73054" ];
where5 -> sum6 [processed="254519"];

sum6 [avg_exec_time_ns="14.734µs" errors="0" working_cardinality="6920" ];

where3 [avg_exec_time_ns="3.354µs" errors="0" working_cardinality="73055" ];
where3 -> sum4 [processed="547089"];

sum4 [avg_exec_time_ns="12.121µs" errors="0" working_cardinality="66199" ];
}

Seventh test/change is joining those two count streams where we see that the cardinality of the join node is never reduced.

ID: ips_and_stores
Error: 
Template: 
Type: stream
Status: enabled
Executing: true
Created: 11 Mar 21 18:05 UTC
Modified: 11 Mar 21 20:12 UTC
LastEnabled: 11 Mar 21 20:12 UTC
Databases Retention Policies: ["toptraffic"."autogen"]
TICKscript:
dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(30s)
        .delete(TRUE)

var storeLBTraffic = traffic
    |where(lambda: "src" == 1)
    |sum('src')
        .as('total')

var sharedLBTraffic = traffic
    |where(lambda: "src" == 2)
    |sum('src')
        .as('total')

var joinStream = storeLBTraffic
    |join(sharedLBTraffic)
        .as('storelb', 'sharedlb')
        .tolerance(1s)

DOT:
digraph ips_and_stores {
graph [throughput="15907.77 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="6765340"];

from1 [avg_exec_time_ns="9.93µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="6765342"];

barrier2 [avg_exec_time_ns="19.601µs" errors="0" working_cardinality="79006" ];
barrier2 -> where5 [processed="6765291"];
barrier2 -> where3 [processed="6765291"];

where5 [avg_exec_time_ns="5.58µs" errors="0" working_cardinality="79006" ];
where5 -> sum6 [processed="2127062"];

sum6 [avg_exec_time_ns="27.531µs" errors="0" working_cardinality="6877" ];
sum6 -> join8 [processed="2031335"];

where3 [avg_exec_time_ns="7.545µs" errors="0" working_cardinality="79006" ];
where3 -> sum4 [processed="4638228"];

sum4 [avg_exec_time_ns="32.777µs" errors="0" working_cardinality="72199" ];
sum4 -> join8 [processed="3627725"];

join8 [avg_exec_time_ns="19.833µs" errors="0" working_cardinality="627063" ];
}

joshdurbin commented 3 years ago

So this all looks due to the sum and count operations on nodes.

The next iteration of something I'm trying to avoid the join is:

dbrp "toptraffic"."autogen"

var traffic = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(120s)
        .delete(TRUE)
    |stateCount(lambda: "src" == 1)
        .as('total_store_lb')
    |stateCount(lambda: "src" == 2)
        .as('total_shared_lb')
    |eval(lambda: if("total_store_lb" == -1, 0, "total_store_lb"), lambda: if("total_shared_lb" == -1, 0, "total_shared_lb"))
        .as('unsigned_total_store_lb','unsigned_total_shared_lb')
        .keep('total_store_lb', 'total_shared_lb', 'unsigned_total_store_lb', 'unsigned_total_shared_lb')
    |alert()
        .warn(lambda: "unsigned_total_store_lb" > 2500 OR "unsigned_total_shared_lb" > 2500 OR "unsigned_total_store_lb" + "unsigned_total_shared_lb" > 2500)
        .message('Observed a high rate of request volume from IP {{ index .Tags "ip" }} for Store ID {{ index .Tags "id" }} at the Shared LB: {{ index .Fields "unsigned_total_shared_lb" }} and the Store LB: {{ index .Fields "unsigned_total_store_lb" }}')
        .stateChangesOnly(2m)
        .log('/var/log/kapacitor/test.log')

It would be nice if we could change the "not found" value for the stateCount node. Of course, I could guard the stateCount node with a where cause, but then I have to split (considering I have two conditionals here) -- I need to count all those where the field src has the value 1 and 2, independently.

This script, does, produce large memory usage for Kapacitor, but has stabilized -- probably due to cardinality, which, is "high". The dot graph for about 30 minutes of data flow randomly in a day is:

DOT:
digraph ips_and_stores {
graph [throughput="15066.77 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="34578588"];

from1 [avg_exec_time_ns="16.657µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="34578588"];

barrier2 [avg_exec_time_ns="13.455µs" errors="0" working_cardinality="195103" ];
barrier2 -> state_count3 [processed="34578456"];

state_count3 [avg_exec_time_ns="36.458µs" errors="0" working_cardinality="195103" ];
state_count3 -> state_count4 [processed="34578456"];

state_count4 [avg_exec_time_ns="31.051µs" errors="0" working_cardinality="195103" ];
state_count4 -> eval5 [processed="34578456"];

eval5 [avg_exec_time_ns="56.633µs" errors="0" working_cardinality="195103" ];
eval5 -> alert6 [processed="34578456"];

alert6 [alerts_inhibited="0" alerts_triggered="553" avg_exec_time_ns="86.608µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="272" warns_triggered="281" working_cardinality="195103" ];

On second look though at the data coming out of the stream here, I don't think this is doing what I want.

joshdurbin commented 3 years ago

As suggested in this thread earlier, the alert() node takes considerable memory with its configured parameters to suppress x minutes of state changes, history, flapping, etc... Particularly since we know that barrier node signals don't make it through the join node at the moment.

Example tick scripts:

First round

dbrp "toptraffic"."autogen"

var trafficByIPAndStoreIDStream = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(121s)
        // the period for the barrier should be +1 unit over the period defined in the downstream window
        .delete(TRUE)
    |window()
        .period(120s)
        .every(5s)
    |default()
        // count is not set the filebeat-udp messenger, so we add it to each point coming through with a value of 1
        .field('count', 1)

var storeLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "src" == 1)
    // the filebeat-udp message source sets the `src` field key to the value 1 for the store LB
    |sum('count')
        // we sum on count here and not upstream because this operation drops all other fields from the forward point
        .as('totalCount')

var sharedLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "src" == 2)
    // the filebeat-udp message source sets the `src` field key to the value 2 for the shared LB
    |sum('count')
        // see the other sum message ^^^
        .as('totalCount')

var joinedTrafficStream = storeLBTrafficStream
    |join(sharedLBTrafficStream)
        .as('storelb', 'sharedlb')

joinedTrafficStream
    |alert()
        .warn(lambda: int("storelb.totalCount") > 50000 OR int("sharedlb.totalCount") > 50000 OR int("storelb.totalCount") + int("sharedlb.totalCount") > 50000)
        .crit(lambda: int("storelb.totalCount") > 75000 OR int("sharedlb.totalCount") > 75000 OR int("storelb.totalCount") + int("sharedlb.totalCount") > 75000)
        .message('Production IP {{ index .Tags "ip" }} has exceeded request thresholds for the Store ID {{ index .Tags "id" }} via the Shared LB: {{ index .Fields "sharedlb.totalCount" }} and the Store LB: {{ index .Fields "storelb.totalCount" }} within the last 2 minutes.')
        .stateChangesOnly(5m)
        .noRecoveries()
        .slack()
        .channel('#ops-noise')

joinedTrafficStream
    |alert()
        .warn(lambda: int("storelb.totalCount") > 50000 OR int("sharedlb.totalCount") > 50000 OR int("storelb.totalCount") + int("sharedlb.totalCount") > 50000)
        .crit(lambda: int("storelb.totalCount") > 75000 OR int("sharedlb.totalCount") > 75000 OR int("storelb.totalCount") + int("sharedlb.totalCount") > 75000)
        .message('Production IP {{ index .Tags "ip" }} has exceeded request thresholds for the Store ID {{ index .Tags "id" }} via the Shared LB: {{ index .Fields "sharedlb.totalCount" }} and the Store LB: {{ index .Fields "storelb.totalCount" }} within the last 2 minutes.')
        .stateChangesOnly(5m)
        .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')

Second Round

Here the join stream remains and the alert nodes are removed.

dbrp "toptraffic"."autogen"

var trafficByIPAndStoreIDStream = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(121s)
        // the period for the barrier should be +1 unit over the period defined in the downstream window
        .delete(TRUE)
    |window()
        .period(120s)
        .every(5s)
    |default()
        // count is not set the filebeat-udp messenger, so we add it to each point coming through with a value of 1
        .field('count', 1)

var storeLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "src" == 1)
    // the filebeat-udp message source sets the `src` field key to the value 1 for the store LB
    |sum('count')
        // we sum on count here and not upstream because this operation drops all other fields from the forward point
        .as('totalCount')

var sharedLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "src" == 2)
    // the filebeat-udp message source sets the `src` field key to the value 2 for the shared LB
    |sum('count')
        // see the other sum message ^^^
        .as('totalCount')

var joinedTrafficStream = storeLBTrafficStream
    |join(sharedLBTrafficStream)
        .as('storelb', 'sharedlb')

Third Round

Here the join stream is completely removed.

dbrp "toptraffic"."autogen"

var trafficByIPAndStoreIDStream = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(121s)
        // the period for the barrier should be +1 unit over the period defined in the downstream window
        .delete(TRUE)
    |window()
        .period(120s)
        .every(5s)
    |default()
        // count is not set the filebeat-udp messenger, so we add it to each point coming through with a value of 1
        .field('count', 1)

var storeLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "src" == 1)
    // the filebeat-udp message source sets the `src` field key to the value 1 for the store LB
    |sum('count')
        // we sum on count here and not upstream because this operation drops all other fields from the forward point
        .as('totalCount')

var sharedLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "src" == 2)
    // the filebeat-udp message source sets the `src` field key to the value 2 for the shared LB
    |sum('count')
        // see the other sum message ^^^
        .as('totalCount')

What I settled on, for now, is this:

dbrp "toptraffic"."autogen"

var trafficByIPAndStoreIDStream = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(121s)
        // the period for the barrier should be +1 unit over the period defined in the downstream window
        .delete(TRUE)
    |window()
        .period(120s)
        .every(5s)
    |default()
        // count is not set the filebeat-udp messenger, so we add it to each point coming through with a value of 1
        .field('count', 1)

var storeLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "src" == 1)
    // the filebeat-udp message source sets the `src` field key to the value 1 for the store LB
    |sum('count')
        // we sum on count here and not upstream because this operation drops all other fields from the forward point
        .as('totalCount')

var sharedLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "src" == 2)
    // the filebeat-udp message source sets the `src` field key to the value 2 for the shared LB
    |sum('count')
        // see the other sum message ^^^
        .as('totalCount')

var joinedTrafficStream = storeLBTrafficStream
    |join(sharedLBTrafficStream)
        .as('storelb', 'sharedlb')

joinedTrafficStream
    |alert()
        .warn(lambda: int("storelb.totalCount") > 50000 OR int("sharedlb.totalCount") > 50000 OR int("storelb.totalCount") + int("sharedlb.totalCount") > 50000)
        .crit(lambda: int("storelb.totalCount") > 75000 OR int("sharedlb.totalCount") > 75000 OR int("storelb.totalCount") + int("sharedlb.totalCount") > 75000)
        .message('Production IP {{ index .Tags "ip" }} has exceeded request thresholds for the Store ID {{ index .Tags "id" }} via the Shared LB: {{ index .Fields "sharedlb.totalCount" }} and the Store LB: {{ index .Fields "storelb.totalCount" }} within the last 2 minutes.')
        .stateChangesOnly(5m)
        .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')
        .slack()
        .channel('#ops-noise')

Which combines the alert nodes into one -- I had them broken out purely because I wanted recoveries to goto our pubsub channel but not to slack. This reduces the memory growth a bit. The bandaid I'll apply to this is service restarts at memory thresholds via systemd, I think.

That said, getting the barrier emission through the join node, the memory leak with join, etc... released should help.

joshdurbin commented 3 years ago

The current iteration drops the usage of default node and the explicit type versions required because of it, or the combination of it and the sum operations. It's pretty clear, though, that we're not able to remove or reduce our cardinality and define the TTL on the points as they flow through the system, really, in this case, making the system far less useful. That has, fortunately, been mitigated as suggested in a number of forums by forcing restarts at high water marks for memory.

dbrp "toptraffic"."autogen"

var trafficByIPAndStoreIDStream = stream
    |from()
        .groupBy('ip', 'id')
        .measurement('ips_and_stores')
    |barrier()
        .period(121s)
        // the period for the barrier should be +1 unit over the period defined in the downstream window
        .delete(TRUE)
    |window()
        .period(120s)
        .every(5s)

var storeLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "source" == 'storelb')
    |sum('count')
        // sum on count here as this operation drops all other fields from the point
        .as('totalCount')

var sharedLBTrafficStream = trafficByIPAndStoreIDStream
    |where(lambda: "source" == 'sharedlb')
    |sum('count')
        // sum on count here as this operation drops all other fields from the point
        .as('totalCount')

var joinedTrafficStream = storeLBTrafficStream
    |join(sharedLBTrafficStream)
        // join the sum streams together with their tags, reminder that fields other than totalCount are dropped due to the upstream
        .as('storelb', 'sharedlb')
        .tolerance(1s)

joinedTrafficStream
    |alert()
        .warn(lambda: "storelb.totalCount" > 50000 OR "sharedlb.totalCount" > 50000 OR "storelb.totalCount" + "sharedlb.totalCount" > 50000)
        .crit(lambda: "storelb.totalCount" > 80000 OR "sharedlb.totalCount" > 80000 OR "storelb.totalCount" + "sharedlb.totalCount" > 80000)
        .message('Production IP {{ index .Tags "ip" }} has exceeded request thresholds for the Store ID {{ index .Tags "id" }} via the Shared LB: {{ index .Fields "sharedlb.totalCount" }} and the Store LB: {{ index .Fields "storelb.totalCount" }} within the last 2 minutes.')
        .stateChangesOnly(5m)
        .slack()
        .channel('#ops-noise')
        .exec('/usr/bin/kapacitor_pubsub_stdin_invoker.sh')

with a dot graph of:

digraph ips_and_stores {
graph [throughput="16786.69 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="128532482"];

from1 [avg_exec_time_ns="26.78µs" errors="0" working_cardinality="0" ];
from1 -> barrier2 [processed="128532482"];

barrier2 [avg_exec_time_ns="67.601µs" errors="0" working_cardinality="238931" ];
barrier2 -> window3 [processed="128532108"];

window3 [avg_exec_time_ns="93.931µs" errors="0" working_cardinality="238931" ];
window3 -> where6 [processed="33450748"];
window3 -> where4 [processed="33450748"];

where6 [avg_exec_time_ns="38.962µs" errors="0" working_cardinality="60604" ];
where6 -> sum7 [processed="33450748"];

sum7 [avg_exec_time_ns="46.81µs" errors="0" working_cardinality="60606" ];
sum7 -> join9 [processed="33450748"];

where4 [avg_exec_time_ns="43.106µs" errors="0" working_cardinality="60604" ];
where4 -> sum5 [processed="33450748"];

sum5 [avg_exec_time_ns="46.444µs" errors="0" working_cardinality="60604" ];
sum5 -> join9 [processed="33450748"];

join9 [avg_exec_time_ns="117.824µs" errors="0" working_cardinality="6803685" ];
join9 -> alert10 [processed="33450745"];

alert10 [alerts_inhibited="0" alerts_triggered="0" avg_exec_time_ns="51.397µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="6803683" ];
}

...showing high cardinality from the join, join9, forward. With the throughput rates shown in the dot graph our kapacitor instances need restarting about every 1 hour and 50 minutes with a 45GB memory limit set on the process defined by MemoryMax=45G on the systemd unit file.

conet commented 1 year ago

I'm also seeing a similar behavior using the latest release 1.7.1 here's a top of the heap hprof:

Showing nodes accounting for 9203.41MB, 97.93% of 9398.23MB total
Dropped 243 nodes (cum <= 46.99MB)
Showing top 10 nodes out of 40
      flat  flat%   sum%        cum   cum%
 3537.67MB 37.64% 37.64%  3537.67MB 37.64%  github.com/influxdata/influxdb/models.Tags.Map (inline)
 3410.08MB 36.28% 73.93%  3410.58MB 36.29%  github.com/influxdata/influxdb/models.(*point).unmarshalBinary
 1166.16MB 12.41% 86.33%  1166.16MB 12.41%  github.com/influxdata/kapacitor/edge.(*pointMessage).ShallowCopy
  448.08MB  4.77% 91.10%   448.08MB  4.77%  github.com/influxdata/kapacitor/alert.newHandler
  246.01MB  2.62% 93.72%   246.01MB  2.62%  strings.(*Builder).grow (inline)
  200.91MB  2.14% 95.86%   200.91MB  2.14%  github.com/influxdata/kapacitor.(*windowTimeBuffer).insert
  194.50MB  2.07% 97.93%  7157.76MB 76.16%  github.com/influxdata/kapacitor.(*TaskMaster).WritePoints
         0     0% 97.93%  3410.58MB 36.29%  github.com/influxdata/influxdb/models.(*point).Fields
         0     0% 97.93%   456.60MB  4.86%  github.com/influxdata/kapacitor.(*AlertNode).runAlert
         0     0% 97.93%  1415.18MB 15.06%  github.com/influxdata/kapacitor.(*FromNode).Point

I'm using a lot of stream queries based on this example

conet commented 1 year ago

WritePoints seems to be causing the problem, where is it used?

File: kapacitord
Build ID: 7cdd357954d1cbca82a4f08b6fcbce65e372a1fe
Type: inuse_space
Time: Nov 30, 2023 at 7:06am (EST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 17.23GB, 98.79% of 17.44GB total
Dropped 232 nodes (cum <= 0.09GB)
Showing top 10 nodes out of 41
      flat  flat%   sum%        cum   cum%
    6.84GB 39.20% 39.20%     6.84GB 39.20%  github.com/influxdata/influxdb/models.Tags.Map (inline)
    6.48GB 37.18% 76.38%     6.49GB 37.18%  github.com/influxdata/influxdb/models.(*point).unmarshalBinary
    2.24GB 12.84% 89.21%     2.24GB 12.84%  github.com/influxdata/kapacitor/edge.(*pointMessage).ShallowCopy
    0.50GB  2.84% 92.05%     0.50GB  2.84%  strings.(*Builder).grow
    0.44GB  2.54% 94.59%     0.44GB  2.54%  github.com/influxdata/kapacitor/alert.newHandler
    0.37GB  2.10% 96.69%    13.74GB 78.80%  github.com/influxdata/kapacitor.(*TaskMaster).WritePoints
    0.24GB  1.40% 98.09%     0.24GB  1.40%  github.com/influxdata/kapacitor.(*windowTimeBuffer).insert
    0.12GB   0.7% 98.79%     0.12GB   0.7%  io.ReadAll
         0     0% 98.79%     6.49GB 37.18%  github.com/influxdata/influxdb/models.(*point).Fields
         0     0% 98.79%     0.45GB  2.57%  github.com/influxdata/kapacitor.(*AlertNode).runAlert

influxdata / kapacitor