rkosegi / netflow-collector

Simple Netflow V5 exporter for Prometheus
Apache License 2.0

Memory leaking issue #125

Closed: vaygr closed this issue 1 week ago

vaygr commented 1 month ago

Hello, recently I realized that the netflow-collector container has been eating memory uncontrollably.

[screenshot: shot-2024-10-01_200453]

I'm running the latest 1.0.2 tag with this config:

---
netflow_endpoint: 0.0.0.0:2055
telemetry_endpoint: 0.0.0.0:30001
flush_interval: 120
pipeline:
  filter:
    - local-to-local: true
    - match: source_ip
      is: 0.0.0.0
    - match: source_ip
      is: 255.255.255.255
    - match: destination_ip
      is: 0.0.0.0
    - match: destination_ip
      is: 255.255.255.255
  enrich:
    - protocol_name
  metrics:
    prefix: netflow
    items:
      - name: traffic_detail
        description: Traffic detail
        labels:
          - name: sampler
            value: sampler
            converter: ipv4
          - name: source_ip
            value: source_ip
            converter: ipv4
          - name: destination_ip
            value: destination_ip
            converter: ipv4
          - name: protocol
            value: proto_name
            converter: str

One thing I noticed is that this could be due to the flow_traffic_bytes, flow_traffic_packets, flow_traffic_summary_size_bytes, flow_traffic_summary_size_bytes_sum, and flow_traffic_summary_size_bytes_count metrics coming from the goflow package, which grow for every port but are practically useless. They result in over 20 MB of payload for each scrape. I'm curious whether it would be possible to turn them off, or what else the issue could be. The maximum in the screenshot above is roughly 2 GB of memory.
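
For what it's worth, those series could at least be dropped at scrape time with a Prometheus metric_relabel_configs rule, something like the sketch below (the job name and target are placeholders for my setup). That only shrinks the scrape payload, though; it wouldn't stop the collector process itself from holding them in memory.

scrape_configs:
  - job_name: netflow-collector              # placeholder job name
    static_configs:
      - targets: ['netflow-collector:30001'] # telemetry_endpoint from the config above
    metric_relabel_configs:
      # Drop goflow's built-in per-port series before ingestion;
      # this trims the scrape payload but not the collector's heap.
      - source_labels: [__name__]
        regex: 'flow_traffic_.*'
        action: drop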

rkosegi commented 1 month ago

Hi @vaygr, since you seem to have some sort of monitoring in place, can you share what the graphs look like for go_memstats_heap_alloc_bytes and go_memstats_heap_sys_bytes? I've been running the collector for a long time and never seem to hit any sort of memory leak (I don't have huge traffic either, so I never push it too hard).

I noticed that memory usage stabilizes at a value that depends on two factors: the number of active flows and the flush_interval setting. Here is an example from mine: [image] The higher the value of flush_interval, the more active flows are kept in memory.

Can you share a graph of the count(netflow_flow_traffic_detail{}) metric? It should show cycles whose length depends on flush_interval; here is an example from mine: [image]

In my case, I have flush_interval=36000, which is pretty high but fine for low-volume traffic.

vaygr commented 1 month ago

Here you go:

[screenshot: shot-2024-10-02_064013]

The second graph's range is 1 week:

[screenshot: shot-2024-10-02_064237]

vaygr commented 1 week ago

@rkosegi any idea what could be causing this? Any steps to debug further?

rkosegi commented 1 week ago

> @rkosegi any idea what could be causing this? Any steps to debug further?

I'm not able to replicate your problem using the provided config or some variation of it. A new version of the collector (v1.0.3) has just been released, so please give it a try first.

vaygr commented 1 week ago

It's been running for a few hours, and I honestly don't see any change in behavior. Maybe you could try the Docker image I'm using (vaygr/netflow-collector:1.0.3) and see if you can reproduce this with it?

Other than that, here's a graph with metrics similar to what you posted earlier; I see the memory almost never gets released:

[screenshot: shot-2024-10-26_112058]

That's 1 week of running the collector, and it got to over 1 GB.

vaygr commented 1 week ago

So here's the last 48 hours of running 1.0.3.

[screenshot: shot-2024-10-28_101029]

I also attached an anonymized metrics sample: metrics-anon.txt

rkosegi commented 1 week ago

After looking at the attached metrics, it's clear that the collector itself doesn't pollute the heap much compared to goflow's built-in metrics (flow_traffic_XXX). Looks like https://github.com/cloudflare/goflow/issues/94. Unfortunately, I don't see how this can be disabled programmatically.

vaygr commented 1 week ago

@rkosegi it's still unclear to me why you can't reproduce it, though, and why the memory doesn't get reclaimed in time. Is it because of a different sender (in my case, OPNsense's flowd)?

rkosegi commented 1 week ago

Yeah, if you check the linked issue above, similar behavior is described there with pmacctd, which doesn't happen in my case.

vaygr commented 1 week ago

I see. I guess one option could be switching over to https://github.com/netsampler/goflow2, given the goflow project's stagnation.

For me the workaround could be either applying https://github.com/cloudflare/goflow/pull/95 during the build or limiting RAM resources at the container level.

rkosegi commented 1 week ago

Moving to goflow2 seems like the logical choice, but I can't promise any ETA. Pull requests are always welcome, btw.

vaygr commented 1 week ago

I tested both container limits and scheduled restarts -- both work well. Thanks a lot for your help in troubleshooting this.
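
For reference, the container-level cap can be expressed roughly like this in Compose (a sketch; the 512m limit is illustrative and should be tuned to your traffic):

services:
  netflow-collector:
    image: vaygr/netflow-collector:1.0.3
    restart: unless-stopped
    # Hard memory cap: if the process exceeds it, the container is
    # OOM-killed and brought back by the restart policy.
    mem_limit: 512m
    ports:
      - "2055:2055/udp"   # netflow_endpoint
      - "30001:30001"     # telemetry_endpoint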

At this point I'll leave the decision and ETA for migrating to goflow2 up to you, but yes, as you noted, it could be a better-supported backend, so it might make sense to migrate as soon as there are resources.