netsampler / goflow2

High performance sFlow/IPFIX/NetFlow Collector
BSD 3-Clause "New" or "Revised" License

Resource usage Goflow v2 #217

Closed by bpereto 9 months ago

bpereto commented 9 months ago

**Describe the bug**
I'm testing the new v2 version of goflow2 and see a steady increase in memory consumption when running it at ca. 20k flows/s, until the OOM killer kicks in.

I encountered it with the normal file transport as well as with a custom Pulsar transport and the protobuf format (not the binary one).

Included in this report is a heap dump, taken with the console output and the default JSON format, indicating a memory issue in github.com/netsampler/goflow2/v2/utils.glob..func1.

The UDP buffers are stable and not overflowing, so I assume the problem is in the sync.Pool handling.

**To Reproduce**
Steps to reproduce the behavior:

1. Run GoFlow2: `./goflow2.v2 -listen=netflow://:3001 -profile` (note: `-profile` is a pprof flag I added).
2. Watch memory consumption increase until the OOM kill.

**Expected behavior**
Stable memory usage.

**Captures**

**Heap**

```
File: goflow2.v2
Type: inuse_space
Time: Sep 27, 2023 at 1:58pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 5.06GB, 99.54% of 5.08GB total
Dropped 36 nodes (cum <= 0.03GB)
      flat  flat%   sum%        cum   cum%
    5.03GB 98.96% 98.96%     5.03GB 98.96%  github.com/netsampler/goflow2/v2/utils.glob..func1
    0.03GB  0.59% 99.54%     0.04GB  0.77%  net.(*UDPConn).ReadFromUDP (inline)
         0     0% 99.54%     5.07GB 99.73%  github.com/netsampler/goflow2/v2/utils.(*UDPReceiver).receive
         0     0% 99.54%     5.07GB 99.73%  github.com/netsampler/goflow2/v2/utils.(*UDPReceiver).receivers.func1
         0     0% 99.54%     5.03GB 98.96%  sync.(*Pool).Get
```
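(For context on the top entry: `glob..func1` is pprof's name for a package-level anonymous function, presumably the `New` func of the receive-buffer `sync.Pool`. A minimal sketch of that pattern, with names of my own and not goflow2's actual code:)

```go
package main

import (
	"fmt"
	"sync"
)

// A package-level pool with an anonymous New func: pprof reports that func
// as <package>.glob..func1, which is why the heap profile above attributes
// the buffer allocations to utils.glob..func1.
var packetPool = sync.Pool{
	New: func() any { return make([]byte, 9000) }, // 9000-byte receive buffer
}

func main() {
	buf := packetPool.Get().([]byte) // allocates via New whenever the pool is empty
	fmt.Println(len(buf))
	packetPool.Put(buf) // without this Put, every Get keeps allocating
}
```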


![Screenshot 2023-09-27 at 14 29 43](https://github.com/netsampler/goflow2/assets/1131885/b732eac1-70e5-4ae0-93b9-83e06260a6dc)

![profile002](https://github.com/netsampler/goflow2/assets/1131885/93b87594-cedb-4bf4-ad7e-c14266f29c22)

**GoFlow2:**
 - Version: 34a0c1618e6fcf0d509c4553b3236b4aa85c71ca (master)
 - Environment: Linux
 - OS: Rocky Linux 9.1

lspgn commented 9 months ago

Hi @bpereto, thank you for the report. My hunch is that you might be receiving more flows than you can process.

Some questions:

1. Have you tried with commit https://github.com/netsampler/goflow2/commit/9f2b0806f0bdaf409fd682f4194df202922104ac?

2. So far it seems that the affected function is:

   https://github.com/netsampler/goflow2/blob/34a0c1618e6fcf0d509c4553b3236b4aa85c71ca/utils/udp.go#L32-L36

   which is expected to be the one allocating the memory, but the message should be put back into the pool at:

   https://github.com/netsampler/goflow2/blob/34a0c1618e6fcf0d509c4553b3236b4aa85c71ca/utils/udp.go#L219

   (a minimal sketch of this Get/Put pattern follows after this list). Are you running it with custom garbage collection settings?

3. Are you specifying a queue size? If not, could you run with `-listen netflow://:3001/?queue_size=100000`? I'm expecting it to cap at around 900MB of RAM and start dropping UDP packets (or stop processing).

4. Could you provide me with the list of goroutines? There could be something blocked at decoding that keeps allocating into the dispatch channel (the default setting amounts to about 9GB: 1,000,000 messages of 9000 bytes).

5. Lastly: what are the machine specs? 20k flows per second is quite high for a single machine. It could be that the packets aren't processed fast enough (htop should indicate whether all the processors are used). If not, also try `-listen "netflow://:3001/?queue_size=100000&count=8"` (or however many CPUs you have) and make sure the GOMAXPROCS environment variable is not set to a smaller number of CPUs.
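To illustrate question 2, here is a minimal sketch of the Get/Put pattern those two links point at. It is not goflow2's actual code (names and sizes are assumptions), but it shows why a slow consumer makes the heap profile blame `sync.(*Pool).Get`: every message waiting in the dispatch channel pins a pooled buffer until a worker returns it.

```go
package main

import (
	"log"
	"net"
	"sync"
)

type message struct {
	buf  []byte
	size int
}

// Pool of receive buffers; Get allocates through New when the pool is empty.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 9000) },
}

// receive reads packets and hands them to the dispatch channel. If workers
// are slower than the input rate, dispatch fills up and every queued message
// keeps its 9000-byte buffer alive, so the heap profile points at the pool.
func receive(conn *net.UDPConn, dispatch chan<- message) error {
	for {
		buf := bufPool.Get().([]byte)
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			bufPool.Put(buf)
			return err
		}
		dispatch <- message{buf: buf, size: n} // blocks once the channel is full
	}
}

// worker decodes a message and returns its buffer to the pool.
func worker(dispatch <-chan message) {
	for m := range dispatch {
		_ = m.buf[:m.size] // decoding would happen here
		bufPool.Put(m.buf) // the Put that keeps memory bounded
	}
}

func main() {
	// A bounded queue (cf. ?queue_size=100000) caps buffer memory at roughly
	// queue_size * 9000 bytes, i.e. ~900MB here; an oversized queue lets it grow.
	dispatch := make(chan message, 100000)
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 3001})
	if err != nil {
		log.Fatal(err)
	}
	go worker(dispatch)
	log.Fatal(receive(conn, dispatch))
}
```

In this sketch, the channel capacity is what `queue_size` bounds and the number of workers is what `count` adds, which is what questions 3 and 5 are probing.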

Per your screenshot, it took 345 s to use 11GB of RAM => 20.18 MB/s of allocations => which suggests the dispatch channel is growing by about 2243 messages/sec. This would mean that at a rate of 20k flows per second there are only enough resources to process 17.7k flows per second. At this rate, a single CPU needs to process a message in less than 50 microseconds (you can find this information with the flow_decoding_time_seconds Prometheus metric). More workers and sockets should hopefully help.
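As a back-of-the-envelope check of those numbers (the 20.18 MB/s allocation rate and the 9000-byte message size are taken from the paragraph above, not re-measured):

```go
package main

import "fmt"

func main() {
	const (
		allocRateMBps = 20.18   // observed allocation rate from the heap growth, MB/s
		msgSizeBytes  = 9000.0  // size of one queued message buffer
		inputRate     = 20000.0 // incoming flows per second
	)

	backlogGrowth := allocRateMBps * 1e6 / msgSizeBytes // ~2242 messages/s piling up
	processed := inputRate - backlogGrowth              // ~17.7k flows/s actually handled
	budgetMicros := 1e6 / inputRate                     // 50 µs per message on one CPU

	fmt.Printf("backlog growth: %.0f msg/s\n", backlogGrowth)
	fmt.Printf("processed:      %.1fk flows/s\n", processed/1000)
	fmt.Printf("budget:         %.0f µs per message\n", budgetMicros)
}
```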

Hopefully this will allow me to isolate the issue.

bpereto commented 9 months ago

Hi,

Thanks for your suggestions.

> Have you tried with commit https://github.com/netsampler/goflow2/commit/9f2b0806f0bdaf409fd682f4194df202922104ac?

I'm using https://github.com/netsampler/goflow2/commit/34a0c1618e6fcf0d509c4553b3236b4aa85c71ca (main), so that should be included. Also, that fix is in producer_sf.go and I'm using the NetFlow/IPFIX one.

> Are you running it with custom garbage collection settings?

No, no custom settings.

> Could you provide me with the list of goroutines? There could be something blocked at decoding that keeps allocating into the dispatch channel (the default setting amounts to about 9GB: 1,000,000 messages of 9000 bytes).

How can I do that?

> Are you specifying a queue size? If not, could you run with `-listen netflow://:3001/?queue_size=100000`? I'm expecting it to cap at around 900MB of RAM and start dropping UDP packets (or stop processing).
> What are the machine specs? 20k flows per second is quite high for a single machine.

With v1 on the same single instance (and multiple processes) I'm able to process 10x the amount. It probably is indeed missing CPU resources, as the following testing shows; the resources used were quite close to the limit during the previous testing.

Results

v1

v2 (https://github.com/netsampler/goflow2/commit/34a0c1618e6fcf0d509c4553b3236b4aa85c71ca)

The decoding times, despite different metric names and formats, are similar or even faster with v2. I'm a bit surprised that the CPU load more than doubled with v2, so I should resize my setup.

Note to myself: do isolated testing with enough resources. Case closed, then? Or, if you have a hint on how I can trace which routine causes the CPU increase, maybe we can reduce the resources needed for processing.

bpereto commented 9 months ago

OK, further testing with different outputs yields different results.

On a 16 vCPU machine running goflow2 v2 with `-listen netflow://:3001/?count=4` and currently 40k flows/s of input:

It's plausible that the JSON formatting in https://github.com/netsampler/goflow2/blob/34a0c1618e6fcf0d509c4553b3236b4aa85c71ca/producer/proto/messages.go#L88 needs more compute resources.
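If it helps to quantify that, the per-message formatting cost can be measured with a small Go benchmark and compared against the ~50 µs budget mentioned earlier. The `flowRecord` type below is a made-up stand-in, not goflow2's FlowMessage:

```go
package formatbench

import (
	"encoding/json"
	"testing"
)

// flowRecord is a stand-in for a decoded flow; the fields are illustrative only.
type flowRecord struct {
	SrcAddr string `json:"src_addr"`
	DstAddr string `json:"dst_addr"`
	Bytes   uint64 `json:"bytes"`
	Packets uint64 `json:"packets"`
	Proto   uint32 `json:"proto"`
	SrcPort uint32 `json:"src_port"`
	DstPort uint32 `json:"dst_port"`
}

var sink []byte // prevents the compiler from optimizing the marshal away

// BenchmarkJSONFormat measures reflection-based JSON encoding of one record.
func BenchmarkJSONFormat(b *testing.B) {
	rec := flowRecord{SrcAddr: "10.0.0.1", DstAddr: "10.0.0.2", Bytes: 1500, Packets: 1, Proto: 6, SrcPort: 443, DstPort: 51234}
	for i := 0; i < b.N; i++ {
		out, err := json.Marshal(&rec)
		if err != nil {
			b.Fatal(err)
		}
		sink = out
	}
}
```

Running `go test -bench=.` reports ns/op, which maps directly onto the per-message budget.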

lspgn commented 9 months ago

Hello, thank you for this info. Could you try with `count=16&queue_size=100000`? And also with just the queue size.

Some of the changes in v2 were made to the conversion to protobuf (the producer module), which now uses reflection; it may be more impactful than I initially thought. The tradeoff was allowing more custom fields to be mapped. A custom producer that does the mapping (in addition to the formatter), compiled in, should solve it.

As for listing the goroutines, you need to enable pprof (`import _ "net/http/pprof"`) and fetch the debug page at /debug/pprof/goroutine (it should be on the same endpoint as the metrics).
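A minimal, self-contained sketch of wiring that up (the listen address is just an example; goflow2 exposes this through its own flags):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running, the goroutine dump is at
	// http://localhost:6060/debug/pprof/goroutine?debug=2
	// and the heap profile at http://localhost:6060/debug/pprof/heap.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```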

bpereto commented 9 months ago

I would say it's similar.

(16 vCPU, 45k f/s)

lspgn commented 9 months ago

@bpereto thanks. How is the RAM profile? Does it still OOM after 10 min?

Do the logs confirm that count and queue_size show the entered values?

bpereto commented 9 months ago

Hi,

I didn't record all the RAM profiles during the measurements, but on average it stays the same. With an increased number of sockets, the UDP memory is a little higher on average.

So no OOM after 10 min; memory is already stable after 1 min.

> Some of the changes in v2 were made to the conversion to protobuf (the producer module), which now uses reflection; it may be more impactful than I initially thought. The tradeoff was allowing more custom fields to be mapped.

The mapping is really useful, but it only works in the JSON and text outputs (FormatMessageReflectCustom) with said reflection. I have currently extended the protobuf schema to also include the mapped fields in the binary, or in my case protobuf, format.

For me the issue is resolved, as the resource consumption with the binary/protobuf output format is on average the same for v1 and v2.

lspgn commented 9 months ago

Hi @bpereto, Is this resolved?

The mapping should work with every output. Here is an example mapping.yaml:

```yaml
formatter:
  protobuf: # manual protobuf fields addition
    - name: flow_direction
      index: 42
      type: varint
ipfix:
  mapping:
    - field: 61
      destination: flow_direction
```

IPFIX fields identified by 61 will be serialized into the protobuf as a varint at index 42. Your tools (not GoFlow2) that decode the binary format must use a custom flow.proto with:

```proto
message FlowMessage {
  uint64 flow_direction = 42;
}
```

Otherwise, those fields, having no declared type, will only be available via protoreflect as "unknown data", since deserializers need a type.
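If regenerating flow.proto on the consumer side is not an option, the custom field can still be read out of the raw bytes by walking the wire format. A sketch using google.golang.org/protobuf/encoding/protowire, assuming field number 42 with the varint type from the mapping above:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
)

// extractVarintField scans one serialized FlowMessage for a varint field with
// the given number. Regenerating the schema is cleaner; this only shows that
// the data is reachable without it.
func extractVarintField(msg []byte, want protowire.Number) (uint64, bool) {
	for len(msg) > 0 {
		num, typ, n := protowire.ConsumeTag(msg)
		if n < 0 {
			return 0, false
		}
		msg = msg[n:]
		if num == want && typ == protowire.VarintType {
			v, m := protowire.ConsumeVarint(msg)
			if m < 0 {
				return 0, false
			}
			return v, true
		}
		// Skip over any other field's value.
		if n = protowire.ConsumeFieldValue(num, typ, msg); n < 0 {
			return 0, false
		}
		msg = msg[n:]
	}
	return 0, false
}

func main() {
	var raw []byte // one binary-format FlowMessage read off the wire
	if v, ok := extractVarintField(raw, 42); ok {
		fmt.Println("flow_direction =", v)
	}
}
```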

bpereto commented 9 months ago

Yes, this is resolved.

Thank you for the explanation of the mapping.