netsampler / goflow2

High performance sFlow/IPFIX/NetFlow Collector
BSD 3-Clause "New" or "Revised" License

Add workers, isBlocking and queue size to listen config #208

Closed. mieczkowski closed this 1 year ago

mieczkowski commented 1 year ago

I was testing flow collection with pmacct and goflow2 and realized that goflow2 is losing packets (goflow2 should have MORE data, because pmacct aggregates flows in memory and flushes the aggregated records)

SELECT *
FROM collector_compare

Query id: 4d8b440f-06ff-4ff8-b732-8292b5b16b44

┌─type───────┬─pmacct_count────┬─goflow_count────┬─diff───────────┬─percent─┐
│ IPFIX      │ 199.16 thousand │ 190.66 thousand │ -8.50 thousand │ -4.3%   │
│ SFLOW_5    │ 33.23 million   │ 24.47 million   │ -8.75 million  │ -26.3%  │
│ NETFLOW_V5 │ 202.00          │ 203.00          │ 1.00           │ 0.5%    │
└────────────┴─────────────────┴─────────────────┴────────────────┴─────────┘

After debugging I realized that with 1 socket = 1 worker and a high volume of flows per second, reading packets is faster than processing them (so the dispatch channel backlog keeps growing). I propose the following (a rough sketch follows the list):

  1. Add config options to the listen URL, with sane defaults (if blocking=0 then queue=1000000, else queue=0; and numWorkers = sockets*2)
  2. In udp.go, increase the default worker count to cfg.Sockets * 2
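Not part of the actual diff, just a minimal sketch of how such listen-URL options could be parsed; the parameter names (workers, blocking, queue_size) and the parseListenURL helper are illustrative, not goflow2's real API:

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

type listenConfig struct {
	Workers   int
	Blocking  bool
	QueueSize int
}

// parseListenURL extracts hypothetical worker/blocking/queue options from a
// listen URL such as sflow://:6343?workers=8&blocking=0&queue_size=1000000.
// Defaults mirror the proposal above but are illustrative.
func parseListenURL(raw string, sockets int) (listenConfig, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return listenConfig{}, err
	}
	q := u.Query()

	cfg := listenConfig{Workers: sockets * 2} // default: twice the socket count
	if v := q.Get("workers"); v != "" {
		if cfg.Workers, err = strconv.Atoi(v); err != nil {
			return listenConfig{}, err
		}
	}
	cfg.Blocking = q.Get("blocking") == "1"
	if cfg.Blocking {
		cfg.QueueSize = 0 // blocking: unbuffered dispatch channel
	} else {
		cfg.QueueSize = 1000000 // non-blocking default from the proposal
	}
	if v := q.Get("queue_size"); v != "" {
		if cfg.QueueSize, err = strconv.Atoi(v); err != nil {
			return listenConfig{}, err
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := parseListenURL("sflow://:6343?workers=8&blocking=0", 4)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg) // {Workers:8 Blocking:false QueueSize:1000000}
}
```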

After changes:

┌─type───────┬─pmacct_count───┬─goflow_count───┬─diff───────────┬─percent─┐
│ IPFIX      │ 10.16 million  │ 10.16 million  │ 274.00         │ 0%      │
│ SFLOW_5    │ 345.81 million │ 484.31 million │ 138.50 million │ 28.6%   │
│ NETFLOW_V5 │ 9.21 thousand  │ 9.20 thousand  │ -7.00          │ -0.1%   │
└────────────┴────────────────┴────────────────┴────────────────┴─────────┘
lspgn commented 1 year ago

Hi @mieczkowski, thank you for the performance comparison. Could you also provide information about the setup (CPU/cores/arch, live traffic/simulator, pps)? After the changes, is there an explanation for SFLOW_5 being +28%?

lspgn commented 1 year ago

I may refactor this PR a bit. I think adding configuration options like workers is good, but I don't want the program to make too many assumptions

mieczkowski commented 1 year ago

Could you also provide information about the setup (CPU/cores/arch, live traffic/simulator, pps)?

Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz, Debian Linux, 8 cores. Real flows from some production environments, 20-25k flows per second.

┌────────────datetime─┬─count()─┐
│ 2023-08-26 17:00:09 │   21774 │
│ 2023-08-26 17:00:08 │   24071 │
│ 2023-08-26 17:00:10 │   20860 │
│ 2023-08-26 17:00:04 │   21283 │
│ 2023-08-26 17:00:03 │   19179 │
│ 2023-08-26 17:00:02 │   20249 │
│ 2023-08-26 17:00:05 │   20164 │
│ 2023-08-26 17:00:01 │   20832 │
│ 2023-08-26 17:00:06 │   20018 │
│ 2023-08-26 17:00:07 │   19880 │
│ 2023-08-26 17:00:00 │   22058 │
└─────────────────────┴─────────┘

After the changes, is there an explanation for SFLOW_5 being +28%?

pmacct aggregates flows, so for example if there are 5 flows with identical parameters within one second, pmacct will group them, put 5 in the flows key, and sum bytes+packets. GoFlow2 sends raw flows, so it has more records in the DB.
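To make the difference concrete, here is a minimal sketch (not pmacct's actual code) of key-based aggregation: flows sharing a key collapse into one row with summed counters, while a raw collector writes one record per flow. The flowKey fields are illustrative:

```go
package main

import "fmt"

type flowKey struct {
	SrcAddr, DstAddr string
	SrcPort, DstPort uint16
	Proto            uint8
}

type flowStats struct {
	Flows, Bytes, Packets uint64
}

func main() {
	raw := []struct {
		Key     flowKey
		Bytes   uint64
		Packets uint64
	}{
		{flowKey{"10.0.0.1", "10.0.0.2", 1234, 443, 6}, 1500, 1},
		{flowKey{"10.0.0.1", "10.0.0.2", 1234, 443, 6}, 3000, 2},
		{flowKey{"10.0.0.3", "10.0.0.4", 5353, 53, 17}, 80, 1},
	}

	// Aggregate by key, summing counters; identical flows collapse into one row.
	agg := map[flowKey]flowStats{}
	for _, f := range raw {
		s := agg[f.Key]
		s.Flows++
		s.Bytes += f.Bytes
		s.Packets += f.Packets
		agg[f.Key] = s
	}
	fmt.Println(len(raw), "raw records ->", len(agg), "aggregated rows")
}
```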

I may refactor a bit this PR

I have nothing against that :) Maybe a good idea would be some "dynamic workers" mechanism: if the dispatch channel is growing, start more workers, and clean them up when traffic falls.

lspgn commented 1 year ago

Maybe a good idea would be some "dynamic workers" mechanism: if the dispatch channel is growing, start more workers, and clean them up when traffic falls.

It's a good idea but I'm not sure the added complexity is worth it: flows will spike periodically (unless there's a DDoS), and I don't think idle workers consume that much, since this is only a single "task" (decoding flows) and it would have to compete against other applications without being able to kick them out. I would rather suggest fixed capacity allocation. Overall I think this is more for platforms like Kubernetes to guarantee a minimum resource, autoscale, or reschedule elsewhere.
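For illustration, a minimal sketch of the fixed-capacity approach: a bounded dispatch channel, a fixed pool of decode workers, and a non-blocking send that counts drops when the queue is full. Names like dispatch and droppedPackets are hypothetical, not goflow2's internals:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const queueSize = 1000
	const numWorkers = 4

	dispatch := make(chan []byte, queueSize) // bounded dispatch channel
	var droppedPackets atomic.Uint64         // would be a Prometheus counter in the app
	var wg sync.WaitGroup

	// Fixed pool of decode workers, sized once at startup.
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pkt := range dispatch {
				_ = pkt // decode the packet here
			}
		}()
	}

	// Non-blocking producer: drop (and count) instead of stalling the socket reader.
	for i := 0; i < 10000; i++ {
		pkt := make([]byte, 64)
		select {
		case dispatch <- pkt:
		default:
			droppedPackets.Add(1)
		}
	}

	close(dispatch)
	wg.Wait()
	fmt.Println("dropped:", droppedPackets.Load())
}
```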

mieczkowski commented 1 year ago

I would also suggest moving the initialization of metrics to a separate package. When I wanted to add a metric for dropped packets (in udp.go) there was an import cycle (metrics imports utils).

lspgn commented 1 year ago

moving the initialization of metrics to a separate package

There is https://github.com/netsampler/goflow2/blob/main/metrics/metrics.go, but I designed it in such a way that it should be callbacks/wrappers called from the main app

mieczkowski commented 1 year ago

But there is an import of the utils package: https://github.com/netsampler/goflow2/blob/main/metrics/decoder.go#L9 So you can't use metrics in the utils pkg (which is where the counter for dropped packets should live)

lspgn commented 1 year ago

So you can't use metrics in the utils pkg (which is where the counter for dropped packets should live)

Correct. I wanted to separate metrics from processing/decoding: for instance, if someone imports the package, it won't impose Prometheus metrics. The GoFlow2 app has a decodeFunc callback which is wrapped by metrics. But I agree there is a need to count the number of dropped packets.
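A minimal sketch of that callback/wrapper idea (the DecoderFunc type and wrapWithMetrics function are illustrative, not goflow2's exact API): the core decoder knows nothing about Prometheus, and the application wraps it before use:

```go
package main

import (
	"fmt"
	"time"
)

// DecoderFunc is a hypothetical decode callback signature.
type DecoderFunc func(msg []byte) error

// wrapWithMetrics decorates a decoder so the application can record
// duration and errors without the core package importing Prometheus.
func wrapWithMetrics(inner DecoderFunc) DecoderFunc {
	return func(msg []byte) error {
		start := time.Now()
		err := inner(msg)
		// In the real app these would update Prometheus counters/histograms.
		fmt.Printf("decode took %v, err=%v\n", time.Since(start), err)
		return err
	}
}

func main() {
	decode := DecoderFunc(func(msg []byte) error { return nil }) // stand-in decoder
	decode = wrapWithMetrics(decode)
	_ = decode([]byte{0x01, 0x02})
}
```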