phaag / nfdump

Netflow processing tools

Aggregated flows consume more space than collected and other strange things with 1.7 #427

Closed: aldem closed this issue 1 year ago

aldem commented 1 year ago

Hi,

Recently I decided to try out 1.7.1 but it feels a bit... different.

In particular:

I didn't change any of the options that were used with 1.6, except for removing -T and -l from nfcapd. In theory, -T could have had some effect on collected file sizes, since I don't need all the data that is sent, but now there is no option to get rid of it.

I am a bit confused and would like to know - did anyone else observe similar effects?

phaag commented 1 year ago

There are a few things to explain, which may help you identify potential bottlenecks and understand why you end up with more CPU usage and bigger files. Well, everything gets bigger these days - right? :)

The file size: nfdump-1.7.x stores flow records in a slightly different binary format. This format change was necessary due to the increasing demand for new flow elements, flexibility for variable-length records, and other record types. The goal was still to optimise speed, overhead and disk space, but this comes at the price of somewhat larger raw collected flow files. The difference depends on the type and number of record elements in the netflow data. As an example, I have a flow file with ~28 million records, which compares as follows:

%ls -al
-rw-r--r--   1 peter  staff  893548622 Mar  4 13:03 nfcapd.1.7.nf
-rw-r--r--   1 peter  staff  845419814 Mar  4 12:53 nfcapd.1.6.x.nf

Both files are lz4 compressed. In this case, the overhead is about 5%. It may of course differ for other compression methods and other types of flow records. Up to 10% seems reasonable; however, 25% is too much. You may verify a file with -v:

% nfdump -v nfcapd.1.7.nf
File       : nfcapd.1.7.nf
Version    : 2 - lz4 compressed
Created    : 2023-03-04 13:02:30
Created by : nfcapd
nfdump     : f1070100
encryption : no
Appdx blks : 1
Data blks  : 3394
Checking data blocks
Checking appendix blocks

Total
Type 3 blocks : 3395
Records       : 28680226

% nfdump -v nfcapd.1.6.x.nf
File       : nfcapd.1.6.x.nf
Version    : 1 - lz4 compressed
Blocks     : 2533
Checking data blocks

Total
Type 2 blocks : 2533
Records       : 28680232

The number of records differs because 1.6.x needs some internal extension records.

Size of aggregated flows: An aggregated flow file is usually smaller, depending on how many flows could be aggregated. Again, using the 28 million record flow file:

-rw-r--r--   1 peter  staff  893548588 Mar  4 13:27 aggregated
-rw-r--r--   1 peter  staff  893548622 Mar  4 13:03 nfcapd.1.7.nf

If you aggregate flows, a new field needs to be added to represent the number of aggregated flows. This adds some overhead, while the aggregation itself reduces the size. Please make sure the resulting file is compressed the same way as the original file.
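For example, to aggregate a file while keeping the same lz4 compression as the original, something along these lines can be used (a sketch built from the options already mentioned in this thread; the file names are illustrative):

## -a aggregates the flows, -y compresses the output with lz4, -w writes the result
% nfdump -r nfcapd.1.7.nf -a -y -w aggregated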

If you have flows which cannot be aggregated, the resulting file could in theory be slightly bigger, although I have never come across this so far.

CPU: This is a bit more complex, as it depends on several variables:

nfdump-1.6.x is not threaded, whereas nfdump-1.7.x is. The parallel tasks are split between collecting flows and writing them into a memory buffer, and compressing those buffers and writing them to disk. Furthermore, nfdump-1.7.x got a more advanced decoder for IPFIX and NetFlow v9 in order to cope with the increased variety of formats and data; for v5 and v1 it is still the same code. As we have seen above, 1.7 increases the file size by, let's say, 5-10%, which is more work to compress and write. On the other hand, the decoder is very fast and allows high packet rates.

If we speak about CPU, it's about the sum of %user + %system + %iowait. Therefore you have to check carefully for a potential bottleneck. If the difference is 1% or 2%, I think this is not really the point. If you have a loaded system, as you describe during a (D)DoS, things matter. nfcapd-1.6 has only one buffer - the system packet buffer (arg -B). Packets are processed, compressed and stored sequentially. This means the collector has to wait until the data is compressed and written. Depending on your system's I/O throughput, this also affects performance. The CPU and the I/O system are both important in this chain. If you have a very fast I/O system, then not compressing at all would be the fastest way to store the data, at the price of disk space. In general, lz4 (-y for nfcapd) is the best trade-off between fast compression and fast writes. bzip2 should be avoided for the collector, as it takes far too much CPU to complete. Use bzip2 for archiving, but never for collecting flows. To see where you spend your CPU (%iowait, %user or idle), check one of the several tools available.
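As a rough sketch (the flow directory, port and rotation interval below are illustrative, not taken from this thread), the collector is started with lz4 compression like this, and the CPU split can be watched with a standard tool such as vmstat:

## start the collector with lz4 compression (-y); path, port and interval are placeholders
% nfcapd -w /var/cache/nfdump -p 9995 -y -t 300
## watch the CPU split: us (%user), sy (%system), wa (%iowait), id (idle)
% vmstat 1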

In nfcapd-1.7.x there are more buffers, and the multithreading allows it to draw more CPU from the system if required. Flows are collected while data buffers are compressed and written. Therefore, under a heavily loaded packet stream, nfcapd-1.7 performs much better than 1.6, but it needs the CPU to complete its tasks. This means nfcapd-1.6 is limited at a certain point and starts dropping packets, but uses less CPU, whereas 1.7 can process more packets at the cost of using more CPU.

Compare the following numbers:

I converted the 28 million record flow file back into a pcap data stream, which results in:

-rw-r--r--   1 peter  staff  5597879468 Mar  4 12:24 flowbig.pcap

So it's about a 5.2 GB pcap file. In order to see how efficiently the collector works, you can feed it the pcap instead of reading from the network. This eliminates the system buffer, and the collector processes the packets in the pcap file as fast as possible. To process the pcap, I use these commands:

% time nfcapd-1.6 -Tall -f flowbig.pcap -l tmp -y -t 3600
...
Ident: 'none' Flows: 28680221, Packets: 570974278, Bytes: 427025989215, Sequence Errors: 5, Bad Packets: 0
Terminating nfcapd.
351.127u 11.109s 6:28.83 93.1%  0+0k 0+0io 1pf+0w

avg CPU was 93%

and

% time nfcapd-1.7 -f flowbig.pcap -w tmp -y -t 3600
...
Ident: 'none' Flows: 28680221, Packets: 570974278, Bytes: 427025989215, Sequence Errors: 5, Bad Packets: 0
Terminating nfcapd.
39.945u 3.228s 0:38.48 112.1%   0+0k 0+0io 27pf+0w

avg CPU was 112%

The tests ran on an older MacBook with an i7 and an SSD.

If you compare the time needed to process the pcap as fast as possible, 1.7 is far faster than 1.6.x, but it needs more CPU. This improvement results from the multithreading and the more efficient decoder. 1.6 processes a network stream of ~127 Mbit/s, whereas 1.7 copes with ~1.1 Gbit/s.
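These rates follow roughly from the pcap size divided by the user CPU time of the runs above (back-of-the-envelope arithmetic, not a separate measurement):

% echo "5597879468 * 8 / 351.1 / 10^6" | bc -l    ## ~127 Mbit/s for nfcapd-1.6
% echo "5597879468 * 8 / 39.9 / 10^9" | bc -l     ## ~1.12 Gbit/s for nfcapd-1.7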

As a result, nfcapd-1.6 will start dropping packets earlier if it cannot process them, due to its limited resources - a single thread.

Last but not least: the missing -T option is the result of many user requests along the lines of "Why do I see only this limited set of elements in a flow record? My router sends much more." The collector now takes everything it understands. Disk space is no longer as much of an issue as it was when nfdump was born.

I hope this sheds some light on a rather complex issue. Sorry for the long answer.

aldem commented 1 year ago

Thank you Peter, your answer definitely made things clearer - and in fact I really like long, detailed answers :)

However, my concerns about disk space are a bit different - I collect into RAM (tmpfs) to avoid thrashing disks, since no disks (within my budget, at least) can handle the peaks, while SSDs die quite fast when used for this purpose. This could probably be mitigated with the highest-end enterprise models, but those are far beyond my budget - two Samsung Datacenter PM893s did not survive even one year (strangely, not because of the TBW limit).
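For illustration, the collection side of such a setup looks roughly like this (mount point, size and collector options are placeholders, not my exact configuration):

## tmpfs mount for the collector's spool directory
# mount -t tmpfs -o size=4G tmpfs /var/cache/nfdump
## the collector writes into RAM; files are processed and moved off later
# nfcapd -w /var/cache/nfdump -p 9995 -y -t 300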

Due to all this, my current processing flow is as follows:

Thus I am trying to reduce the space consumed as much as possible - RAM is very valuable and limited, unlike disk, and even 10% could matter a lot. Using bzip2 is not an option, as compressing one file takes longer than producing it, even in threaded mode.

The ability to filter out unnecessary or unused/uncollected fields would be really nice - while it looks negligible, it actually translates into a significant reduction in size, even after compression.

As to aggregation... I did some experiments, and the results are interesting - it looks like the size increases only for data that is both aggregated and compressed. I have a small log with only 157K records; the results are as follows:

## This one is without any compression
-rw-r--r-- 1 root root 25057757 Mar  4 18:23 nfcapd.20230304181210.raw
# nfdump -v nfcapd.20230304181210.raw
File       : nfcapd.20230304181210.raw
Version    : 2 - not compressed
Created    : 2023-03-04 18:23:15
Created by : nfdump
nfdump     : f1070100
encryption : no
Appdx blks : 1
Data blks  : 24
Checking data blocks
Checking appendix blocks
 -
Total
Type 3 blocks : 25
Records       : 157573

## This one is slightly smaller after aggregation (-a, no compression)
-rw-r--r-- 1 root root 25008909 Mar  4 18:24 nfcapd.20230304181210.raw.agg
# nfdump -v nfcapd.20230304181210.raw.agg
File       : nfcapd.20230304181210.raw.agg
Version    : 2 - not compressed
Created    : 2023-03-04 18:24:24
Created by : nfdump
nfdump     : f1070100
encryption : no
Appdx blks : 1
Data blks  : 24
Checking data blocks
Checking appendix blocks
 -
Total
Type 3 blocks : 25
Records       : 157201

## This one is the lz4-compressed version of .raw (-y); its nfdump -v output is equivalent to .raw
-rw-r--r-- 1 root root  3741267 Mar  4 18:25 nfcapd.20230304181210.lz4
#  ~/bin/nfdump -v nfcapd.20230304181210.lz4
File       : nfcapd.20230304181210.lz4
Version    : 2 - lz4 compressed
Created    : 2023-03-04 18:25:04
Created by : nfdump
nfdump     : f1070100
encryption : no
Appdx blks : 1
Data blks  : 24
Checking data blocks
Checking appendix blocks
 -
Total
Type 3 blocks : 25
Records       : 157573

## And this one is the compressed & aggregated version of .raw (-y -a); its nfdump -v output is equivalent to .agg
-rw-r--r-- 1 root root  4802716 Mar  4 18:25 nfcapd.20230304181210.agg.lz4
# nfdump -v nfcapd.20230304181210.agg.lz4
File       : nfcapd.20230304181210.agg.lz4
Version    : 2 - lz4 compressed
Created    : 2023-03-04 18:25:13
Created by : nfdump
nfdump     : f1070100
encryption : no
Appdx blks : 1
Data blks  : 24
Checking data blocks
Checking appendix blocks
 -
Total
Type 3 blocks : 25
Records       : 157201

As you can see, this is quite a significant difference - 28% - and my only clue is that nfdump adds all available fields to the destination file, even if they were not present in the original file.

I did a similar test with a bigger set of data (~4.8M flows) and the difference was similar.

Maybe my use case is a bit unusual (RAM instead of disks), but nevertheless there are systems (maybe even embedded ones) where disk space is scarce, so it probably still makes sense to have an option to filter out fields which are unused in some scenarios (MAC addresses, MPLS labels etc.).

phaag commented 1 year ago

Thanks for your answer.

The aggregation does not add any unused fields. I add only a single extension, and only if the number of aggregated flows for that record is > 1. Therefore I do not really understand why the overhead is that big. Anyway, you can see which extension elements are used by each flow record by using the -o raw output format:

% nfdump -r flowbig -o raw -c 1

Flow Record:
  RecordCount  =                 1
  Flags        =              0x00 FLOW, Unsampled
  Elements     =                 6: 1 2 4 7 10 12
  size         =               124
  engine type  =                 2
  engine ID    =                 3
  export sysid =                 1
...
more fields

The Elements line shows which extensions were used to compile this record. Extension 1 is always required; all others are added from the template sent by the exporter. If flows are aggregated, extension 5 is added when the number of aggregations is > 1; otherwise no changes are made. Extension 2 or 3 is needed for the IP addresses. All extensions can be reviewed in nfxV3.h.
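A rough way to get an overview of which extension combinations occur in a file, and how often, is to post-process the -o raw output (only a sketch based on the Elements line shown above, not a built-in feature):

% nfdump -r nfcapd.1.7.nf -o raw | grep 'Elements' | sort | uniq -c | sort -rn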

I would be interested to analyse such a file. If it is possible for you to share a non-aggregated nfcapd file which shows this behaviour, you could send it to me by email, bzip2 compressed :), to my address in the AUTHORS file.

I do not want to change the default behaviour of 1.7 for collecting elements, and back-porting the old -T option would inevitably lead to unexpected behaviour, as 1.6 has a different type of element record. That's why it was removed altogether. However, I will check whether a new option could do the trick - e.g. -e <extension-list>.

aldem commented 1 year ago

The file should have arrived in your mailbox by now.

Regarding the filtering of templates on output, I agree that you do not need to restore the previous behavior. A dedicated option would be much better.

However, filtering could be done implicitly when aggregation is requested. In this case, we would know exactly which fields are meaningful, and everything else would not be needed by definition (except for obvious fields like start/stop time, counters, and maybe some other similar fields).

phaag commented 1 year ago

Let's do it in several steps - first I will implement option -X on the collector. Data which is not needed does not need to be stored. The implicit dropping of extensions is a bit more complicated to implement.

phaag commented 1 year ago

Could you try the latest master, which implements option -X?

Only the matching extensions are stored by the collector. The extension IDs correspond to the definitions in nfxV3.h. For the minimal required record, use -X 1,2,3.
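For example, a collector started roughly like this stores only the generic flow extension plus the IPv4/IPv6 address extensions (the port, directory and compression flag are illustrative, following the examples above):

## keep only extensions 1, 2 and 3; path, port and -y are placeholders
% nfcapd -w /var/cache/nfdump -p 9995 -y -X 1,2,3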

Does this make a difference for you? At least the collected flow files are now smaller.

aldem commented 1 year ago

Yes, thank you - it does make a difference when there are millions of flows.

The mystery of the increased size after aggregation remains unsolved, though. Did you have a chance to look at my file?

I see that the number of extensions is reduced after aggregation (exactly as in the source), but the resulting size is still significantly larger; there must be something else which I am unable to figure out yet.

phaag commented 1 year ago

Please update to the master branch! There was a bug in the init functions!

phaag commented 1 year ago

I received the file, but could not yet figure out the mystery. All compression algorithms produce a larger file.

aldem commented 1 year ago

Well, maybe this output from valgrind can give you a hint (I am not yet familiar with the code):

==355767== Use of uninitialised value of size 8
==355767==    at 0x4854F7B: LZ4_putPositionOnHash (lz4.c:505)
==355767==    by 0x4854F7B: LZ4_putPosition (lz4.c:513)
==355767==    by 0x4854F7B: LZ4_compress_generic (lz4.c:691)
==355767==    by 0x4854F7B: LZ4_compress_fast_extState (lz4.c:746)
==355767==    by 0x48562E1: LZ4_compress_fast (lz4.c:765)
==355767==    by 0x48562E1: LZ4_compress_default (lz4.c:776)
==355767==    by 0x4865D16: Compress_Block_LZ4 (nffile.c:251)
==355767==    by 0x4865D16: nfwrite (nffile.c:1243)
==355767==    by 0x4864EDE: nfwriter (nffile.c:1291)
==355767==    by 0x48E0EA6: start_thread (pthread_create.c:477)
==355767==    by 0x4AF1A2E: clone (clone.S:95)

There are many similar errors - and this only happens when I provide the -a flag.

Line numbers may be a bit off though as I am experimenting with my zstd support patch (it does not affect aggregation behavior anyway), but at least you know where to look :)

phaag commented 1 year ago

The compression code is taken from a library. I need to check.

aldem commented 1 year ago

I doubt that this is related to the compression itself (the same code works without -a); most likely something is corrupted during aggregation, or there is some unresolved race in the threading.

phaag commented 1 year ago

I moved this into a new issue, as it is not related to the original topic.

phaag commented 1 year ago

I don't see a code bug behind the larger file size. I guess it happens as a result of the byte sequence. Otherwise, feel free to reopen.