phaag / nfdump

Netflow processing tools

ZSTD compression #444

Closed thezoggy closed 1 year ago

thezoggy commented 1 year ago

Zstandard (aka zstd)

I mention zstd as a possible option to add since the author of lz4 also created zstd. It is a bit more modern and offers more flexibility in compression levels.

https://en.wikipedia.org/wiki/Zstd

From a quick glance it looks like it would offer slightly better compression with a minimal performance hit, which could mean decent space savings for those that run decently sized instances (we have ~35 TB for 14 days of retention at our sampling rate), especially since our storage is all SSD on a vSAN cluster with plentiful resources.

aldem commented 1 year ago

Actually, I have a patch that does exactly this. It is a bit unpolished for now, but it works.

The compression speed is good enough to use during collection (bz2 is too slow), and the space savings are quite impressive compared to LZO/LZ4.

I could try to prepare a PR, but it will probably take some time.

phaag commented 1 year ago

Thanks for the offer, but it’s not a big deal to implement. In a few days, it should be done.

aldem commented 1 year ago

Fine, less work for me :)

But please consider a configurable compression level, something like -z=zstd:level or similar.

phaag commented 1 year ago

The latest master contains code to integrate zstd compression. Make sure libzstd is installed and can be found by configure; otherwise use --with-zstdpath=PATH.

At the end of the configure run you can see whether libzstd is included or not. For now, it is not mandatory.

Use -z=zstd or -z=zstd:<level> to enable zstd compression.

Notes: lz4 and zstd accept compression levels. Due to the exponential growth in compression time, you should not go beyond -z=lz4:9 or -z=zstd:10. Best compression is still bz2; even zstd:22 cannot beat it. It is recommended to work without explicit compression levels unless you know what you are doing.
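Putting the flags above together, typical invocations look like this (file names and the filter are placeholders, not from the thread):

```shell
# filter a file and write the output zstd-compressed at the default level
nfdump -r input.nf -w output.nf -z=zstd 'proto tcp'

# pin an explicit level (higher levels: smaller files, much more CPU)
nfdump -r input.nf -w output.nf -z=zstd:10

# recompress an existing file in place
nfdump -r input.nf -J zstd
```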

A few timing numbers for compressing an uncompressed file of 28.7 million flows. The file starts uncompressed, and only the compression is changed with -J:


Uncompressed:
-rw-r--r--@ 1 peter  staff  2528039869 May 27 21:14 flows.nf

time src/nfdump/nfdump -r flows.nf -J lzo
real    0m3.617s user   0m3.356s sys    0m0.526s
-rw-r--r--@ 1 peter  staff  847153982 May 27 18:46 flows.nf

time src/nfdump/nfdump -r flows.nf -J lz4
real    0m3.301s user   0m3.041s sys    0m0.535s
-rw-r--r--@ 1 peter  staff  846668396 May 27 18:46 flows.nf

time src/nfdump/nfdump -r flows.nf -J bz2
real    2m4.210s user   2m2.796s sys    0m1.691s
-rw-r--r--@ 1 peter  staff  443767914 May 27 18:48 flows.nf

time src/nfdump/nfdump -r flows.nf -J zstd
real    0m5.573s user   0m5.264s sys    0m0.565s
-rw-r--r--@ 1 peter  staff  550447050 May 27 18:49 flows.nf

time src/nfdump/nfdump -r flows.nf -J lz4:3
real    0m3.292s user   0m3.040s sys    0m0.517s
-rw-r--r--@ 1 peter  staff  846668396 May 27 18:49 flows.nf

time src/nfdump/nfdump -r flows.nf -J zstd:3
real    0m5.629s user   0m5.287s sys    0m0.602s
-rw-r--r--@ 1 peter  staff  550447050 May 27 18:49 flows.nf

time src/nfdump/nfdump -r flows.nf -J lz4:9
real    1m4.606s user   1m4.024s sys    0m0.840s
-rw-r--r--@ 1 peter  staff  612697129 May 27 18:50 flows.nf

time src/nfdump/nfdump -r flows.nf -J zstd:9
real    0m22.741s user  0m22.328s sys   0m0.696s
-rw-r--r--@ 1 peter  staff  504084398 May 27 18:51 flows.nf

time src/nfdump/nfdump -r flows.nf -J lz4:12
real    7m43.273s user  6m37.387s sys   0m1.322s
-rw-r--r--@ 1 peter  staff  608247491 May 27 18:58 flows.nf

time src/nfdump/nfdump -r flows.nf -J zstd:22
real    95m8.672s user  7m20.529s sys   0m2.024s
-rw-r--r--@ 1 peter  staff  454430966 May 27 20:34 flows.nf

Please let me know if this works for you. I am thinking of removing the compression levels again, as they are of little use, at least for real-time collection.
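As a sanity check on the byte counts above, the compression ratios can be computed directly:

```shell
# Ratios relative to the uncompressed file (2528039869 bytes),
# using the default-level results posted above
awk 'BEGIN {
  orig = 2528039869
  printf "lz4  %.2fx\n", orig / 846668396
  printf "zstd %.2fx\n", orig / 550447050
  printf "bz2  %.2fx\n", orig / 443767914
}'
```

So at default levels, zstd lands roughly halfway between lz4 and bz2 in compression ratio while staying close to lz4 in speed.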

aldem commented 1 year ago

Great news, thank you!

But please don't remove compression level tuning, at least for zstd - depending on the CPU power available, it makes sense to be able to adjust it for real-time collection.

When I switched to zstd on collection (level 5), it saved me approximately half the space compared to lz4, with comparable CPU usage. It could be tuned even further for netflow collection (there are lots of knobs), but even in the simple case it already saves a lot.

Although bz2 gives the best compression, it is so slow that it makes little sense unless you can parallelize it and have a lot of CPU power - in single-threaded mode it is simply too slow. In my case it takes more than 2 minutes to compress 1 minute of collected data, and it makes little sense to dedicate several cores just for this purpose, especially given current electricity costs - a high-end CPU under full load easily draws more than 150 W, and that is just for one collector. bz2 also uses more CPU and is slower on decompression.
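The "can't keep up" argument is easy to quantify from the timing numbers posted earlier in the thread (~2.53 GB of flows compressed in ~124 s); this is a back-of-envelope sketch, and real throughput will vary with CPU and data:

```shell
# Single-threaded bz2 throughput implied by the benchmark above:
# 2528039869 bytes in 124.2 s. A collector ingesting flows faster
# than this per compression thread will fall behind.
awk 'BEGIN {
  bytes = 2528039869; secs = 124.2
  printf "bz2 single-thread throughput: ~%.0f MB/s\n", bytes / secs / 1e6
}'
```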

thezoggy commented 1 year ago

I believe the compression section of the README also needs updating: https://github.com/phaag/nfdump#compression

phaag commented 1 year ago

@thezoggy - right - the Sponsor link is not yet prominent enough :)

phaag commented 1 year ago

@aldem - The point is that neither lz4 nor zstd is usable at its maximum compression level. bz2 is way faster and better than lz4 or zstd above a certain level. Therefore it may be useful to clamp the level at some point. As for threading - several parallel writer threads could still improve write performance at the cost of CPU, if that is an issue. Up to now, one processing thread and one writer thread handle all the data. More writer threads could be added. I am not yet sure what the best or sweet-spot number of threads is. I need some more testing.

aldem commented 1 year ago

@phaag I agree, the cap might be necessary, but I got the impression that you wanted to remove tuning completely.

As for the number of threads: if I were implementing this, I would allow an option to specify the desired number of threads. Personally, I think hard-coding anything that can easily be changed at runtime is not a good idea.

phaag commented 1 year ago

Well - there are many options. Instead of integer levels, the options could be fast, medium, and efficient. Three levels should fit almost anybody, and they can be hard-wired internally. It makes things easier for the user, as the levels can already be properly adjusted. Having too many options may not be good, or may be too frustrating when trying to find the right one. The same applies to the number of writer threads. It depends on the number of flows you want to receive per second and store properly compressed. The question is simply: is one thread good enough, or do we need more? If many threads are required, then it makes no difference whether 2 or 20. Maybe 19 are inactive and 1 is working. So why complicate the code if one is just good enough? I have not yet found an answer and need more testing.

aldem commented 1 year ago

I doubt that you could test all possible combinations of compiler, CPU model/architecture, and inputs to choose the "right" values for fast/medium/efficient - depending on many factors, your choice of "fast" may be too slow for a specific scenario, while "efficient" might underuse the available CPU power.

I could understand your doubts if you had to do a lot of coding and subsequent maintenance to support various levels, but in this particular case it is a one-time effort - and you give the user full control while still providing sensible defaults (or meta-levels) based on your own testing. zstd (the app) has so many options for a reason - and the reason is to give us, the users, the freedom to adjust to our needs and specific scenarios. Nothing is more frustrating than a lack of choice where it is expected and could be provided almost for free :)

Likewise, a runtime-configurable number of threads allows fine-tuning for specific scenarios - high-traffic, low-traffic, etc. - but you can never be sure that the value you choose based on your own experience will fit a specific use case. Maybe someone has a 64-core ARM CPU and is willing to dedicate half of those cores to nfdump, while someone else would be unhappy to allocate more than 2-4 cores even on a 128-core CPU.

If compression takes more than half of the time needed for collection, I would spawn an additional thread anyway (up to the specified maximum), just to make sure that everything collected is stored in time, without any lag (I have had such cases already, and it hit me badly).

Even for re-compression it still makes sense to allow tuning - some may have enough cores to re-compress in real time with bz2, while some will never use compression at all, or will use only lz4.

gabrielmocan commented 1 year ago

@phaag is it now mandatory to have zstd in order to compile from source? Can I disable it? I'd rather not add yet another dependency to my container...

phaag commented 1 year ago

So far, it’s not mandatory. If configure does not find it, it’s not integrated. You can see it in the summary at the end of the configure run. As of now, libbzip2 is mandatory, so I would like to streamline the two: either both optional or both mandatory. Any preferences?
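The build-time behaviour described above follows the usual autotools flow; a sketch (the install prefix path is a placeholder, --with-zstdpath is the flag named earlier in the thread):

```shell
# point configure at a non-standard libzstd location if needed
./configure --with-zstdpath=/opt/zstd
# the summary printed at the end of the configure run shows
# whether libzstd was found and compiled in
make && make install
```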

phaag commented 1 year ago

As for the number of writer threads, this could be implemented reasonably fast, as the code is already pretty close. It will improve compressed-write performance. On low-volume networks, the additional threads will sleep and not add any CPU load. The same goes for the reader threads. Those are much lighter but could still improve nfdump throughput.

phaag commented 1 year ago

The current master repo adds dynamic writer threads. In nfdump you can limit them with the -W argument. All collectors have dynamic writer threads enabled at startup. The number of threads depends on the number of cores online on the current kernel. @all - please test and report back any bugs.

gabrielmocan commented 1 year ago

@phaag optional gives more flexibility. In my use case, I don't use compression at all.

phaag commented 1 year ago

If no compression is enabled, the number of writers is automatically limited to 1. There is no difference from before. However, if you use -z=lz4, the speed is next to identical to no compression. So feel free to play.

gabrielmocan commented 1 year ago

I'm just trying to reduce dependencies to get a slimmer Docker image =D

gabrielmocan commented 1 year ago

As of now, my final image has to include libbz2-dev, otherwise nfcapd/sfcapd won't run, even though I don't use compression.

Background: I don't store raw nfcapd files; I process them with go-nfdump and store them in QuestDB (https://github.com/questdb/questdb), then delete the raw files.

aldem commented 1 year ago

@phaag Not exactly a "bug", but... I tried to run with -r flows-10m -J zstd:5 -w flows-10m.zstd (my mistake - I actually wanted to use -zzstd:5), and got:

File flows-10m compression changed

However, the compression wasn't changed: there was no output file named flows-10m.zstd; instead there was a file flows-10m-tmp.

Otherwise (with the proper arguments) everything works as expected (I didn't try collection yet); the multi-core speed gains are impressive. 👍

phaag commented 1 year ago

Yes - as of now, and for historic reasons, libbz2 is mandatory. As I wrote above, I could imagine making both external compression libraries optional.

> @phaag Not exactly a "bug", but... I tried to run with -r flows-10m -J zstd:5 -w flows-10m.zstd (my mistake - I actually wanted to use -zzstd:5), and got:
>
> File flows-10m compression changed
>
> However, the compression wasn't changed: there was no output file named flows-10m.zstd; instead there was a file flows-10m-tmp.
>
> Otherwise (with the proper arguments) everything works as expected (I didn't try collection yet); the multi-core speed gains are impressive. 👍

@aldem -r .. -J .. recompresses the file; anything else on the command line is ignored. The ..-tmp file you saw is the temporary file that is created with the new compression and then renamed back to the original file name.

Yes - the gain is rather substantial - that's what cores are for :)

gabrielmocan commented 1 year ago

@phaag thumbs up for making compression libraries optional πŸ‘πŸ»

aldem commented 1 year ago

@phaag Well, I found out why it was not renamed in my case - the original file was not writable (different user). I would expect a clear message about this issue, though (you don't check whether rename() succeeds).

phaag commented 1 year ago

> @phaag Well, I found out why it was not renamed in my case - the original file was not writable (different user). I would expect a clear message about this issue, though (you don't check whether rename() succeeds).

This can be fixed ..
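The fix being discussed amounts to checking the rename step instead of failing silently; a minimal hypothetical sketch (file names are placeholders, not nfdump's actual code):

```shell
#!/bin/sh
# Recompress into a temp file, then rename over the original -
# and report if the rename fails (e.g. the original is not writable),
# instead of silently leaving only the -tmp file behind.
orig="flows.nf"
tmp="${orig}-tmp"
printf 'recompressed data' > "$tmp"   # stand-in for writing the new file
if mv -f "$tmp" "$orig"; then
    echo "File $orig compression changed"
else
    echo "error: could not replace $orig - recompressed data left in $tmp" >&2
fi
```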

phaag commented 1 year ago

@thezoggy - I will close this ticket unless you have further input. I hope zstd works for you.

thezoggy commented 1 year ago

The Docker packages might need updating to reflect the newly required zstd libs.

ubuntu: libzstd1? libzstd-dev? https://github.com/phaag/nfdump/blob/master/extra/docker/Dockerfile.ubuntu

alpine: zstd-libs? zstd-dev? https://github.com/phaag/nfdump/blob/master/extra/docker/Dockerfile.alpine

thezoggy commented 1 year ago

Using a random router's single time bucket, 1.2G.

setup

Normally we use lz4, so I uncompressed the file to use as a base. After each compression run, I renamed the resulting file to make it easier to see the result.

:~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J 0

Compressed using defaults. ~The times might be slightly unreliable since this is a live box with plenty of stuff going on, but should still work for what I'm focusing on.~ Redid the work with everything stopped so I could get more accurate times.

time ~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J lzo
File nfcapd.202306010000 compression changed
3.436u 1.245s 0:01.63 286.5%  0+0k 456+751296io 5pf+0w

time ~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J lz4
File nfcapd.202306010000 compression changed
3.246u 1.291s 0:01.97 229.9%  0+0k 0+736648io 0pf+0w

time ~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J bz2
File nfcapd.202306010000 compression changed
151.111u 3.831s 0:10.33 1499.9% 0+0k 8+329016io 1pf+0w

time ~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J zstd
File nfcapd.202306010000 compression changed
7.186u 1.243s 0:02.34 359.8%  0+0k 0+463592io 0pf+0w

You can see that zstd takes slightly more time and CPU to compress than lz4, but results in a noticeably smaller file.

results:

:~/test-nflow> ls -alh --sort=size
-rw-r--r--  1 zog noc 1.2G Jun  3 03:48 nfcapd.202306010000.none
-rw-r--r--  1 zog noc 367M Jun  3 04:11 nfcapd.202306010000.lzo
-rw-r--r--  1 zog noc 360M Jun  3 03:50 nfcapd.202306010000.lz4
-rw-r--r--  1 zog noc 227M Jun  3 03:51 nfcapd.202306010000.zstd
-rw-r--r--  1 zog noc 161M Jun  3 03:51 nfcapd.202306010000.bz2

lz4: compression levels [3-12] (default: 9); zstd: compression levels [1-19] (default: 3)

While default zstd already takes a bit more time than default lz4, trying lz4:12 we see it takes far more CPU and time and still results in a larger file than default zstd. Increasing the zstd compression level to 5 or 6 keeps the elapsed time in the same ballpark as default lz4, at the cost of noticeably more CPU.

With zstd:19 it takes significantly longer and comes close to the bz2 size. It is also supposed to use a lot more memory during the process.

time ~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J lz4:12
File nfcapd.202306010000 compression changed
202.121u 1.351s 0:13.47 1510.5% 0+0k 160+491040io 1pf+0w

time ~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J zstd:5
File nfcapd.202306010000 compression changed
14.699u 1.200s 0:02.86 555.5% 0+0k 176+444032io 1pf+0w

time ~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J zstd:6
File nfcapd.202306010000 compression changed
21.752u 1.376s 0:02.14 1080.3%  0+0k 0+436192io 0pf+0w

time ~/nfdump/src/nfdump/nfdump -r nfcapd.202306010000 -J zstd:19
File nfcapd.202306010000 compression changed
777.775u 18.165s 0:50.37 1580.1%  0+0k 200+348272io 1pf+0w

results:

:~/test-nflow> ls -alh --sort=size
total 2.6G
-rw-r--r--  1 zog noc 1.2G Jun  3 03:48 nfcapd.202306010000.none
-rw-r--r--  1 zog noc 367M Jun  3 04:11 nfcapd.202306010000.lzo
-rw-r--r--  1 zog noc 360M Jun  3 03:50 nfcapd.202306010000.lz4
-rw-r--r--  1 zog noc 240M Jun  3 04:10 nfcapd.202306010000.lz4_12
-rw-r--r--  1 zog noc 227M Jun  3 03:51 nfcapd.202306010000.zstd
-rw-r--r--  1 zog noc 217M Jun  3 04:16 nfcapd.202306010000.zstd_5
-rw-r--r--  1 zog noc 213M Jun  3 04:16 nfcapd.202306010000.zstd_6
-rw-r--r--  1 zog noc 171M Jun  3 04:28 nfcapd.202306010000.zstd_19
-rw-r--r--  1 zog noc 161M Jun  3 03:51 nfcapd.202306010000.bz2

So from this quick test, it looks like on modern hardware default zstd gives compression speed comparable to lz4 while producing considerably smaller files.

result (table form):

| compression | comp level | file size | % comp | elapsed time (m:ss.sss) |
|-------------|------------|-----------|--------|-------------------------|
| none        | -          | 1.2G      | -      | -                       |
| lzo *       | -          | 367M      | 68.44% | 0:01.63                 |
| lz4 *       | 9          | 360M      | 69.05% | 0:01.97                 |
| lz4         | 12         | 240M      | 79.36% | 0:13.47                 |
| zstd *      | 3          | 227M      | 80.48% | 0:02.34                 |
| zstd        | 5          | 217M      | 81.34% | 0:02.86                 |
| zstd        | 6          | 213M      | 81.69% | 0:02.14                 |
| zstd        | 19         | 171M      | 85.30% | 0:50.37                 |
| bz2 *       | -          | 161M      | 86.16% | 0:10.33                 |

(* = run without an explicit compression level, i.e. the default)

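The level sweep above can be scripted; a sketch assuming the uncompressed base file nfcapd.202306010000.none prepared in the setup step, using the -J recompression flag as elsewhere in this thread:

```shell
base=nfcapd.202306010000.none
for lvl in 3 5 6 19; do
    cp "$base" work.nf
    time nfdump -r work.nf -J "zstd:${lvl}"
    mv work.nf "nfcapd.202306010000.zstd_${lvl}"
    ls -lh "nfcapd.202306010000.zstd_${lvl}"
done
```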
thezoggy commented 1 year ago

> @thezoggy - right - the Sponsor link is not yet prominent enough :)

Thanks again for the work on this project. I've passed along a request to my management about submitting a donation 😺

phaag commented 1 year ago

@thezoggy - thanks for your detailed tests! Would it be possible to add the compression time to the result table? That would make it perfect!

thezoggy commented 1 year ago

I know I can modify existing nfcapd files to use zstd, but it doesn't look like you can have flows stored in zstd by default?

Use one compression: -z for LZO, -j for BZ2 or -y for LZ4 compression

I do not see a -z=zstd[:level] option in nfdump (btw, there is a missing space before "compressed" in the -J option):

-J <num>    Modify file compression: 0: uncompressed - 1: LZO - 2: BZ2 - 3: LZ4 - 4: ZSTDcompressed.
-z=lzo      LZO compress flows in output file.
-z=bz2      BZIP2 compress flows in output file.
-z=lz4[:level]  LZ4 compress flows in output file.

nfcapd

-z=lzo      LZO compress flows in output file.
-z=bz2      BZIP2 compress flows in output file.
-z=lz4[:level]  LZ4 compress flows in output file.
thezoggy commented 1 year ago

> @thezoggy - thanks for your detailed tests! Would it be possible to add the compression time to the result table? That would make it perfect!

I stopped everything on the box, redid the compressions, and updated the table with the elapsed times. I also updated the previous comment with the new outputs.

I did notice that when nfdump is doing the compression, it only uses 16 of the 64 CPUs I have on this box, no matter which compression or settings I tried. Is that expected? I'm guessing this is because MAXWORKERS is set to 16?

phaag commented 1 year ago

Many thanks @thezoggy for the update! It's very appreciated!

As for MAXWORKERS - yes, I introduced this value, as I was not sure whether compression should use all of the available CPUs or not, so I set a limit of 16 to be more user friendly. One solution could be that MAXWORKERS can be set in nfdump.conf, so people are free to adjust it as needed. Would you like to use more than 16 CPUs at a time?

thezoggy commented 1 year ago

The box has 64 cores, which are somewhat dedicated to it on the vSAN. I wanted to see if 64/32/16 CPUs made much difference. 😁

phaag commented 1 year ago

OK - nfdump.conf now takes an optional parameter for maxworkers. It can be set separately for nfdump and nfcapd.