Closed thezoggy closed 2 months ago
It looks like, you did not read my comments at https://github.com/phaag/nfsen/issues/33
Between the two commits, there is no change which breaks any compression whatsoever.
You can test this by yourself by compiling both commits and run a collector:
% mkdir tmp
% src/nfcapd/nfcapd -w tmp -z=lz4
then run in another shell:
% nfreplay -r <any_flow_file>
Finally temintate (cntrl-C) the collector and check the resulting file:
% src/nfcapd/nfcapd -w tmp -z=lz4
Bound to IPv4 host/IP: any, Port: 9995
Init v5/v7: Default sampling: 1
Init v9: Max number of v9 tags enabled: 106, default sampling: 1
Init IPFIX: Max number of ipfix tags enabled: 97, default sampling: 1
Startup nfcapd.
Process_v9: New v9 exporter: SysID: 1, Domain: 1, IP: 127.0.0.1
^CIdent: 'none' Flows: 1009, Packets: 2649, Bytes: 748862, Sequence Errors: 0, Bad Packets: 0, Blocks: 0
Terminating nfcapd.
% nfdump -v tmp/nfcapd.*
File : tmp/nfcapd.202405201700
Version : 2 - lz4 compressed
Created : 2024-05-20 17:02:06
Created by : nfcapd
nfdump : f1070400
encryption : no
Appdx blks : 1
Data blks : 1
Checking data blocks
Checking appendix blocks
Total
Type 3 blocks : 2
Records : 1013
This works likewise for both commits:
% cd nfdump-9198d94/
% src/nfdump/nfdump -V
src/nfdump/nfdump: Version: 1.7.4-9198d94 Date: 2024-03-17 17:52:55 +0100
% cd ../nfdump-564f3a9
% src/nfdump/nfdump -V
src/nfdump/nfdump: Version: 1.7.4-564f3a9 Date: 2024-05-11 10:55:50 +0200
Regarding the missing lz4.h in config.log. As I explained in https://github.com/phaag/nfsen/issues/33
conftest.c:85:10: fatal error: 'lz4.h' file not found
#include <lz4.h>
^~~~~~~
1 error generated.
configure:16827: $? = 1
This is a test, if the local lz4 library is installed on the system. If so, it uses and links the system provided lz4 librray, otherwise it compiles the provided lz4 code. This is also identical in both commits. The compression lz4 is always included in the final binary.
I suspect, that you maybe have some local inconsistencies on your system or in NfSen not the latest version. The compression flags are controlled by NfSen, when starting up the collectors.
However, your case is not an nfdump/nfcapd issue.
I understood what you said, however this was not the case.
After upgrading and figuring out that the issue was nfcapd was not compressing the files I went back at the config log and saw the lz4 lib not being included. Which I thought was odd since lz4 was just previously working on previous version and lz4 working fine when I tried it out. So thought hey maybe liblz4-dev
was needed, installed that and when building nfdump from latest, I saw lz4 lib was included just fine... However after starting nfsen I saw nfcapd using lz4 compression like before, but shortly after a few files it stopped. Looking at processes I could see -z=lz4 is being passed and puzzled why it could compress just fine if I did it manually but why it was not doing it itself anymore.
I tried reverting nfdump to earlier versions as noted, but each time I kept having the same issue. I start nfsen, see nfcapd store flows with lz4... then shortly afterwards the new files would be uncompressed.. so lz4 no longer used. I was trying to isolate what change and reverted to various versions, but each time the same outcome. I gave up and went back to 9198d94
and that version, compression works reliably.
I honestly have no clue if its some sort of forking/threading issue where settings get loss or what, why I shared the configs/service, and also if there was some lib linking issue which is why I shared the process I follow to do the upgrade.
Again, check your NfSen installation and correct settings as well as using the latest github main branch. The ps output shows the running collectors and their arguments.
It is not an nfdump/nfcapd issue.
Again, check your NfSen installation and correct settings as well as using the latest github main branch. The ps output shows the running collectors and their arguments.
It is not an nfdump/nfcapd issue.
I shared what the ps output showed above, the command used then vs now is exactly the same (just different pid). Which i noted I see -z=lz4
being passed even when it was not working. nfsen is latest version, and was untouched when doing the nfdump upgrade. When reverting back to older nfdump the issue went away.
Looking back at logs around may 16th 1600 utc is when i stopped nfsen to upgrade nfdump something does stand out now that I'm digging through this. When nfsen does the process to update live profile, before upgrade fine, after upgrade exit 1, and then now back to older version back to being fine.
netflow4:/var/log> sudo grep -A5 "Update profile live in group" messages-20240519
...
May 16 15:55:15 netflow4.vm nfsen[810477]: Update profile live in group .
May 16 15:55:24 netflow4.vm nfsen[1297208]: Run expire at Thu May 16 15:55:00 2024
May 16 15:55:24 netflow4.vm nfsen[1297208]: Expire has 220s in this slot!
May 16 15:55:24 netflow4.vm nfsen[1297208]: Expire profile live group . low water mark: 90%
May 16 15:55:24 netflow4.vm nfsen[1297208]: End expire at Thu May 16 15:55:00 2024
May 16 15:57:22 netflow4.vm systemd[1]: Stopping NfSen Service...
--
May 16 16:00:16 netflow4.vm nfsen[810477]: Update profile live in group .
May 16 16:00:16 netflow4.vm nfcapd[14033]: Ident: 'router25' Flows: 612703, Packets: 4559730000, Bytes: 3714702190000, Sequence Errors: 381, Bad Packets: 0, Blocks: 0
May 16 16:00:16 netflow4.vm nfcapd[14033]: Terminating nfcapd.
May 16 16:00:17 netflow4.vm nfcapd[3847]: Ident: 'router67' Flows: 211020, Packets: 1818595000, Bytes: 1722365545000, Sequence Errors: 237, Bad Packets: 0, Blocks: 0
May 16 16:00:17 netflow4.vm nfcapd[3847]: Terminating nfcapd.
May 16 16:00:18 netflow4.vm nfcapd[16033]: Ident: 'router68' Flows: 4362, Packets: 37765000, Bytes: 29293850000, Sequence Errors: 0, Bad Packets: 0, Blocks: 0
--
May 16 16:07:14 netflow4.vm nfsen[1311241]: Update profile live in group .
May 16 16:07:14 netflow4.vm nfcapd[1310549]: Append new sampler id: -1, algorithm: 1, packet interval: 1, packet space: 4999
May 16 16:07:14 netflow4.vm nfsen[1311241]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
May 16 16:07:14 netflow4.vm nfsen[1311241]: message repeated 2 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
May 16 16:07:14 netflow4.vm nfcapd[1311235]: Add new sampler id: -2, algorithm: 0, packet interval: 1, packet space: 4999
May 16 16:07:14 netflow4.vm nfcapd[1311235]: Process_ipfix: New ipfix exporter: SysID: 36, Observation domain 589827 from: 129.250.0.48
--
May 16 16:10:15 netflow4.vm nfsen[1311241]: Update profile live in group .
May 16 16:10:16 netflow4.vm nfcapd[1311172]: Append new sampler id: 1, algorithm: 2, packet interval: 1, packet space: 4999
May 16 16:10:17 netflow4.vm systemd[1]: Stopping User Manager for UID 0...
May 16 16:10:17 netflow4.vm systemd[1311990]: Stopped target Main User Target.
May 16 16:10:17 netflow4.vm systemd[1311990]: Stopped target Basic System.
May 16 16:10:17 netflow4.vm systemd[1311990]: Stopped target Paths.
--
May 16 16:15:15 netflow4.vm nfsen[1311241]: Update profile live in group .
May 16 16:15:15 netflow4.vm nfsen[1311241]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
May 16 16:15:17 netflow4.vm nfsen[1311241]: message repeated 262 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
May 16 16:15:21 netflow4.vm nfsen[1314881]: Run expire at Thu May 16 16:15:00 2024
May 16 16:15:21 netflow4.vm nfsen[1314881]: Expire has 223s in this slot!
May 16 16:15:21 netflow4.vm nfsen[1314881]: Expire profile live group . low water mark: 90%
...
I wonder if I can correlate that each time it did that it caused nfdump to stop using compression.. as you can see first run fine then 2nd+ exit.. which matches up with what I was seeing.
and just like before currently, working fine:
May 18 23:55:15 netflow4.vm nfsen[322631]: Update profile live in group .
May 18 23:55:20 netflow4.vm nfsen[2640158]: Run expire at Sat May 18 23:55:00 2024
May 18 23:55:20 netflow4.vm nfsen[2640158]: Expire has 224s in this slot!
May 18 23:55:20 netflow4.vm nfsen[2640158]: Expire profile live group . low water mark: 90%
May 18 23:55:21 netflow4.vm nfsen[2640158]: End expire at Sat May 18 23:55:00 2024
Is there any recent changes to nfdump that would require updates to nfsen to make something play nice again? As nothing has changed in nfsen-dist.conf in awhile..
When I get some free time this week I'll try to replicate it and see if I can triage it better with your suggestions.
A few weeks back I switched to using zstd instead of lz4 compression, which has been working great. Figured I'd give it a shot at upgrading to latest code. Sadly just like last time when upgrading to newer code it breaks compression like before.
when using this version and earlier, all works fine:
nfdump -V
nfdump: Version: 1.7.4-9198d94 Options: ZSTD BZIP2 Date: 2024-03-17 17:52:55 +0100
stored as zstd just fine
netflow4:/data/nfsen/profiles-data/live/router1/2024/07/09> nfdump -v nfcapd.202407092100
File : nfcapd.202407092100
Size : 70073544
Version : 2 - zstd compressed
Created : 2024-07-09 21:01:23
Created by : nfcapd
nfdump : f1070400
encryption : no
Appdx blks : 1
Data blks : 209
Checking data blocks
Checking appendix blocks
Total
Type 3 blocks : 210
Records : 2362339
Updated local repo to latest code, built and installed.. same as I normally do:
sudo systemctl stop nfsen
cd ~/nfdump/
git restore m4/
git pull
./autogen.sh
./configure --enable-nfprofile --enable-maxmind --enable-readpcap --enable-nfpcapd --enable-sflow
make
sudo make install
sudo ldconfig
confirmed build looks good (config.log has nothing failing/errors), and version shows zstd listed:
----------------------------------
Build Settings for nfdump v1.7.4
----------------------------------
host type = linux-gnu
install dir = /usr/local
CC = gcc
CFLAGS = -g -O3 -std=gnu17 -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wmissing-noreturn -fno-strict-aliasing -pthread
CPPFLAGS =
LDFLAGS =
LIBS = -lresolv -lpthread -latomic -llz4 -lbz2 -lzstd
Enable liblz4 = yes
Enable libbz2 = yes
Enable libzstd = yes
Enable ja4 = no
Build geolookup = yes
Build torlookup = no
Build sflow = yes
Build nfpcapd = yes
Build nfprofile = yes
Build ft2nfdump = no
----------------------------------
nfdump -V
nfdump: Version: 1.7.4-53f9fa0 Options: ZSTD BZIP2 Date: 2024-07-07 17:11:17 +0200
Shortly after starting up nfsen, I see nfdump failed "Exit: 1":
...
Jul 9 21:02:00 netflow4 systemd[1]: Started NfSen Service.
Jul 9 21:02:00 netflow4 nfcapd[7575]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:00 netflow4 nfcapd[7575]: message repeated 2 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:00 netflow4 nfcapd[7589]: Add new sampler id: -2, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:00 netflow4 nfcapd[7589]: Process_v9: New v9 exporter: SysID: 1, Domain: 33312, IP: <removed>
Jul 9 21:02:01 netflow4 nfcapd[7568]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:01 netflow4 nfcapd[7568]: message repeated 5 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:01 netflow4 nfcapd[7568]: Add new sampler id: -2, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:01 netflow4 nfcapd[7568]: Process_ipfix: New ipfix exporter: SysID: 35, Observation domain 589828 from: <removed>
Jul 9 21:02:01 netflow4 nfcapd[7568]: Add new sampler id: -2, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:01 netflow4 nfcapd[7568]: Process_ipfix: New ipfix exporter: SysID: 36, Observation domain 589829 from: <removed>
Jul 9 21:02:01 netflow4 nfsen[7596]: 0 channels/alerts to profile
Jul 9 21:02:01 netflow4 nfsen[7596]: Update profile live in group .
Jul 9 21:02:01 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:01 netflow4 nfsen[7596]: message repeated 5 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
Jul 9 21:02:01 netflow4 nfcapd[7568]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:01 netflow4 nfcapd[7568]: message repeated 5 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:01 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:01 netflow4 nfsen[7596]: message repeated 7 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
Jul 9 21:02:01 netflow4 nfcapd[7582]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:01 netflow4 nfcapd[7582]: message repeated 5 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:01 netflow4 nfcapd[7582]: Add new sampler id: -2, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:01 netflow4 nfcapd[7582]: Process_ipfix: New ipfix exporter: SysID: 69, Observation domain 590085 from: <removed>
Jul 9 21:02:01 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:01 netflow4 nfsen[7596]: message repeated 5 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
Jul 9 21:02:01 netflow4 nfcapd[7582]: Add new sampler id: -2, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:01 netflow4 nfcapd[7582]: Process_ipfix: New ipfix exporter: SysID: 70, Observation domain 590341 from: <removed>
Jul 9 21:02:01 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfsen[7596]: message repeated 47 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
Jul 9 21:02:03 netflow4 nfcapd[7568]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfcapd[7568]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfcapd[7568]: message repeated 3 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfcapd[7568]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfsen[7596]: message repeated 11 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
Jul 9 21:02:03 netflow4 nfcapd[7582]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfcapd[7582]: message repeated 5 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfcapd[7582]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfcapd[7582]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfcapd[7582]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfcapd[7582]: message repeated 2 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfcapd[7582]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfsen[7596]: message repeated 7 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
Jul 9 21:02:03 netflow4 nfcapd[7568]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfcapd[7568]: message repeated 2 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfcapd[7568]: Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999
Jul 9 21:02:03 netflow4 nfcapd[7568]: message repeated 2 times: [ Append new sampler id: -1, algorithm: 0, packet interval: 1, packet space: 4999]
Jul 9 21:02:03 netflow4 nfsen[7596]: Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0
Jul 9 21:02:03 netflow4 nfsen[7596]: message repeated 4 times: [ Run nfdump failed: Exit: 1, Signal: 0, Coredump: 0]
resulting in files not being saved with compression
netflow4:/data/nfsen/profiles-data/live/router1/2024/07/09> nfdump -v nfcapd.202407092105
File : nfcapd.202407092105
Size : 591721047
Version : 2 - not compressed
Created : 2024-07-09 21:05:10
Created by : nfcapd
nfdump : f1070400
encryption : no
Appdx blks : 1
Data blks : 283
Checking data blocks
Checking appendix blocks
Total
Type 3 blocks : 284
Records : 3200598
Looking at ps aux, command looks valid and doesnt seem to lose arguments there:
/usr/local/bin/nfcapd -D -p 40190 -u www-data -g www-data -B 200000 -S 1 -P /data/nfsen/var/run/p40190.pid -z=zstd 3 -t 300 -I router1 -w /data/nfsen/profiles-data/live/router1 -s 5000
Gave up on triage for now, reverted back to git reset --hard 9198d94
built+installed. and now everything is working as expected again.
Could you please confirm if it works or not.
Could you please confirm if it works or not.
Will try it out
Could you please confirm if it works or not.
It appears to have fixed the issue. I updated to latest code, built+install just like normal. Started up nfsen and so far after 30+ mins its still using compression.
Question for ya though: I saw you added that debug message, as I am using nfsen to start the nfcapd instances.. how do I get it to enable the more verbose logging to be able to see that message (do i have to pass -v as opt arg when defining the sources)?
All these dbg_
statements are debug statements only for development purpose.
They cannot be enabled with a switch, as this is a compile time option. You do not want to run production code with dbg statements ..
moving issue from nfsen to here and cleaning up issue to focus on just the one problem here.
When upgrading from 9198d94 to latest (564f3a9) code, nfcapd stopped storing files in lz4 compression. Reverting back to 9198d94 and everything works as before.
upgraded nfdump from:
to
changes between git hashes: https://github.com/phaag/nfdump/compare/9198d94..564f3a9
I upgraded nfdump at 2024-05-16 ~1600 UTC, and while appeared fine initially but per graphs there you can see it struggled up until a day later box ran out of hdd space..
![Screen Shot 2024-05-17 at 4 42 58 PM](https://github.com/phaag/nfsen/assets/500882/d18a1e1e-a20b-4548-8c83-69b9d452d52a)
At first I thought maybe stuff stopped expiring but then later discovered that after the upgrade nfcapd was no longer compressing the data to lz4 like before. Which caused it to use up the rest of the free space on the box until everything broke down.
Looking at one router as example..
before upgrade ~115M a file, then post upgrade 450M a file
so went from lz4 compressed
to not compressed
This box is running: Ubuntu 20.04.6 LTS Have been using it to solely handle nfsen/nfdump as a netflow ingestion box.
The upgrade process:
I've done this process numerous times over the years without any issues, and for whatever reason this time you can see that in the config.log when I built out 564f3a9 that it failed to include lz4:
during this, you can see it failed check for lz4 but ended up with embedded support for lz4:
I looked back and saw that nothing really changed on the box just the normal security patches weeks ago but nothing recently since the last server boot as the box has been working fine upto this point and I see lz4 is still installed.
I thought maybe
liblz4-dev
needs to be installed now? Installed and tried rebuilding latest code again, and lz4 this time was built/included:do see:
Then also to rule out any gremlins I rebooted the box.
On boot, nfsen started and I saw that some nfcapd files being saved were compressed with lz4.
looking at nfsen status, I do see it is
-z=lz4
is being passed to nfcapdI can confirm one of the uncompressed files can be compressed with nfdump just fine manually:
But files again are no longer was being compressed. I did not see anything in logs to give me any clue why, so thinking it was just a code bug I tried reverting nfdump code to earlier versions (built+install) to try and isolate where it might have been included.
Tried:
Same issue, see lz4 being used at first:
Then checking an hour later, compression no longer being used:
Stopped and reverted nfdump to older commit:
Same issue, no compression..
Reverting back to the version I was running before all this started..
No everything using lz4 compression just fine, even hours/days later:
in case it helps, here are config/service info:
nfdump.conf
nfsen.conf
nfsen.service
cleaning up
To just document here to help others, due to amount of files (if doing layout 1) for the profile it meant that when you tried to do something like
nfdump -r live/ -J lz4
it would run and eventually stop once it hit x files (guessing argument list limit internally or something). So I just segment it to doing it per subfolder in the profile, and could also have it focus on a specific day with something like:During this command, various routers would toss a "appendix offset error" for "nfcapd.202405171645". I had to delete this nfcapd file before the above command could actually handle any of the files afterwards. So for me was easy to just delete anything smaller than 8k as we know that would be a 'bad' file:
Then just re-run command, and now since it wont get hung up on those files it should be able to convert all the files...
If you find that your files ownership changed, you can fixup by doing:
sudo chown <user>:<group> /data/nfsen/profiles-data/live -R
Now that all that is done, rebuild profile:
It makes me puzzled of what changes could cause compression to stop working, and oddly for it to work for a few files then stop.. when upgrading to newer code.