ntop / nProbe

Open source components and extensions for nProbe
http://ntop.org
GNU General Public License v2.0
1.66k stars 44 forks source link

nprobe is creating too many tiny files #517

Open g1l3sp opened 2 years ago

g1l3sp commented 2 years ago

Hello,

We're running nProbe to dump flow files like this:

/usr/bin/nprobe -i fbcard:0:a06 --verbose 1 --max-log-lines 100000 --dump-path /u01/flow/raw/fbcard-0-a06/ --collector none --dump-format t --dont-nest-dump-dirs --dont-drop-privileges --smart-udp-frags --hash-size 524288 --max-num-flows 1073741823 -V 10

As you can see, we have a fiberblaze card that we are collecting from. Usually, we get files that are nicely sized at around 14MB, but for some reason, nProbe eventually (usually after running for quite a while) gets into a state where it is dumping tons of tiny files, sometimes thousands per second. When this happens, it is creating big problems on the host that is running nProbe, as having millions of files in a single directory makes many things difficult.

Here's an example where nProbe has created over 3000 tiny files in a single second:

user:/u01/flow/raw/fbcard-0-a06$ ls -Fal | grep "20211209_151159.*flows$" | head
-rw-r--r--  1 root   root        504 Dec  9 15:11 20211209_151159-1000.flows
-rw-r--r--  1 root   root        654 Dec  9 15:11 20211209_151159-1001.flows
-rw-r--r--  1 root   root       2831 Dec  9 15:11 20211209_151159-1002.flows
-rw-r--r--  1 root   root        516 Dec  9 15:11 20211209_151159-1003.flows
-rw-r--r--  1 root   root        644 Dec  9 15:11 20211209_151159-1004.flows
-rw-r--r--  1 root   root        706 Dec  9 15:11 20211209_151159-1005.flows
-rw-r--r--  1 root   root       3674 Dec  9 15:11 20211209_151159-1006.flows
-rw-r--r--  1 root   root        516 Dec  9 15:11 20211209_151159-1007.flows
-rw-r--r--  1 root   root        518 Dec  9 15:11 20211209_151159-1008.flows
-rw-r--r--  1 root   root       4490 Dec  9 15:11 20211209_151159-1009.flows
user:/u01/flow/raw/fbcard-0-a06$ ls -1 | grep "^20211209_151159.*flows$" | wc -l
3012
user:/u01/flow/raw/fbcard-0-a06$

I notice that this seems to happen almost exlusively in the 58th and 59th second of the minute, so it seems like it may have to do with the way a boundary condition is being handled in the code.

Here's some output from a script I wrote to try to clean up one of the directories where this happened. It is looking for clusters of files created in a single second, and then catting them together while counting how many there were. The output here is the first filename that came up in the directory listing for that second worth of data, and then the number of files that were dumped out :

20211209_185559-1000.flows 1538
20211209_185657-10.flows 43
20211209_185658-1000.flows 1641
20211209_185659-10.flows 12
20211209_185758-100.flows 560
20211209_185759-1000.flows 1097
20211209_185858-100.flows 503
20211209_185859-100.flows 341
20211209_185958-100.flows 595
20211209_185959-100.flows 601
20211209_190058-100.flows 521
20211209_190059-100.flows 514
20211209_190158-100.flows 605
20211209_190159-100.flows 481
20211209_190258-100.flows 699
20211209_190259-100.flows 548
20211209_190358-100.flows 397
20211209_190359-100.flows 751
20211209_190458-100.flows 797
20211209_190459-1.flows 9
20211209_190558-100.flows 553
20211209_190559-100.flows 376
20211209_190758-100.flows 592
20211209_190759-100.flows 505
20211209_190858-100.flows 684
20211209_190859-100.flows 625
20211209_190958-100.flows 411
20211209_190959-100.flows 437
20211209_191058-100.flows 571
20211209_191059-100.flows 464
20211209_191158-100.flows 411
20211209_191159-100.flows 373
20211209_191258-100.flows 481
20211209_191259-100.flows 703
20211209_191358-100.flows 397
20211209_191359-100.flows 451
20211209_191458-100.flows 281
20211209_191459-100.flows 295
20211209_191558-100.flows 569
20211209_191658-1.flows 6
20211209_191659-100.flows 362
20211209_191758-100.flows 373
20211209_191759-100.flows 443
20211209_191858-100.flows 509
20211209_191859-100.flows 528
20211209_191958-100.flows 325

So it's definitely looking like something happens when the end of the minute rolls around that is causing this issue.

Some basic version information:

nProbe version info: Version: 9.6.211108 Build OS: Ubuntu 20.04.3 LTS GIT rev: 9.6-stable:d057e2f48cfd28ed7c57b7db18a9228b8a4e6fa6:20211108 Edition: nProbe Pro

If there is more information that I can provide, please let me know.

g1l3sp commented 2 years ago

Please let me know if I can clarify any of the information above, or gather more information that will be useful. This is a continuing issue for us that we're remediating with "bandage" scripts.

g1l3sp commented 2 years ago

I updated the Fiberblaze software version to the latest (3.6.1.1) in the hopes that would help. I also confirmed we are on the latest firmware for the NIC. Restarting the host and starting the processes up again did temporarily correct the situation, but after about 10 days running, we started seeing it again: hundreds or sometimes thousands of tiny files dumped out by nprobe in the 59th second of the minute.

I should add that nprobe and related packages were also updated to the latest stable versions as a part of this. As of now we are on nprobe v.9.6.211220.

g1l3sp commented 6 months ago

It's more than two years later, and this issue still occurs regularly (maybe every couple of weeks). My mitigation is to kill nprobe, reload the fiberblaze driver, start nprobe back up, and run my script that concatenates those files together into larger files based on the second that they occurred. If I don't reload the fiberblaze driver and just restart nprobe, the issue persists, which seems to indicate some sort of state in the driver that doesn't play nicely with nprobe. However, nprobe is the one creating the many small files, the majority of which have between 1 and 5 flows in them.

If there is any will to get to the bottom this, I'm glad to do what I can to assist.