ntop / n2disk

Open source components and extensions for n2disk

n2disk is filling napatech buffer and showing unexpected behavior #17

Open igorribeiroduarte opened 4 years ago

igorribeiroduarte commented 4 years ago

I'm currently running n2disk without a license, just for testing, so the service goes down every 5 minutes. I have Napatech running inside one container and n2disk running inside another; both services are orchestrated by Docker Swarm. At first n2disk works very well, capturing all my network traffic without dropping any packets. After 5 minutes the n2disk service goes down (as expected) and Swarm brings it back, and this cycle repeats indefinitely. After some time (it may take minutes or hours), without any special event or throughput peak, the Napatech buffer reaches 100% and n2disk not only stops recording packets but also stops restarting every 5 minutes (the service stays up until I manually kill the process). Restarting the n2disk service doesn't solve the problem: as soon as n2disk is up, the Napatech buffer reaches 100% again and the problem remains. The services only return to the expected behavior after killing both the Napatech AND the n2disk services.

Below is the output of the /proc/net/pf_ring/stats/16004-none.383 file:

Duration: 0:02:46:01:446
Throughput: 0.00 Mpps 0.00 Gbps
Packets: 0
Filtered: 0
Dropped: 23371139
Bytes: 0
DumpedBytes: 0
DumpedFiles: 0
SlowSlavesLoops: 0
SlowStorageLoops: 0
CaptureLoops: 0
FirstDumpedEpoch: 0
LastDumpedEpoch: 0
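For anyone who wants to detect this state automatically, here is a minimal Python sketch that parses the stats text shown above and flags the symptom reported here (no packets captured while the drop counter is non-zero). The field names are taken from the output above. The function names `parse_stats` and `looks_stalled` and the stall heuristic itself are assumptions made for illustration, not something n2disk or PF_RING defines.

```python
import re

# Sample text copied from the stats output reported in this issue.
SAMPLE = (
    "Duration: 0:02:46:01:446 Throughput: 0.00 Mpps 0.00 Gbps Packets: 0 "
    "Filtered: 0 Dropped: 23371139 Bytes: 0 DumpedBytes: 0 DumpedFiles: 0 "
    "SlowSlavesLoops: 0 SlowStorageLoops: 0 CaptureLoops: 0 "
    "FirstDumpedEpoch: 0 LastDumpedEpoch: 0"
)


def parse_stats(text):
    """Split 'Key: value' pairs into a dict of strings.

    A key is a run of letters followed by a colon and whitespace, so the
    colons inside the Duration value (digits on both sides) are not
    mistaken for keys. Works whether fields are separated by spaces or
    by newlines.
    """
    matches = list(re.finditer(r"([A-Za-z]+):\s", text))
    stats = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        stats[m.group(1)] = text[m.end():end].strip()
    return stats


def looks_stalled(stats):
    """Hypothetical heuristic: nothing captured, but drops were counted."""
    packets = int(stats.get("Packets", "0"))
    dropped = int(stats.get("Dropped", "0"))
    return packets == 0 and dropped > 0


if __name__ == "__main__":
    parsed = parse_stats(SAMPLE)
    print(parsed["Dropped"], looks_stalled(parsed))
```

In practice the text would be read from the stats file under /proc/net/pf_ring/stats/ rather than from `SAMPLE`; the file name changes with each process, so it has to be discovered at runtime.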

igorribeiroduarte commented 4 years ago

I forgot to say that I'm using n2disk1g.

cardigliano commented 4 years ago

@igorribeiroduarte I haven't been able to reproduce this yet; it seems to happen only under certain conditions, so I will keep testing. However, it seems that killing n2disk sometimes leaves the Napatech stream in an inconsistent state, leading to loops in the Napatech service. This is my assumption based on your symptoms. Setting a valid n2disk license should avoid this situation.

igorribeiroduarte commented 4 years ago

@cardigliano I've been running some tests, and the loop seems to be in the n2disk service and not in Napatech, because we have other applications reading from the nt buffer and they keep working correctly after n2disk gets killed (after filling the nt buffer). I have already added the n2disk and pfring licenses, but this is still a problem for us, since we sometimes need to restart our stack. In addition, I think the problem may lie in n2disk's initialization rather than in the way n2disk is killed, since n2disk seems to stop gracefully (I tested with SIGINT and SIGTERM) right before the problem happens.

cardigliano commented 4 years ago

@igorribeiroduarte could you provide the n2disk configuration file? Are you using a port or a stream as the interface in the configuration?

igorribeiroduarte commented 4 years ago

@cardigliano I didn't know there was a configuration file for n2disk. Where can I read about it? I wasn't able to find it in the documentation. I'm using the following arguments to run it:

n2disk1g -I -A index_dir -p 1024 -b 1024 -i nt:1 -n 1000 -m 1000 -t 15 -O /tmp -o /disco03 -o /disco04

As you can see, I'm reading directly from the port.

cardigliano commented 4 years ago

@igorribeiroduarte please check this guide for the configuration file: http://www.ntop.org/guides/n2disk/how_to_start.html

igorribeiroduarte commented 4 years ago

@cardigliano thanks, but it's just a way to preset the arguments, right? It doesn't affect the bug we're discussing, correct?

cardigliano commented 4 years ago

Correct

lucasbaile commented 3 years ago

Any news on this bug? It's been happening to us quite often.

cardigliano commented 3 years ago

@lucasbaile what's happening in your case exactly? Do you have issues after restarting n2disk?

lucasbaile commented 3 years ago

@cardigliano The situation is exactly as described by @igorribeiroduarte. When running n2disk1g, with or without a valid license, the n2disk1g binary seems to get stuck when it tries to tear down gracefully, holding the Napatech buffer and causing all the data on that stream to be dropped. The only action that seems to reliably fix the situation is sending a SIGKILL to the process, but this is quite annoying when trying to automate things. I'll leave some more info below to try to help. If any other info is needed, just let me know.

Napatech Model: NT20E3-2-PTP
n2disk1g Version: v.3.4.200731 (r5214)
n2disk1g Command: n2disk1g -I -P /var/run/n2disk/n2disk_4.pid -G 1 -A index_dir -p 1024 -b 1024 -i nt:stream4 --disk-limit 20% -t 15 -o /data/task_4
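As a stopgap for automation, the manual SIGKILL described above can be scripted. The following Python sketch asks a process to stop with SIGTERM and sends SIGKILL only if it is still alive after a grace period. It is a workaround under stated assumptions, not a fix for the underlying hang: the function name `stop_with_escalation` and the 10-second default are hypothetical, and the process is assumed to have been started from the same script via `subprocess.Popen`.

```python
import signal
import subprocess


def stop_with_escalation(proc, grace_seconds=10.0):
    """Stop a child process, escalating from SIGTERM to SIGKILL.

    proc is a subprocess.Popen object. Returns "SIGTERM" if the process
    exited within the grace period, or "SIGKILL" if it had to be killed.
    """
    # Ask for a graceful shutdown first.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # Still running after the grace period: force termination
        # and reap the child so it does not linger as a zombie.
        proc.kill()
        proc.wait()
        return "SIGKILL"
```

A caller would start n2disk1g with `subprocess.Popen([...])` using its own argument list and call `stop_with_escalation(proc)` whenever the stack needs to be restarted. Container runtimes apply a similar pattern when stopping a container (SIGTERM followed by SIGKILL after a grace period), so tuning that grace period may be an alternative worth checking.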