ntop / ntopng

Web-based Traffic and Security Network Traffic Monitoring
http://www.ntop.org
GNU General Public License v3.0

ntopng-5.0-stable repeatedly stops recording at ZMQ source #5867

Closed: GrumpyOldNetworkGuy closed this issue 3 years ago

GrumpyOldNetworkGuy commented 3 years ago

Hi,

Restarting ntopng reactivates recording on nprobe@xwin[0,1]. The other two ZMQ sources (LAN, DC) work perfectly. [screenshot: Interface-X-WIN]

The only strange thing I found is that minute.lua starts causing problems at the same time. [screenshot: X-WIN-periodic-activities]

When it gets very bad, the GUI stops working at the same moment. The connection to TCP/3443 is still up at that point.

The TCP connections between ntopng and nprobe@xwin[0,1] are intact, but no data flows:

root@schnuffel:~# tcpdump -nni lo 'tcp port (5556 or 5557)'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
16:38:17.012950 IP 127.0.0.1.52584 > 127.0.0.1.5556: Flags [.], ack 1026771819, win 0, options [nop,nop,TS val 586928350 ecr 586897629], length 0
16:38:17.012956 IP 127.0.0.1.55016 > 127.0.0.1.5557: Flags [.], ack 3057400202, win 0, options [nop,nop,TS val 586928350 ecr 586897629], length 0
16:38:17.013001 IP 127.0.0.1.5556 > 127.0.0.1.52584: Flags [.], ack 1, win 256, options [nop,nop,TS val 586928350 ecr 586909914], length 0
16:38:17.013008 IP 127.0.0.1.5557 > 127.0.0.1.55016: Flags [.], ack 1, win 256, options [nop,nop,TS val 586928350 ecr 586907865], length 0
16:38:47.728936 IP 127.0.0.1.52584 > 127.0.0.1.5556: Flags [.], ack 1, win 0, options [nop,nop,TS val 586959066 ecr 586928350], length 0
16:38:47.728945 IP 127.0.0.1.55016 > 127.0.0.1.5557: Flags [.], ack 1, win 0, options [nop,nop,TS val 586959066 ecr 586928350], length 0
16:38:47.728964 IP 127.0.0.1.5556 > 127.0.0.1.52584: Flags [.], ack 1, win 256, options [nop,nop,TS val 586959066 ecr 586909914], length 0
16:38:47.728976 IP 127.0.0.1.5557 > 127.0.0.1.55016: Flags [.], ack 1, win 256, options [nop,nop,TS val 586959066 ecr 586907865], length 0
^C
8 packets captured
16 packets received by filter
0 packets dropped by kernel
root@schnuffel:~# 
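In the capture above, every segment sent from the ephemeral ports (52584 and 55016) advertises `win 0`, while the segments from ports 5556 and 5557 advertise `win 256`. Assuming the ephemeral ports belong to ntopng (the ntopng.conf later in this thread connects to tcp://127.0.0.1:5556 and tcp://127.0.0.1:5557), this pattern would suggest that the receiving side has stopped reading from its sockets, rather than that the connection is broken. The following is a minimal sketch for flagging such zero-window segments in saved `tcpdump -nn` output. The function name `zero_window_senders` is hypothetical, and the regular expression assumes the IPv4 line format shown above.

```python
import re

# Matches IPv4 lines printed by `tcpdump -nn`, for example:
# 16:38:17.012950 IP 127.0.0.1.52584 > 127.0.0.1.5556: Flags [.], ack ..., win 0, ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(?P<src>\S+) > (?P<dst>[^:\s]+): "
    r".*?\bwin (?P<win>\d+)\b"
)


def zero_window_senders(lines):
    """Count, per source endpoint, the segments that advertise a zero receive window."""
    counts = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and int(match.group("win")) == 0:
            src = match.group("src")
            counts[src] = counts.get(src, 0) + 1
    return counts


# Two lines copied from the capture above.
sample = [
    "16:38:17.012950 IP 127.0.0.1.52584 > 127.0.0.1.5556: Flags [.], ack 1026771819, "
    "win 0, options [nop,nop,TS val 586928350 ecr 586897629], length 0",
    "16:38:17.013001 IP 127.0.0.1.5556 > 127.0.0.1.52584: Flags [.], ack 1, "
    "win 256, options [nop,nop,TS val 586928350 ecr 586909914], length 0",
]

print(zero_window_senders(sample))
```

An endpoint that shows up repeatedly in the result is the side whose receive buffer is full, which narrows the search to the process that owns that socket.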

To make things as easy as possible for ntopng, almost all checks have been disabled. By the way, apart from "unexpected DHCP", the checks shown here cannot be disabled. The error message is "Unknown user script ...". [screenshot: checks]

Here is the configuration: configs.txt

How can I help you find the cause?

Thanks! Gerd

GrumpyOldNetworkGuy commented 3 years ago

Hi,

Our ntopng fails again like in the last few days.

simonemainardi commented 3 years ago

Seems minute.lua stalls. Please, post full nProbe and ntopng configurations. Also include recipients and endpoints if you are using them.

GrumpyOldNetworkGuy commented 3 years ago

@simonemainardi Here are the complete configuration files: schnuffel-conf.zip

No recipients or endpoints are configured in addition to builtin_recipient_sqlite/builtin_endpoint_sqlite. No alarms have been recorded in the last 24 hours.

GrumpyOldNetworkGuy commented 3 years ago

@simonemainardi

--- 18:50 CEST ---

Now I started ntopng with only nprobe@xwin0 and nprobe@xwin1 as the source:

--- ntopng.conf ---
#
#-i=br980
-i=tcp://127.0.0.1:5556,tcp://127.0.0.1:5557
#-i=tcp://127.0.0.1:5558
#-i=tcp://127.0.0.1:5559
#

Let's wait for the next standstill.

What do you think? Should I upgrade to the nightly build packages and test them?

GrumpyOldNetworkGuy commented 3 years ago

@simonemainardi --- 02:10 CEST ---

[screenshots: Interface-X-WIN, X-WIN-minute lua, Script-failure]

Now I start ntopng with a direct connection of zc:ens2f0@0,zc:ens2f0@1 without nprobe:

--- ntopng.conf ---
#
#-i=br980
#-i=tcp://127.0.0.1:5556,tcp://127.0.0.1:5557
-i=zc:ens2f0@0,zc:ens2f0@1
#-i=tcp://127.0.0.1:5558
#-i=tcp://127.0.0.1:5559
#

I'm curious...

GrumpyOldNetworkGuy commented 3 years ago

@simonemainardi It's crazy, but now I was able to turn off the rest of the checks:

[screenshot: Checks-Flow-Disabled]

simonemainardi commented 3 years ago

Do you have top sites or traffic behavior enabled?

GrumpyOldNetworkGuy commented 3 years ago

Both are switched off.

In the current test, ntopng drops packets like hell:

[screenshots: Packets-vs-Drops, Interface-X-WIN-drops]

GrumpyOldNetworkGuy commented 3 years ago

@simonemainardi --- 11:17 CEST ---

So ZMQ is probably not the problem. [screenshots: Interface-X-WIN2, X-WIN-minute lua2]

GrumpyOldNetworkGuy commented 3 years ago

@simonemainardi I restarted ntopng. Now it is using InfluxDB on another host. The other settings are the same as in the previous experiment.

GrumpyOldNetworkGuy commented 3 years ago

@simonemainardi --- 19:44 CEST ---

Running InfluxDB on another host didn't help either.

simonemainardi commented 3 years ago

Can we arrange a remote session to troubleshoot? As soon as you find the process stuck again drop me an email at mainardi at ntop dot org and I can connect.

simonemainardi commented 3 years ago

I've fixed an issue possibly causing this behavior. New dev and stable builds are in progress, please hold on a couple of hours, update, and let me know. Thanks for providing the machine for the debugging.

simonemainardi commented 3 years ago

@GrumpyOldNetworkGuy can you please confirm that this issue is solved and close it? Thanks.

GrumpyOldNetworkGuy commented 3 years ago

New uptime record for ntopng-5.0-stable: 2 Days, 16:01:04! Thank you @simonemainardi for fixing this issue.