Closed kenatgcc closed 5 years ago
Can you post a screenshot of the page with error?
Screenshots added.
What's the latest output of sudo journalctl -u ntopng?
Attached.
The log is truncated on the right, please post again
Can you post a screenshot of the "SNMP Devices" view? What is the value of "Time Since Last Poll"? Inside the SNMP device configuration, under the cog icon, is "Device Polling" enabled?
Attached.
Can you provide remote gui access to debug this more easily? Please also see https://www.ntop.org/guides/ntopng/remote_assistance.html .
Are there any recent alerts for SNMP devices under the alerts icon above? The service log does not seem to be the most recent one; can you double check using journalctl and scrolling the view to the bottom? What's the output of sudo systemctl status ntopng?
Attached are the outputs you requested: journalctl3.txt systemctl_status_ntopng.txt
Attached.
SNMP alerts screen shot
I have loaded the n2n package, when UTC will you have time to connect? n2n_assistance.zip
Please disable it and enable again to reset the credentials, send me the credentials privately at faranda@ntop.org, I can connect now.
Edit: make sure to enable gui admin access or provide via email gui user credentials
Should I turn on InfluxDB?
Yes
Do you have a link with InfluxDB instructions?
Or do I need to install the package? From what I read here, it looks like it should be automatically configured. https://www.ntop.org/guides/ntopng/basic_concepts/timeseries.html#influxdb-driver
You need to install InfluxDB 1.7 from https://portal.influxdata.com/downloads . Then you can use the instructions above to enable it in ntopng.
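Once InfluxDB is installed, it can help to confirm that it is reachable before enabling it in ntopng. The following is a minimal sketch, assuming a default InfluxDB 1.x setup listening on http://localhost:8086; it queries the InfluxDB 1.x `/ping` endpoint, which answers with HTTP 204 and an `X-Influxdb-Version` header. The function name `influxdb_version` is hypothetical and is not part of ntopng.

```python
import urllib.request


def influxdb_version(base_url="http://localhost:8086", timeout=5.0):
    """Return the version reported by an InfluxDB 1.x /ping endpoint.

    base_url is an assumption (the InfluxDB 1.x default port is 8086);
    adjust it to match your installation.
    """
    with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
        # InfluxDB 1.x replies to /ping with "204 No Content" when healthy.
        if resp.status != 204:
            raise RuntimeError(f"unexpected status from /ping: {resp.status}")
        return resp.headers.get("X-Influxdb-Version")


if __name__ == "__main__":
    print("InfluxDB version:", influxdb_version())
```

If the call raises a connection error, the service is probably not running or is listening on a different address.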
Okay, that's what I figured. What about nProbe running on the ntopng server, is that required? If so, should I be using a licensed Pro version so it's not limited in flow capacity?
You need nprobe only if you are capturing netflow/sflow traffic, please check out https://www.ntop.org/nprobe/network-monitoring-101-a-beginners-guide-to-understanding-ntop-tools/ for more details. In such cases you will need a license.
I only want the ntopng server to process ZMQ feeds from nProbe so I will systemctl disable nprobe.
I have InfluxDB 1.7.6 running as the timeseries database and that seems to be working fine. With nProbe disabled on the ntopng server, I am not seeing any flow exporters or any SNMP chart views.
Without nProbe it is normal that "Flow exporters" is not shown, as you are now only capturing from the network interfaces. For the charts you need to wait at least 2 SNMP refresh times, so about 10 minutes after ntopng starts, and then they should appear.
I am not seeing the chart icon in the SNMP Device view after an hour.
Do you have any alerts below the InfluxDB "System" menu entry? Do you have errors in the ntopng log?
No alerts below the InfluxDB "System" menu entry. A few errors in ntopng.log
20/Jun/2019 12:19:46 [LuaEngine.cpp:9040] WARNING: Script failure [/usr/share/ntopng/scripts/callbacks/system/5min.lua][.../share/ntopng/scripts/callbacks/system/5min/influxdb.lua:120: attempt to call a nil value (method 'getInfluxdbVersion')]
20/Jun/2019 12:24:23 [LuaEngine.cpp:9040] WARNING: Script failure [/usr/share/ntopng/scripts/callbacks/system/5min.lua][.../share/ntopng/scripts/callbacks/system/5min/influxdb.lua:120: attempt to call a nil value (method 'getInfluxdbVersion')]
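When the log is long, it can be easier to count how often each script failure repeats than to read it line by line. The sketch below is an assumption-laden helper, not an ntopng tool: it only relies on the entry layout visible in the two warnings above (timestamp, `[file:line]`, level, message), and the name `summarize_script_failures` is hypothetical.

```python
import re
from collections import Counter

# Timestamp layout as seen in the pasted log, e.g. "20/Jun/2019 12:19:46".
_TS = r"\d{2}/[A-Za-z]{3}/\d{4} \d{2}:\d{2}:\d{2}"

# One log entry: timestamp, [source:line], LEVEL, then the message, which
# runs until the next entry header or the end of the text.
ENTRY = re.compile(
    rf"(?P<ts>{_TS}) "
    r"\[(?P<src>[^\]:]+):(?P<line>\d+)\] "
    r"(?P<level>[A-Z]+): "
    rf"(?P<msg>.*?)(?=\s*{_TS} \[|\s*\Z)",
    re.DOTALL,
)

# Lua reports the missing method as: (method 'name')
METHOD = re.compile(r"\(method '(?P<name>[^']+)'\)")


def summarize_script_failures(log_text):
    """Count "Script failure" entries, grouped by the missing Lua method."""
    counts = Counter()
    for entry in ENTRY.finditer(log_text):
        msg = entry.group("msg")
        if "Script failure" not in msg:
            continue
        method = METHOD.search(msg)
        counts[method.group("name") if method else "<unknown>"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < ntopng.log
    for name, count in summarize_script_failures(sys.stdin.read()).most_common():
        print(f"{count:5d}  {name}")
```

Entries are matched on the timestamp header rather than on newlines, so the helper also copes with pasted logs in which several entries ended up on one line.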
I am seeing the chart icons slowly populate, there are maybe 25 out of the 76 SNMP devices showing up. Perhaps if I wait a few more hours they will all get there.
I suspect the problem is related to SNMP walks taking too much time. In https://github.com/ntop/ntopng/commit/8a239f97f443a442f595a4c2bd0e9ead37657e60 I've added a trace and alerts to monitor such duration. Please wait one hour and install the new package. Then check out the system alerts after 30 minutes and see if there are "Slow Periodic Activity" alerts:
I am seeing a couple of "Slow Periodic Activity" alerts after the upgrade. I have also reduced my SNMP devices from 76 to 17 to see if that helps.
I am getting a few of these alerts per hour since I installed the patch yesterday.
We need to fix this
Let me know what I can do to help.
@kenatgcc we pushed a fix for this to control the maximum duration of the SNMP walks, please update and let us know. Thank you.
The alerts are still occurring after applying v3.9.190629 but the time to complete is reduced.
@kenatgcc please update again to today's build and let us see the time to complete. Thank you.
Some initial slow alerts, I will check again in about an hour and post results.
Alerts for timeseries and discover.lua were incorrect, now fixed in d8d2638d50d48f4592191b2c4f0484e00970a594 . The second script, however, is taking very long. Are you still using InfluxDB for the export as shown in image below?
Yes, InfluxDB is still my database. The latest system alerts are below
@kenatgcc please update again (sorry for all the iterations, we are trying to identify what is causing the delay in your installation). Thank you.
@cardigliano I may have mentioned before that this is a proof of concept machine, basically just a desktop PC. Below is a snapshot from Webmin. Does this application need a more powerful machine to run properly? Load average rarely goes above 1.00.
Here's some good news, there have been no "Slow Periodic Activity" alerts since the update and restart @ 08:10 EDT.
@kenatgcc the machine you are currently using seems to be powerful enough for what you are doing now. As for the alerts, we are currently interrupting SNMP polling to honour the 5-minute polling slot; however, we should now report which activity has been interrupted. Thank you for your feedback.
@kenatgcc I pushed more improvements to detect "unresponsive" devices when snmp polling is interrupted: you should see a triangle (warning) on the corresponding device in the list. This will be available with the next build.
@kenatgcc please confirm that you see the warning (triangle) for "unresponsive" devices. In any case, it seems the periodic activity honours the deadline now.
@cardigliano I see the triangles indicating "unresponsive" SNMP devices, but the devices respond and complete the "snmpwalk" command from the server within a few seconds. What do you suggest to fix the "unresponsive" devices? Delete and re-add, perhaps?
@kenatgcc this seems to be due to a few factors: the number of walks ntopng does, the time each walk takes, the number of devices and ports you have. We need to rework the way ntopng polls the devices to optimize all of this and make sure it completes all operations within the time slot even with the number of devices you have.
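The factors listed above (number of devices, walks per device, time per walk) can be turned into a rough back-of-the-envelope check against the 5-minute polling slot. This is only a sketch: it assumes the walks run strictly one after another, which may not match how ntopng actually schedules them, and the values for walks per device and seconds per walk are hypothetical placeholders to be replaced with your own measurements. Only the device counts (76 and 17) and the 300-second slot come from this thread.

```python
def estimate_poll_seconds(devices, walks_per_device, seconds_per_walk):
    """Total polling time if every walk runs sequentially (an assumption)."""
    return devices * walks_per_device * seconds_per_walk


def max_devices_in_slot(slot_seconds, walks_per_device, seconds_per_walk):
    """Largest device count whose sequential polling fits in the slot."""
    per_device = walks_per_device * seconds_per_walk
    return int(slot_seconds // per_device)


if __name__ == "__main__":
    SLOT_SECONDS = 300        # the 5-minute polling slot mentioned above
    WALKS_PER_DEVICE = 4      # hypothetical: measure on your own setup
    SECONDS_PER_WALK = 2.0    # hypothetical: e.g. time a manual snmpwalk

    for devices in (76, 17):  # device counts mentioned in this thread
        total = estimate_poll_seconds(devices, WALKS_PER_DEVICE, SECONDS_PER_WALK)
        verdict = "fits" if total <= SLOT_SECONDS else "exceeds the slot"
        print(f"{devices} devices: ~{total:.0f} s ({verdict})")

    limit = max_devices_in_slot(SLOT_SECONDS, WALKS_PER_DEVICE, SECONDS_PER_WALK)
    print(f"max devices under these assumptions: {limit}")
```

Even when each device answers a single manual snmpwalk within a few seconds, the total across many devices and several walks can still exceed the slot, which would be consistent with polling being interrupted for some devices.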
@cardigliano should I just delete the unresponsive SNMP devices?
@kenatgcc as I said, we need to improve SNMP polling to speed it up and make it work for you. In the meantime, it's up to you whether you want to keep those devices there; however, please note that polling is getting interrupted for those devices, so data is missing for them.
I get "No results found" when trying to view charts for SNMP devices or flow exporters. I have SNMP timeseries enabled. My system info is:
Debian GNU/Linux 9.1 (stretch) Debian 9.1 [x86_64][Debian GNU/Linux 9.1 (stretch)] - 64 bit
ntopng --interface "tcp://10.4.56.32:5556" --interface "enp0s25" --local-networks "10.3.0.0/16"
2.9.0-1555-41b7548 7.52.1 3.x 4.x 1.6.0 1.0 5m 3.2.6 3.7 Lua 5.3.5 4.2.1 1.2.0 This product includes GeoLite data created by MaxMind. 2.9.1 / 3.0