librenms / librenms

Community-based GPL-licensed network monitoring system
https://www.librenms.org

Default rrdtool create & script "tune_port.php" should be changed. #14582

Open NightowlKr opened 1 year ago

NightowlKr commented 1 year ago

The point of this issue is that the graph data looks wrong.

[screenshot 1]

It should be the sum of the graphs for the two ports below.

[screenshots 2 and 3]

Moreover, the total graph also looks wrong.

[screenshot 4]

I checked that collection was working fine.

[screenshot 5]

So I found the point of failure.

[screenshot 6]

Eventually, I found that the RRD file had no data at that point.

[screenshot 7]

When I looked at the structure of the RRD file, it seemed strange: according to the documentation below, it should support up to 100 Gbps. https://docs.librenms.org/Extensions/RRDTune/

[screenshot 8]

It was the same even after running the corresponding PHP script manually.

[screenshots 10 and 11]

Finally, to resolve this, I changed the maximum manually, following the rrdtune documentation below. https://oss.oetiker.ch/rrdtool/doc/rrdtune.en.html#:~:text=disable%20this%20limit.-,%2D%2Dmaximum%7C%2Da%C2%A0ds%2Dname%3Amax,-alter%20the%20maximum

[screenshot 12]
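
The manual fix described above can be sketched roughly as follows. This is a minimal sketch, not the exact command that was run: the file path in the usage note is hypothetical, the data source names `INOCTETS` and `OUTOCTETS` are an assumption about the port RRD layout, and the only rrdtool option used is the `--maximum ds-name:max` form from the linked rrdtune page. The function only builds the argument list and does not execute anything.

```python
from typing import List, Sequence


def build_tune_command(
    rrd_path: str,
    max_value: int,
    ds_names: Sequence[str] = ("INOCTETS", "OUTOCTETS"),  # assumed DS names
) -> List[str]:
    """Build an `rrdtool tune` argument list that raises the maximum
    of each named data source to max_value."""
    cmd = ["rrdtool", "tune", rrd_path]
    for ds in ds_names:
        # --maximum takes a single "ds-name:max" argument
        cmd += ["--maximum", f"{ds}:{max_value}"]
    return cmd
```

For example, `build_tune_command("example-port.rrd", 100000000000)` yields a list that could be passed to `subprocess.run` after checking it against a copy of the file first.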

Output of ./validate.php

===========================================
Component | Version
--------- | -------
LibreNMS  | 22.10.0-104-g89698ed59 (2022-11-05T00:03:46+09:00)
DB Schema | 2022_08_15_084507_add_rrd_type_to_wireless_sensors_table (248)
PHP       | 8.1.12
Python    | 3.8.10
Database  | MariaDB 10.6.10-MariaDB-1:10.6.10+maria~ubu2004
RRDTool   | 1.7.2
SNMP      | 5.8
===========================================

[OK]    Composer Version: 2.4.4
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]
[OK]    Database schema correct
[OK]    MySQl and PHP time match
[OK]    Active pollers found
[OK]    Dispatcher Service not detected
[OK]    Locks are functional
[OK]    Python poller wrapper is polling
[OK]    Redis is unavailable
[OK]    rrd_dir is writable
[OK]    rrdtool version ok
[FAIL]  We have found some files that are owned by a different user than 'librenms', this will stop you updating automatically and / or rrd files being updated causing graphs to fail.
        [FIX]:
        sudo chown -R librenms:librenms /opt/librenms
        sudo setfacl -d -m g::rwx /opt/librenms/rrd /opt/librenms/logs /opt/librenms/bootstrap/cache/ /opt/librenms/storage/
        sudo chmod -R ug=rwX /opt/librenms/rrd /opt/librenms/logs /op
        Files:
         /opt/librenms/rrd/smokeping/Ungrouped/~~~~~~~~~.rrd
         /opt/librenms/rrd/smokeping/network/~~~~~~~~.rrd
         /opt/librenms/rrd/smokeping/__sortercache/data.lnmsFPing-1.storable
         /opt/librenms/rrd/smokeping/__sortercache/data.lnmsFPing-0.storable

What was the last working version of LibreNMS?

22.10.0

Anything in the logs that might be useful for us?

No response

murrant commented 1 year ago

You hit the max default value for port bits. The reason it has a max is because bad values either from the device (or network interruptions) can cause spikes.

@librenms/reviewers Thoughts on the max value? 10G is very common these days and 40G and 100G+ are becoming more common.

  1. Remove the max, let the spikes be very large.
  2. Increase the max, we still care about spikes, but want to minimize user adjustments.
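
Why a maximum both suppresses spikes and hides legitimate traffic can be illustrated with a simplified model. RRDtool treats a rate outside a data source's min/max range as unknown, so a real rate above the limit leaves a gap in the graph. The sketch below ignores heartbeats and step alignment, and the limit used in the usage note (1250000000 bytes/s, which is 10 Gbps) is purely illustrative and not necessarily the LibreNMS default.

```python
from typing import List, Optional, Sequence


def derive_rates(
    counters: Sequence[int],
    step: int,
    max_rate: float,
    min_rate: float = 0.0,
) -> List[Optional[float]]:
    """Turn successive counter readings into per-second rates.

    A rate outside [min_rate, max_rate] becomes None, mimicking how
    an out-of-range value is stored as unknown.
    """
    rates: List[Optional[float]] = []
    for prev, cur in zip(counters, counters[1:]):
        rate = (cur - prev) / step
        rates.append(rate if min_rate <= rate <= max_rate else None)
    return rates
```

With a 300-second step, `derive_rates([0, 300_000_000_000, 750_000_000_000], 300, 1_250_000_000)` keeps the first interval (1e9 bytes/s) and drops the second (1.5e9 bytes/s), which is the same kind of gap reported in this issue.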

NightowlKr commented 1 year ago

@murrant Following your suggestion, I set it to 800G (800 * 1000 * 1000 * 1000 / 8 = 100000000000), because the most recent standard is QSFP-DD800. http://www.qsfp-dd.com/specification/
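
The arithmetic above converts a link speed in bits per second into the bytes-per-second value used as the RRD maximum. A minimal sketch of that conversion, with a hypothetical helper name and decimal (SI) gigabits assumed:

```python
def gbps_to_rrd_max(gbps: int) -> int:
    """Convert a link speed in Gbps to a maximum in bytes per second."""
    bits_per_second = gbps * 1000 * 1000 * 1000
    return bits_per_second // 8  # 8 bits per octet
```

Here `gbps_to_rrd_max(800)` gives 100000000000, matching the value proposed in this comment, while `gbps_to_rrd_max(100)` gives 12500000000.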

murrant commented 1 year ago

If we go with option 2, that would be fine for most people for a while.

ottorei commented 1 year ago

> If we go with option 2, that would be fine for most people for a while.

Agree, the value should probably be at least a few hundred gigs since 100G ports are becoming quite popular on newer hardware.

murrant commented 1 year ago

Sure, send a change to update the value to 100000000000. Remember that it will only affect new RRDs.

paulgear commented 1 year ago

I guess the issue is which is more common: buggy SNMP implementations causing spikes, or hitting the default max? Going forward it will probably be more and more likely to hit the max, so bumping it seems like a good choice. However, a 5 Gbps spike on my VDSL uplink at home is still going to show up as a huge anomaly, so people will likely have to manually set the value lower on some ports anyway. I'd probably vote for bumping it to 100 Gbps by default so that people don't end up with bad data in their RRDs.

murrant commented 1 year ago

@paulgear I think LibreNMS could do a better job of detecting SNMP queries interrupted mid-query and preventing 0s from being written to the RRD. That would help to significantly reduce the chance of spikes.

Unfortunately, that involves refactoring the ports module which is a huge monster containing all kinds of black magic.
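
One way to picture the idea of not writing 0s: `rrdtool update` accepts `U` for an unknown value, so a reading that was never received could be recorded as unknown instead of 0. The sketch below is only an illustration of that update-string format with a hypothetical helper; it is not how the LibreNMS ports module works, and it assumes missing readings are represented as `None`.

```python
from typing import Optional, Sequence


def format_rrd_update(values: Sequence[Optional[int]], timestamp: str = "N") -> str:
    """Build an `rrdtool update` value string, using U for missing readings."""
    fields = ["U" if v is None else str(v) for v in values]
    return ":".join([timestamp, *fields])
```

For instance, `format_rrd_update([1200, None])` produces `N:1200:U`, which leaves the second data source unknown for that interval rather than recording a counter reset to 0.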