ntop / ntopng

Web-based Traffic and Security Network Traffic Monitoring
http://www.ntop.org
GNU General Public License v3.0

Graphs from cento created flows are incorrect #8041

Open lmartin-spsd opened 10 months ago

lmartin-spsd commented 10 months ago

Environment:

What happened: Graphs are incorrect when collecting flows from cento.

Two tests were conducted on a network that was, at the time, carrying so little traffic that one could reasonably assume the tests themselves generated ~90% of the traffic:

  1. Speedtest App on a Windows Machine
  2. Upload/Download from Google Drive

The resulting traffic graphs of the above tests were examined in ntopng. The tests were conducted twice: once with cento exporting flows into ntopng, and once with ntopng monitoring the interface directly.

To explain the issue more easily, treat a flow as a tuple of {flow-begin, flow-end, bytes-sent, bytes-received}. As the graphs below will show, the collected flows reflect reality, whether they come from cento or from direct monitoring, at least for test <2>. Test <1> is not the focus here because it produces a collection of flows rather than a single flow, and it is harder to understand what is going on, at least in the cento flow-generation case (I still cannot make sense of it).

The graphs of the resulting flows are incorrect when cento is the flow producer. When the traffic is analyzed and a flow is generated, one would expect a flow looking like {$time, $time+(1 minute), 1.5GB, 0GB}, with bytes-sent and bytes-received swapped depending on whether it was the download or upload part of the test. When ntopng directly monitors the mirrored interface, the graph shows this correctly. When the flow is collected from cento, the graph looks as if the flow pushed into the time series were something like {$time, $time+(1 second), 1.5GB, 0GB}.

When ntopng monitors the mirror interface directly, the interface configuration is set accordingly and the egress MAC is assigned. In cento, --if-networks and --iface-id are set similarly. In each case it is a 10Gb NIC with 2 RSS queues and CPU affinity assigned so that monitoring stays on the same NUMA node.

It should also be noted that the link, while a 10Gbps link, is limited by the ISP to 3Gbps, so some of the graphing from cento-produced flows is "impossible": the maximum throughput, ignoring traffic direction, is 6Gbps, and in the test <1> case it was not uncommon for the total throughput graphed for a given time to exceed that. With the cento-produced flows, zooming out to 2 hours makes the graphs start to reflect reality.
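As a back-of-the-envelope check (plain Python, not ntopng code; the ~1.5GB over ~60s figures are taken from test <2>), the rate a graph implies depends entirely on the interval the flow's bytes are credited to:

# Back-of-the-envelope sketch: what rate a graph implies when a flow's
# bytes are credited to intervals of different lengths.
BYTES_SENT = int(1.5e9)    # ~1.5GB sent, as in test <2>
FLOW_DURATION = 60         # ~60s between flow-begin and flow-end

def implied_gbps(total_bytes, seconds):
    """Rate implied by crediting total_bytes to an interval of `seconds`."""
    return total_bytes * 8 / seconds / 1e9

# Spreading the bytes over the real flow duration gives the ~200Mbps
# average that the flow record itself reports.
print(f"over 60 s: {implied_gbps(BYTES_SENT, FLOW_DURATION):.2f} Gbps")   # ~0.20

# Crediting the same bytes to a much shorter interval produces the kind
# of multi-Gbps spike seen in the cento-fed graphs.
for secs in (10, 4, 1):
    print(f"over {secs:2d} s: {implied_gbps(BYTES_SENT, secs):.2f} Gbps")  # 1.20 / 3.00 / 12.00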

How did you reproduce it?

I just reconfigure ntopng to collect from cento or to monitor the interfaces directly. Cento appears to be operating correctly and sending correct flows; it's just that ntopng doesn't graph them correctly. ntopng is configured with InfluxDB as the time-series driver at 1-minute resolution, and ClickHouse is configured to store historical flows (so I can still look back at these, even though I can also reproduce the issue at will).

Debug Information:

cento-produced flow for test <2>: [image]

graph of cento-produced flow for test <2>: [image]

graph of cento-produced flow for test <2>, zoomed out to 2 hours: [image]

ntopng direct interface monitoring graph for test <2>: [image]

ntopng direct interface monitoring flow for test <2>: [image]

lucaderi commented 10 months ago

Using packet capture (i.e. ntopng direct interface capture) you have per-second throughput statistics, whereas with cento you get average throughput statistics. This is because flows are emitted periodically according to timeouts that by default are set to:

[--lifetime-timeout|-t] <sec>           | Set maximum flow duration (sec). Default: 120
[--idle-timeout|-d] <sec>               | Set maximum flow idleness (sec). Default: 60

So in essence, as both flows (with and without cento) last ~1 minute and carry a similar amount of traffic, I believe the discrepancies in throughput are due to the use of flows. You can mitigate this by reducing -t, but the idea of flows is to measure traffic by "compressing" packets into a single flow measurement, thus I believe what you observe is correct.
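As a rough illustration of how these two timeouts decide when a flow record is emitted (a simplified sketch, not cento's actual code):

# Simplified model of the two export timeouts quoted above.
LIFETIME_TIMEOUT = 120   # --lifetime-timeout / -t: max flow duration (sec)
IDLE_TIMEOUT = 60        # --idle-timeout / -d: max flow idleness (sec)

def export_reason(active_for, idle_for):
    """Return why (if at all) a tracked flow would be exported now."""
    if idle_for >= IDLE_TIMEOUT:
        return "idle timeout: no packets seen recently, flow is flushed"
    if active_for >= LIFETIME_TIMEOUT:
        return "lifetime timeout: flow is cut and exported, counters restart"
    return None

# A ~60s speed test is typically exported once, after it goes idle,
# carrying the whole transfer as a single averaged measurement.
print(export_reason(active_for=60, idle_for=60))

# A long-lived transfer would instead be sliced into ~120s records.
print(export_reason(active_for=120, idle_for=0))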

lmartin-spsd commented 10 months ago

So a 3.2Gbps spike that is less than a minute long (actually a single point) would be representative of a flow that says it had a throughput of more than 150Mbps and less than 300Mbps, with begin/end times spanning about a minute?

I realize my graph from the cento flows was at a 10-minute resolution, but it shows the flow as happening at almost a single point in time. If I go back and redo the graph at a 10-minute resolution, over the same time window as the packet-capture graph, it looks like the following: [image]

That puts both the cento graph and the packet capture graph in the same perspective.

For the cento one, I would have expected something like the packet-capture graph, but with a point at the beginning and end of the flow and a flat line between them at the average throughput. ~3Gbps on a single point is closer to "all data was sent instantaneously", as if the begin/end markers in the flow were ignored.
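To make that expectation concrete, here is a sketch of it (an illustration of the behaviour I would expect, not ntopng's actual time-series code): spread a collected flow's bytes evenly across the minute-resolution points between its begin and end markers, instead of crediting everything to a single point.

# Hypothetical spreading of a flow record into 1-minute time-series points.
RESOLUTION = 60  # seconds per point (InfluxDB at 1-minute resolution)

def spread_flow(begin, end, total_bytes):
    """Return {point_start_timestamp: bytes} with bytes split proportionally."""
    duration = max(end - begin, 1)
    points = {}
    t = begin
    while t < end:
        bucket = t - (t % RESOLUTION)
        overlap = min(end, bucket + RESOLUTION) - t
        points[bucket] = points.get(bucket, 0.0) + total_bytes * overlap / duration
        t = bucket + RESOLUTION
    return points

# A ~1.5GB flow spanning one minute lands at ~200Mbps, the flow's
# average rate, rather than as a multi-Gbps single-point spike.
for ts, b in spread_flow(begin=0, end=60, total_bytes=int(1.5e9)).items():
    print(ts, f"{b * 8 / RESOLUTION / 1e6:.0f} Mbps")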

lmartin-spsd commented 10 months ago
[--lifetime-timeout|-t] <sec>           | Set maximum flow duration (sec). Default: 120
[--idle-timeout|-d] <sec>               | Set maximum flow idleness (sec). Default: 60

So in essence, as both flows (with and without cento) last ~1 minute and carry a similar amount of traffic, I believe the discrepancies in throughput are due to the use of flows. You can mitigate this by reducing -t, but the idea of flows is to measure traffic by "compressing" packets into a single flow measurement, thus I believe what you observe is correct.

I agree the idea of flows is to "compress" in essence.

Following the suggestion regarding -t and -d, I tried lowering them from the defaults (all previous graphs and flows were generated with default settings). I performed test <2> with -t 60 -d 30 and with -t 30 -d 15. The lowest the graph ever got was 1.12Gbps, on a flow that reports its throughput as less than 200Mbps. That is a slight improvement over the default settings, but ultimately still a drastic spike over a single point. I've since reset those options to the defaults.

If indeed a single flow is mapped to a single point on the graph, then this all makes sense (somewhat, without doing any math), though the question then becomes why a flow is represented by a single point on the graph when it carries begin/end markers within it.