phaag / nfdump

Netflow processing tools

Please explain the output of nfdump -E #335

Closed lisaens closed 2 years ago

lisaens commented 2 years ago

Could someone explain the output of "nfdump -E"? I'll include a couple examples to be concrete. I have good guesses for some fields, but would like to confirm and also understand what the other fields are. Why are there multiple lines (sysIDs) and what does the number of SysIDs indicate? What does it mean if there are/aren't packets and flows listed? Are Sequence Failures important? Apparently the Interval is the sampling rate. I've seen "interval: 9999" for sflow; is that always the case? There are other cases where there is no Sampler line under the SysID line at all, and thus no Interval is given, with or without a number of flows listed. Why?
(In these examples, there are 2 IPs listed, meaning 2 exporters sending to the same port, and thus mixed together in the nfcapd file? Will that cause problems with nfdump -a commands?) Thank you!

$ nfdump -E nfcapd.202203170910
SysID: 1, IP: 192.43.217.141, version: 10, ID: 524289, Sequence failures: 22, packets: 15292, flows: 52096
Sampler for Exporter SysID: 1, Generic Sampler: mode: 0, interval: 100
SysID: 2, IP: 192.43.217.141, version: 10, ID: 525057, Sequence failures: 30, packets: 181255, flows: 724450
Sampler for Exporter SysID: 2, Generic Sampler: mode: 0, interval: 100
SysID: 3, IP: 192.43.217.141, version: 10, ID: 590593, Sequence failures: 0, packets: 1111, flows: 1135
Sampler for Exporter SysID: 3, Generic Sampler: mode: 0, interval: 100
SysID: 4, IP: 192.43.217.141, version: 10, ID: 524291, Sequence failures: 20, packets: 27696, flows: 109098
Sampler for Exporter SysID: 4, Generic Sampler: mode: 0, interval: 100
SysID: 5, IP: 192.43.217.140, version: 10, ID: 524544, Sequence failures: 10, packets: 2396, flows: 9029
Sampler for Exporter SysID: 5, Generic Sampler: mode: 0, interval: 100
...
SysID: 37, IP: 192.43.217.141, version: 10, ID: 589826
Sampler for Exporter SysID: 37, Generic Sampler: mode: 0, interval: 4096
...
SysID: 54, IP: 192.43.217.140, version: 10, ID: 589827
Sampler for Exporter SysID: 54, Generic Sampler: mode: 0, interval: 100

$ nfdump -E nfcapd.202203150110
Exporters:
SysID: 1, IP: 192.168.144.1, version: 10, ID: 721168, Sequence failures: 0, packets: 10137, flows: 15155
SysID: 2, IP: 192.168.144.1, version: 10, ID: 721152, Sequence failures: 0, packets: 10012, flows: 14874
SysID: 3, IP: 192.168.144.1, version: 10, ID: 524561, Sequence failures: 0, packets: 1095, flows: 1133
. . .
SysID: 15, IP: 192.168.144.1, version: 10, ID: 524304, Sequence failures: 0, packets: 3, flows: 3
SysID: 16, IP: 164.58.2.1, version: 10, ID: 524305
Sampler for Exporter SysID: 16, Generic Sampler: mode: 0, interval: 1000
SysID: 17, IP: 164.58.2.1, version: 10, ID: 721168
Sampler for Exporter SysID: 17, Generic Sampler: mode: 0, interval: 1000
SysID: 18, IP: 164.58.2.1, version: 10, ID: 720896
Sampler for Exporter SysID: 18, Generic Sampler: mode: 0, interval: 1000
. . .

lisaens commented 2 years ago

Also, what would you expect to see if the router was not sending the sampling rate and what would change if you added "-s 100" to the nfcapd command to do the sampling rate correction manually?

phaag commented 2 years ago

Things are not as complicated as they may look :)

For exporter version 10 - IPFIX: An exporter is uniquely identified by the sending IP address, the netflow version in the header ('10'), and the observation domain. So an exporter may have several observation domains, which may have different metrics applied. This is something which is configured internally in the exporting device. Therefore, each exporter can have many export streams with different metrics. Each running nfcapd collector assigns a collector-internal SysID (continuously increasing) to distinguish between them. This is the SysID you see in -E. If the exporter also announces a sampling rate, this creates a sampling record alongside the exporter in question. What you see in the output is the announced sampling interval. If you overwrite sampling (-s), this fact is not reflected in -E.

For SFLOW, the "netflow" version is set to 9999, as SFLOW has no assigned number comparable to netflow. An SFLOW exporter is uniquely identified by the sending IP address, the agentSubId and the SFLOW version number (2, 4 etc.). Likewise, each SFLOW exporter gets assigned an nfcapd-internal SysID to uniquely identify this stream. Agreed, the output of -E could be a bit more readable, specifically for SFLOW.
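The SysID assignment described above can be pictured with a small Python sketch. This is only an illustration of the idea, not nfcapd's actual code: the class name `ExporterTable`, the method `sysid_for`, and the choice to start counting at 1 are assumptions made for the example.

```python
from itertools import count


class ExporterTable:
    """Toy model of a collector handing out internal SysIDs (hypothetical names)."""

    def __init__(self):
        # Assumption: SysIDs start at 1 and increase by one per new export stream.
        self._next_id = count(1)
        self._sysid_by_key = {}

    def sysid_for(self, ip, version, domain_id):
        # An export stream is keyed by sending IP, version and observation domain ID.
        key = (ip, version, domain_id)
        if key not in self._sysid_by_key:
            self._sysid_by_key[key] = next(self._next_id)
        return self._sysid_by_key[key]


table = ExporterTable()
a = table.sysid_for("192.43.217.141", 10, 524289)
b = table.sysid_for("192.43.217.141", 10, 525057)      # same IP, other domain
c = table.sysid_for("192.43.217.140", 10, 524544)      # other IP
again = table.sysid_for("192.43.217.141", 10, 524289)  # already known stream
print(a, b, c, again)
```

The same IP address shows up under several SysIDs simply because each observation domain counts as its own stream.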

As for the sequence failures: exported netflow packets have a sequence number attached, which gets increased with each packet. This is done by the exporting device. By monitoring this sequence, the collector can identify missing packets if the received sequence does not match the expected sequence. Often it is not easy to find the point where packets get lost. Of course, 0 is expected; anything above should be compared relative to the number of exported flows. If you have millions of flows and a few losses, it's different than a lower number of flows and high losses. A sequence loss may happen in the exporting device, in any of the network devices in between, in the kernel of the collector's box, or in the collecting process. If you suspect the process, try to increase the network buffer -B of nfcapd. However, this buffer only helps to absorb packet peaks and not a sustained flow of records.
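To put sequence failures in relation to the number of flows, one could parse the -E text with a short Python script. This is a rough sketch that assumes the line layout shown in the examples earlier in this thread; the regular expression and the function name `failure_ratios` are made up for this illustration and are not part of nfdump.

```python
import re

# Matches exporter lines; the counters are optional because some lines lack them.
EXPORTER_RE = re.compile(
    r"SysID:\s*(?P<sysid>\d+),\s*IP:\s*(?P<ip>[^,\s]+),\s*"
    r"version:\s*(?P<version>\d+),\s*ID:\s*(?P<id>\d+)"
    r"(?:,\s*Sequence failures:\s*(?P<failures>\d+),"
    r"\s*packets:\s*(?P<packets>\d+),\s*flows:\s*(?P<flows>\d+))?"
)


def failure_ratios(text):
    """Return {SysID: sequence failures / flows} for exporters that listed flows."""
    ratios = {}
    for m in EXPORTER_RE.finditer(text):
        if m.group("flows") is None or int(m.group("flows")) == 0:
            continue  # no counters in this file, nothing to compare against
        ratios[int(m.group("sysid"))] = int(m.group("failures")) / int(m.group("flows"))
    return ratios


sample = """\
SysID: 1, IP: 192.43.217.141, version: 10, ID: 524289, Sequence failures: 22, packets: 15292, flows: 52096
Sampler for Exporter SysID: 1, Generic Sampler: mode: 0, interval: 100
SysID: 3, IP: 192.43.217.141, version: 10, ID: 590593, Sequence failures: 0, packets: 1111, flows: 1135
SysID: 37, IP: 192.43.217.141, version: 10, ID: 589826
"""
ratios = failure_ratios(sample)
for sysid, ratio in sorted(ratios.items()):
    print(f"SysID {sysid}: {ratio:.6f} failures per flow")
```

Sampler lines and exporter lines without counters are skipped, so only streams with data in the file are rated.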

Hope this helps.

lisaens commented 2 years ago

Yes, that helps (after looking up what an observation domain is)!

Let me know if any of this is incorrect: If there are different IPs, there may be different routers, for example, sending to the same collector/same nfcapd process/same port. If there are no packets or flows listed, that export stream has not exported any flow data; no flows have gone through the associated interface for whatever reason. The numbers refer to the time covered by the nfcapd file, I assume. If there is no Sampler line, it means no sampling rate has been announced for that stream and we would need to manually make corrections. If there is a Sampler line, nfdump will use the rate on that line (interval) to do the corrections automatically. I'm glad to be able to use this command to view the sampling rates!

Thanks for the info about sequence failures.

phaag commented 2 years ago

Your assumption is correct. If you see 0 flows and packets, it means that within the time window represented in the current file, there were no packets from the exporter referenced by this SysID. However, at some time in the past (within the lifetime of the collector process) this exporter has sent packets, otherwise it wouldn't be known. Btw., the ID you see in the -E output is identical to the observation domain ID of the exporter. As for sampling - yes, your assumption is correct as well.
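As a back-of-the-envelope illustration of what a sampling correction means: with an announced interval of 100 (roughly one packet in 100 is sampled), the observed counters are scaled up by that factor to estimate the original traffic. The Python sketch below shows only this arithmetic; the function name `scale_counters` and the input numbers are invented for the example, and it makes no claim about how nfcapd implements the correction internally.

```python
def scale_counters(packets, byte_count, interval):
    """Estimate original traffic from sampled counters (1-in-`interval` sampling)."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return packets * interval, byte_count * interval


# Hypothetical sampled flow: 12 packets and 9000 bytes seen at interval 100.
est_packets, est_bytes = scale_counters(packets=12, byte_count=9000, interval=100)
print(est_packets, est_bytes)  # 1200 900000
```

For a stream without a Sampler line, the same scaling would have to be done by hand with a rate known from the device configuration.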

lisaens commented 2 years ago

Super! Thanks.

One last (?) question... Do you know anything about adaptive sampling (Juniper QFXs)? We have a couple of these exporting data. "nfdump -E" says there is just one exporter with a certain sampling rate. "show sflow interface" on the device lists each interface, the sampling rate that was set, and the adapted/current sampling rate. The latter varies among the interfaces.
Might an Observation Domain include interfaces with different sampling rates? Would 'nfdump -E' then list only one of the rates? Would 'nfdump -a ...' still work correctly?

phaag commented 2 years ago

Unfortunately I am not familiar with Juniper's term adaptive sampling. However, sampling is announced in option templates, which are processed by nfcapd. Sampling may be applied globally on a device, to all flow records of an exporter alike. That is the older sampling model and is referred to as "Generic Sampler" in the output of -E. An exporter may also have multiple samplers configured, and each flow record references the ID of the sampler which was applied internally by the router. Nfcapd applies this sampling to that flow. In the -E output you may see multiple sampling records for that exporter without the string "Generic".

Sampling records are sent periodically in option templates along with the flow records. It is important that this interval is not too long, as nfcapd treats flows as unsampled if it has not seen a matching sampler in an option template before. Whether a flow is sampled or not can be checked in the -o raw output format. At the top, in the flags section, you would see "sampled". In that case the sampling rate got applied. You may also check the daemon syslog file. The first time a new exporter or sampler is seen, the event is logged.
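The two sampling models and the "unsampled until a sampler has been seen" behaviour can be sketched in Python as follows. This is a simplified model for illustration only: the function `effective_interval`, its parameters, and in particular the order in which a per-flow sampler and a generic sampler are consulted are assumptions, not a description of nfcapd's internals.

```python
def effective_interval(samplers, generic_interval, sampler_id):
    """Pick the sampling interval for one flow record (hypothetical model).

    samplers:         {sampler_id: interval} announced so far by this exporter
    generic_interval: interval of an exporter-wide sampler, or None if none seen
    sampler_id:       sampler ID referenced by the flow record, or None
    Returns (interval, sampled_flag).
    """
    if sampler_id is not None and sampler_id in samplers:
        return samplers[sampler_id], True
    if generic_interval is not None:  # assumed fallback order
        return generic_interval, True
    return 1, False  # no matching sampler seen yet: flow counts as unsampled


seen = {}  # nothing announced yet
print(effective_interval(seen, None, 3))   # (1, False)
seen[3] = 4096                             # option template announces sampler 3
print(effective_interval(seen, None, 3))   # (4096, True)
print(effective_interval({}, 100, None))   # (100, True)
```

The first call shows why a long announcement interval matters: flows that arrive before the sampler is known keep their uncorrected counters.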

What an Observation Domain includes may be vendor dependent. Therefore it is a bit tricky to translate vendor interface commands into proper IPFIX records.

All that said, aggregation -a is an upper level evaluation in nfdump, which simply aggregates flow record data. If for some reason nfcapd did not or could not apply sampling correctly, then it is too late for -a.

If you think that data is not correct, I would need a pcap data stream to the collector to debug.

lisaens commented 2 years ago

I'm also struggling to understand 'nfdump -E' when run on nfcapd files collected by a docker container (https://github.com/netsage-project/docker-nfdump-collector).

phaag commented 2 years ago

What do you mean by that? It makes no difference to -E whether it runs in docker or not.

lisaens commented 2 years ago

What I've observed is that sometimes the IP in the 'nfdump -E' output is that of the router, sometimes it's the IP of docker's network gateway. I'm not sure whether it's random or depends on how docker is set up, but as far as I can tell, it doesn't matter to nfcapd.
(We have data coming in from several docker installations set up by customers. I don't know what their docker setups are, but I asked some of them to run 'nfdump -E' on a random nfcapd file.)

For example, here 172.23.0.1 is a docker IP and 155.232.243.1 is the router IP:

Exporters:
SysID: 1, IP:       172.23.0.1, version: 10, ID: 524289, Sequence failures: 42, packets: 47117, flows: 187995
SysID: 2, IP:       172.23.0.1, version: 10, ID: 524545, Sequence failures: 19, packets: 13682, flows: 54222
SysID: 3, IP:       172.23.0.1, version: 10, ID: 589825, Sequence failures: 0, packets: 930, flows: 1074
SysID: 4, IP:       172.23.0.1, version: 10, ID: 590081, Sequence failures: 0, packets: 157, flows: 164
SysID: 5, IP:    155.232.243.1, version: 10, ID: 524289
        Sampler for Exporter SysID: 5,  Generic Sampler: mode: 0, interval: 100
SysID: 6, IP:    155.232.243.1, version: 10, ID: 524288
        Sampler for Exporter SysID: 6,  Generic Sampler: mode: 0, interval: 100
SysID: 7, IP:    155.232.243.1, version: 10, ID: 589824
        Sampler for Exporter SysID: 7,  Generic Sampler: mode: 0, interval: 100
SysID: 8, IP:    155.232.243.1, version: 10, ID: 589825
        Sampler for Exporter SysID: 8,  Generic Sampler: mode: 0, interval: 100

Sometimes the IP on the lines with flows is one thing, sometimes it's the other. Sometimes the lines with flows will also have Samplers listed. When they don't, looks like you need to do manual sampling corrections.

lisaens commented 2 years ago

Furthermore, whether the docker gateway IP or the router IP is listed for the lines with flows seems to change, perhaps when docker and/or the containers are restarted. Sometimes only one or the other is there on lines with flows, sometimes both. Anyone with more experience with Docker know what's going on?

phaag commented 2 years ago

This all seems to be related to the way docker is configured. The IP address you see in -E is actually the one nfcapd sees as the peer IP address sending the flow record. In a docker environment, it depends on how the networks are set up. I am not a docker expert, but docker allows many options to configure networks: https://docs.docker.com/network/
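What "peer IP address" means can be seen in a tiny Python experiment: a UDP receiver only learns the source address carried by the datagram it receives, so whatever rewrites that address on the way (for example address translation on a gateway) determines what the receiver reports. The sketch below uses plain loopback sockets and the made-up function name `observed_peer_ip`; it does not involve docker or nfcapd.

```python
import socket


def observed_peer_ip():
    """Send one UDP datagram over loopback and return the source IP the receiver sees."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as collector, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as exporter:
        collector.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        collector.settimeout(2.0)
        exporter.sendto(b"flow-packet", collector.getsockname())
        _payload, (peer_ip, _peer_port) = collector.recvfrom(2048)
        return peer_ip


print(observed_peer_ip())  # 127.0.0.1
```

Here sender and receiver share the loopback interface, so the address is unchanged; with a gateway in between, the receiver could just as well see the gateway's address.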

lisaens commented 2 years ago

Yes, I think we need to learn more about how Docker works. But thanks very much for all your answers! We appreciate the help.

phaag commented 2 years ago

Welcome!