pavel-odintsov / fastnetmon

FastNetMon - very fast DDoS sensor with sFlow/Netflow/IPFIX/SPAN support
https://fastnetmon.com
GNU General Public License v2.0

Mikrotik RouterOS v6.49.6 encodes sampling rate using wrong byte order (endian-less) in Netflow v9 #985

Open pavel-odintsov opened 1 year ago

pavel-odintsov commented 1 year ago

This issue started as part of an investigation into the reasons for 100% CPU usage on Mikrotik boxes during DDoS attacks. This particular deployment had no firewall rules and was expected to handle all traffic using Fast Path. During a DDoS attack the customer noticed a CPU usage spike to 100%, which clearly led to network performance degradation.

After ruling out firewall rules as the cause, suspicion shifted to Netflow, which is not sampled by default. Unsampled Netflow may have caused significant CPU usage, as flow tracking is an extremely challenging task for hardware, especially in mostly software-based routers like Mikrotik.

We obtained a pcap dump from one of the customers which highlights an issue in Mikrotik's Netflow v9 implementation when sampling is enabled.

The setup for this case was as follows:

set active-flow-timeout=30s cache-entries=16M enabled=yes inactive-flow-timeout=30s packet-sampling=yes sampling-interval=2222 sampling-space=1111

The very first issue we noticed was the fact that Mikrotik uses data templates to deliver the sampling rate instead of using options templates:

Screenshot from 2023-06-20 15-47-09

As an engineer I partially agree with their decision, as option encoding is incredibly complicated, but the majority of Netflow collectors will struggle with such an approach.

They use field 34, called samplingInterval, and in the IPFIX RFC it is explained the following way: "When using sampled NetFlow, the rate at which packets are sampled -- e.g., a value of 100 indicates that one of every 100 packets is sampled."

So it's basically the sampling rate as is.
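
To illustrate why a collector cares about this value, here is a minimal sketch of how a sampling rate is typically applied to sampled counters. The function name scale_by_sampling_rate is hypothetical and is not taken from FastNetMon; treating 0 as "no sampling" is an assumption made for this example only.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only (not FastNetMon code).
// With a sampling rate of N, each sampled packet stands for roughly N real packets,
// so sampled counters are multiplied by N to estimate the real traffic.
// Assumption: a rate of 0 is treated as "sampling disabled" and leaves the counter unchanged.
constexpr uint64_t scale_by_sampling_rate(uint64_t sampled_value, uint32_t sampling_rate) {
    return sampled_value * (sampling_rate == 0 ? 1 : sampling_rate);
}
```

With a rate of 100, 50 sampled packets are estimated as 5000 real packets. If the rate is decoded incorrectly, every traffic estimate is wrong by the same factor.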

Let's check what we see in Wireshark:

image

It clearly has nothing to do with 1111 or 2222 or 2 (the sampling rate) configured in the router's configuration.

What does 16777216 actually mean?

When I see such large numbers I immediately blame endianness, and that was certainly the case in this particular scenario.

With the help of a small app we can decode it:

./a.out 
Data as is in big endian: 16777216
Data in host byte order: 1

test.cpp code:

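#include <iostream>
#include <cstdint>

#include <arpa/inet.h>

int main() {
    // The value exactly as Wireshark shows it when the field is read as big endian
    uint32_t sampling_value = 0x01000000;

    std::cout << "Data as is in big endian: " << sampling_value << std::endl;

    // On a little endian host (e.g. x86) ntohl swaps the bytes, which reveals
    // the value the router actually meant to send
    std::cout << "Data in host byte order: " << ntohl(sampling_value) << std::endl;

    return 0;
}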

What does it mean?

All the fields in Netflow have to be in network byte order, which is also known as big endian.

Instead of encoding it this way, Mikrotik stored it in little endian, which is completely wrong, and that's the reason why we see 16777216 instead of 1 in both FastNetMon and Wireshark.
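
The same effect can be shown directly on the four bytes of the field as they appear on the wire, without depending on the byte order of the host. This is only a sketch: the helper names read_u32_big_endian and read_u32_little_endian are hypothetical and are not part of FastNetMon.

```cpp
#include <cstdint>

// Hypothetical helpers, for illustration only (not FastNetMon code).

// Decode 4 wire bytes as network byte order (big endian): first byte is the most significant.
constexpr uint32_t read_u32_big_endian(const uint8_t* p) {
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) | (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Decode the same 4 bytes as little endian: first byte is the least significant.
constexpr uint32_t read_u32_little_endian(const uint8_t* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}
```

For the wire bytes 01 00 00 00, the big endian decoder (what a compliant collector does) returns 16777216, while the little endian decoder returns 1, which matches what we see in the dump.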

Finally, why on Earth do we see 1? We may have 1111, 2222 or 2 (the actual sampling rate in this setup) but none of them is 1.

You may guess that 1 means "sampling not enabled", but that would be a wrong assumption, as in such cases we see the value 0: image

Our current conclusion is that we cannot even try to add support for such a peculiar encoding because it has so many issues.

We will be very grateful if active Mikrotik customers report this issue to support@mikrotik.com

We have no data for RouterOS 7, and if you can share it with us it will be very helpful.

Thank you!

pavel-odintsov commented 1 year ago

Linking with similar issues: https://github.com/netsampler/goflow2/issues/113 and https://github.com/akvorado/akvorado/issues/417

pavel-odintsov commented 1 year ago

Good news: with great assistance from the community we got a pcap from RouterOS 7.10 and it works just fine:

2023-06-20 17:17:28,862 [INFO] Got sampling date from data packet: 1001
2023-06-20 17:17:28,862 [INFO] Got sampling date from data packet: 1001
2023-06-20 17:17:28,862 [INFO] Got sampling date from data packet: 1001

Router's configuration:

/ip/traffic-flow/set packet-sampling=yes sampling-interval=1 sampling-space=1000

pavel-odintsov commented 1 year ago

Even better news: we added support for the RouterOS v7 encoding format in FastNetMon Advanced, and it will be part of the next release. It will require setting the flag netflow_v9_read_sampling_rate_in_data_section.