Incorrect data in network protocol address lists.

vel21ripn commented 10 months ago

Describe the bug

Some networks are listed under more than one protocol.

Another problem is the lack of subnet aggregation.

To solve these problems, we need to abandon "include inc_generation/*.c.inc" and switch to automated construction of the subnet lists. It also makes sense to stop loading the address lists separately: we need a single list of addresses plus a mask of the loaded lists. If I'm not mistaken, I once proposed storing the address lists in .yaml files and building an optimized list of addresses from them, but for some reason that implementation was not included in nDPI. The file format for the address lists is not important; using '.c' files to store them is also a good option.

The only difficulty in solving these problems is the lack of protocol names. I use a non-cross-platform solution: a Perl script that generates the necessary data from the ndpi_protocol_ids.h file. I don't know how acceptable this is for the nDPI project.
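A cross-platform alternative to the Perl script could look roughly like the sketch below. It assumes that ndpi_protocol_ids.h declares each protocol as an enum entry of the form `NDPI_PROTOCOL_<NAME> = <number>,`; the header layout, the `protocol_names` helper and the sample entries are assumptions made for illustration, not taken from the nDPI tree.

```python
import re

# Assumed entry shape: "  NDPI_PROTOCOL_<NAME> = <number>,"
ENUM_RE = re.compile(r"^\s*NDPI_PROTOCOL_(\w+)\s*=\s*(\d+)\s*,?", re.MULTILINE)


def protocol_names(header_text):
    """Map numeric protocol id -> protocol name found in the header text."""
    return {int(num): name for name, num in ENUM_RE.findall(header_text)}


# Illustrative header fragment (hypothetical content).
sample_header = """
typedef enum {
  NDPI_PROTOCOL_UNKNOWN = 0,
  NDPI_PROTOCOL_FTP_CONTROL = 1,
  NDPI_PROTOCOL_MAIL_POP = 2,
} ndpi_protocol_id_t;
"""

names = protocol_names(sample_header)
for proto_id in sorted(names):
    print(proto_id, names[proto_id])
```

To run it on a real header, read the file with `open(path).read()` and pass the text to `protocol_names`.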

I can offer a PR.

Ambiguous addresses:

IPv4

23.239.227.0/24    GOTO                  != 23.239.227.0/24    CITRIX
67.217.68.0/24     GOTO                  != 67.217.68.0/24     CITRIX
67.217.70.0/23     GOTO                  != 67.217.70.0/23     CITRIX
67.217.72.0/24     GOTO                  != 67.217.72.0/24     CITRIX
67.217.75.0/24     GOTO                  != 67.217.75.0/24     CITRIX
67.217.76.0/23     GOTO                  != 67.217.76.0/23     CITRIX
67.217.78.0/24     GOTO                  != 67.217.78.0/24     CITRIX
67.217.80.0/23     GOTO                  != 67.217.80.0/23     CITRIX
67.217.82.0/24     GOTO                  != 67.217.82.0/24     CITRIX
67.217.84.0/24     GOTO                  != 67.217.84.0/24     CITRIX
67.217.86.0/24     GOTO                  != 67.217.86.0/24     CITRIX
67.217.88.0/24     GOTO                  != 67.217.88.0/24     CITRIX
67.217.90.0/23     GOTO                  != 67.217.90.0/23     CITRIX
67.217.92.0/24     GOTO                  != 67.217.92.0/24     CITRIX
67.217.94.0/23     GOTO                  != 67.217.94.0/23     CITRIX
68.64.8.0/23       GOTO                  != 68.64.8.0/23       CITRIX
68.64.10.0/24      GOTO                  != 68.64.10.0/24      CITRIX
68.64.12.0/24      GOTO                  != 68.64.12.0/24      CITRIX
68.64.14.0/24      GOTO                  != 68.64.14.0/24      CITRIX
68.64.17.0/24      GOTO                  != 68.64.17.0/24      CITRIX
68.64.18.0/23      GOTO                  != 68.64.18.0/23      CITRIX
68.64.20.0/24      GOTO                  != 68.64.20.0/24      CITRIX
68.64.22.0/23      GOTO                  != 68.64.22.0/23      CITRIX
68.64.24.0/23      GOTO                  != 68.64.24.0/23      CITRIX
68.64.27.0/24      GOTO                  != 68.64.27.0/24      CITRIX
68.64.28.0/23      GOTO                  != 68.64.28.0/23      CITRIX
68.64.30.0/24      GOTO                  != 68.64.30.0/24      CITRIX
78.108.116.0/22    GOTO                  != 78.108.116.0/22    CITRIX
78.108.120.0/23    GOTO                  != 78.108.120.0/23    CITRIX
78.108.126.0/23    GOTO                  != 78.108.126.0/23    CITRIX
173.199.0.0/22     GOTO                  != 173.199.0.0/22     CITRIX
173.199.12.0/23    GOTO                  != 173.199.12.0/23    CITRIX
173.199.15.0/24    GOTO                  != 173.199.15.0/24    CITRIX
173.199.17.0/24    GOTO                  != 173.199.17.0/24    CITRIX
173.199.18.0/23    GOTO                  != 173.199.18.0/23    CITRIX
173.199.20.0/24    GOTO                  != 173.199.20.0/24    CITRIX
173.199.23.0/24    GOTO                  != 173.199.23.0/24    CITRIX
173.199.26.0/23    GOTO                  != 173.199.26.0/23    CITRIX
173.199.30.0/23    GOTO                  != 173.199.30.0/23    CITRIX
173.199.43.0/24    GOTO                  != 173.199.43.0/24    CITRIX
173.199.44.0/22    GOTO                  != 173.199.44.0/22    CITRIX
173.199.50.0/23    GOTO                  != 173.199.50.0/23    CITRIX
173.199.52.0/22    GOTO                  != 173.199.52.0/22    CITRIX
173.199.60.0/22    GOTO                  != 173.199.60.0/22    CITRIX
188.66.43.0/24     GOTO                  != 188.66.43.0/24     CITRIX
202.173.25.0/24    GOTO                  != 202.173.25.0/24    CITRIX
216.115.208.0/24   GOTO                  != 216.115.208.0/24   CITRIX
216.115.210.0/23   GOTO                  != 216.115.210.0/23   CITRIX
216.115.213.0/24   GOTO                  != 216.115.213.0/24   CITRIX
216.115.214.0/23   GOTO                  != 216.115.214.0/23   CITRIX
216.115.217.0/24   GOTO                  != 216.115.217.0/24   CITRIX
216.115.218.0/24   GOTO                  != 216.115.218.0/24   CITRIX
216.115.221.0/24   GOTO                  != 216.115.221.0/24   CITRIX
216.115.222.0/23   GOTO                  != 216.115.222.0/23   CITRIX
216.219.114.0/23   GOTO                  != 216.219.114.0/23   CITRIX
216.219.116.0/24   GOTO                  != 216.219.116.0/24   CITRIX
216.219.119.0/24   GOTO                  != 216.219.119.0/24   CITRIX
216.219.120.0/22   GOTO                  != 216.219.120.0/22   CITRIX
157.55.39.0/24     MODBUS                != 157.55.39.0/24     MICROSOFT_AZURE
207.46.13.0/24     MODBUS                != 207.46.13.0/24     MICROSOFT_AZURE
40.77.167.0/24     MODBUS                != 40.77.167.0/24     MICROSOFT_AZURE
40.77.188.0/22     MODBUS                != 40.77.188.0/22     MICROSOFT_AZURE
65.55.210.0/24     MODBUS                != 65.55.210.0/24     MICROSOFT_AZURE
199.30.24.0/23     MODBUS                != 199.30.24.0/23     MICROSOFT_AZURE
40.77.202.0/24     MODBUS                != 40.77.202.0/24     MICROSOFT_AZURE
40.77.139.0/25     MODBUS                != 40.77.139.0/25     MICROSOFT_AZURE
69.63.176.0/20     MODBUS                != 69.63.176.0/20     FACEBOOK
66.220.144.0/20    MODBUS                != 66.220.144.0/20    FACEBOOK
74.119.76.0/22     MODBUS                != 74.119.76.0/22     FACEBOOK
173.252.64.0/18    MODBUS                != 173.252.64.0/18    FACEBOOK
69.171.224.0/19    MODBUS                != 69.171.224.0/19    FACEBOOK
103.4.96.0/22      MODBUS                != 103.4.96.0/22      FACEBOOK
31.13.64.0/18      MODBUS                != 31.13.64.0/18      FACEBOOK
31.13.24.0/21      MODBUS                != 31.13.24.0/21      FACEBOOK
179.60.192.0/22    MODBUS                != 179.60.192.0/22    FACEBOOK
185.60.216.0/22    MODBUS                != 185.60.216.0/22    FACEBOOK
45.64.40.0/22      MODBUS                != 45.64.40.0/22      FACEBOOK
157.240.0.0/17     MODBUS                != 157.240.0.0/17     FACEBOOK
204.15.20.0/22     MODBUS                != 204.15.20.0/22     FACEBOOK
102.132.96.0/20    MODBUS                != 102.132.96.0/20    FACEBOOK
157.240.192.0/18   MODBUS                != 157.240.192.0/18   FACEBOOK
129.134.0.0/17     MODBUS                != 129.134.0.0/17     FACEBOOK
163.70.128.0/17    MODBUS                != 163.70.128.0/17    FACEBOOK
185.89.216.0/22    MODBUS                != 185.89.216.0/22    FACEBOOK
20.190.128.0/18    MICROSOFT_365         != 20.190.128.0/18    MICROSOFT_AZURE
20.20.32.0/19      MICROSOFT_365         != 20.20.32.0/19      MICROSOFT_AZURE
20.231.128.0/19    MICROSOFT_365         != 20.231.128.0/19    MICROSOFT_AZURE
40.126.0.0/18      MICROSOFT_365         != 40.126.0.0/18      MICROSOFT_AZURE
104.47.0.0/17      MS_OUTLOOK            != 104.47.0.0/17      MICROSOFT_AZURE
13.107.64.0/18     SKYPE_TEAMS           != 13.107.64.0/18     MICROSOFT_AZURE
89.187.171.248     WHATSAPP_CALL         != 89.187.171.248     PROTONVPN
178.249.214.65     WHATSAPP_CALL         != 178.249.214.65     PROTONVPN

IPv6

2620:0:1c00::/40   MODBUS                != 2620:0:1c00::/40   FACEBOOK
2a03:2880::/32     MODBUS                != 2a03:2880::/32     FACEBOOK
2603:1006:2000::/48 MICROSOFT_365        != 2603:1006:2000::/48 MICROSOFT_AZURE
2603:1007:200::/48 MICROSOFT_365         != 2603:1007:200::/48 MICROSOFT_AZURE
2603:1016:1400::/48 MICROSOFT_365        != 2603:1016:1400::/48 MICROSOFT_AZURE
2603:1017::/48     MICROSOFT_365         != 2603:1017::/48     MICROSOFT_AZURE
2603:1026:3000::/48 MICROSOFT_365        != 2603:1026:3000::/48 MICROSOFT_AZURE
2603:1027:1::/48   MICROSOFT_365         != 2603:1027:1::/48   MICROSOFT_AZURE
2603:1036:3000::/48 MICROSOFT_365        != 2603:1036:3000::/48 MICROSOFT_AZURE
2603:1037:1::/48   MICROSOFT_365         != 2603:1037:1::/48   MICROSOFT_AZURE
2603:1046:2000::/48 MICROSOFT_365        != 2603:1046:2000::/48 MICROSOFT_AZURE
2603:1047:1::/48   MICROSOFT_365         != 2603:1047:1::/48   MICROSOFT_AZURE
2603:1056:2000::/48 MICROSOFT_365        != 2603:1056:2000::/48 MICROSOFT_AZURE
2603:1057:2::/48   MICROSOFT_365         != 2603:1057:2::/48   MICROSOFT_AZURE
2a01:111:f403::/48 MS_OUTLOOK            != 2a01:111:f403::/48 MICROSOFT_AZURE
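A report like the one above can be produced with a few lines of Python. This is only a sketch: it assumes the lists have already been loaded as (network, protocol) pairs, the `find_ambiguous` helper is hypothetical, and it flags exact duplicates only, not networks that merely overlap.

```python
from collections import defaultdict
import ipaddress


def find_ambiguous(entries):
    """Return {network: sorted protocol names} for networks with more than one owner."""
    owners = defaultdict(set)
    for cidr, proto in entries:
        # A bare address such as 89.187.171.248 becomes a /32 (or /128) network.
        owners[ipaddress.ip_network(cidr, strict=False)].add(proto)
    return {net: sorted(protos) for net, protos in owners.items() if len(protos) > 1}


# A few rows taken from the report above, plus one unambiguous entry.
sample = [
    ("23.239.227.0/24", "GOTO"),
    ("23.239.227.0/24", "CITRIX"),
    ("89.187.171.248", "WHATSAPP_CALL"),
    ("89.187.171.248", "PROTONVPN"),
    ("2a03:2880::/32", "FACEBOOK"),
]

for net, protos in find_ambiguous(sample).items():
    print(net, " != ".join(protos))
```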

IvanNardi commented 10 months ago

Hi @vel21ripn, thanks for these interesting inputs! We are having some internal discussions about how to improve these lists, and so any feedback is welcomed!

Let's start with the real bug: the overlapping addresses... There are a few different cases:

1) Goto/Citrix: we are importing the same list twice! Nice catch. I am going to remove one of them.

2) MICROSOFT_AZURE vs MICROSOFT_365: these addresses are present in both of the original lists (Azure and MS365) explicitly provided by Microsoft itself... not sure what we should do here...

3) We don't have a MODBUS list... There are two logically separate lists in inc/generation: one with the addresses used for protocol classification (usually matched against the server address; FB, Telegram, WhatsApp, ...) and one used for flow risk detection (matched against the client address; iCloudPrivateRelay, ProtonVPN exit nodes and crawlers). It is definitely possible for some addresses to appear in both logical lists.

IvanNardi commented 10 months ago

Another topic: aggregation. We already have a function (mergeipaddrlist.py) to aggregate addresses, but we don't use it everywhere. We should improve that...

vel21ripn commented 10 months ago

Thanks for the clarification regarding the MODBUS, iCloudPrivateRelay, ProtonVPN lists.

Some numbers on the benefits of address aggregation: we have 40700 entries for IPv4 and 12397 entries for IPv6. After aggregation, we get 27811 entries for IPv4 and 8216 entries for IPv6. IMHO aggregation is useful.
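For illustration, this kind of aggregation can be sketched with Python's standard `ipaddress` module. This is not the logic of mergeipaddrlist.py; the `aggregate` helper is hypothetical and the sample prefixes are documentation ranges, not entries from the nDPI lists. In practice it would be run once per protocol, since merging across protocols would lose the labels.

```python
import ipaddress


def aggregate(cidrs):
    """Merge adjacent networks and drop networks already covered by a larger one."""
    nets = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    # collapse_addresses() rejects mixed IP versions, so handle them separately.
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    return list(ipaddress.collapse_addresses(v4)) + list(ipaddress.collapse_addresses(v6))


sample = [
    "192.0.2.0/25", "192.0.2.128/25",      # two adjacent halves of one /24
    "198.51.100.0/24", "198.51.100.16/28",  # the /28 is already inside the /24
    "2001:db8::/33", "2001:db8:8000::/33",  # two adjacent halves of one /32
]

merged = aggregate(sample)
print(len(sample), "->", len(merged))
for net in merged:
    print(net)
```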

IvanNardi commented 10 months ago

> Thanks for the clarification regarding the MODBUS, iCloudPrivateRelay, ProtonVPN lists.
>
> Information on the benefits of address aggregation. We have 40700 entries for ipv4 and 12397 entries for ipv6. After aggregation, we get 27811 records for ipv4 and 8216 records for ipv6. IMHO aggregation is useful.

@vel21ripn, could you check if bdb73db1a49d271bfb958eaabcce489013d84f3c fixes the aggregation issue, please?

vel21ripn commented 10 months ago

There is a very big difference in the address lists between commits bdb73db1a49d271bfb958eaabcce489013d84f3c and 6c9571d9a92b8c71bd7b8a565f062a49bd7d4d49: before this commit there were 40700 IPv4 addresses, and now there are 7679.

The TOR and MULLVAD address lists are not aggregated: TOR 1327 -> 896, MULLVAD 643 -> 537.

vel21ripn commented 10 months ago

Thank you. Reducing the number of networks by more than 4 times is a very good result.

There is one more question: if the lists are generated by a script, then what is the point of storing IPv6 addresses as strings? The total length of all 2980 lines with IPv6 addresses is 50386 bytes, while 2980*16 is 47680. So a binary representation for storage would also reduce the required memory and the cost of initializing the address lists.

IvanNardi commented 10 months ago

> The TOR and MULLVAD address list is not aggregated.

Done in 55664392a9661a3061bc0e1325e354863946814d

IvanNardi commented 10 months ago

> There is one more question: if the lists are generated by a script, then what is the point of storing ipv6 addresses as a string?

No specific reason: it was the simplest implementation...

> The sum of the lengths of all 2980 lines with ipv6 addresses is equal to 50386 bytes, and 2980*16 is equal to 47680.

You need to take into account at least one byte for the prefix length: 2980 * (16 + 1) = 50660 > 50386. So, I don't think we have any space benefit from the binary format. The startup might be faster, though. We might look into that...
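The size comparison can be reproduced with a short Python sketch. It assumes one byte per character for the textual form (no terminator or separator counted) and 16 address bytes plus one prefix-length byte for the binary form; the `string_size` and `binary_size` helpers are hypothetical, and the sample uses three IPv6 prefixes from the report at the top of this issue rather than the full list.

```python
import ipaddress


def string_size(cidrs):
    """Bytes needed to store the prefixes as plain text, one byte per character."""
    return sum(len(c) for c in cidrs)


def binary_size(cidrs):
    """Bytes needed as packed address bytes plus one prefix-length byte each."""
    return sum(len(ipaddress.ip_network(c).network_address.packed) + 1 for c in cidrs)


sample = ["2a03:2880::/32", "2603:1006:2000::/48", "2a01:111:f403::/48"]
print("text:  ", string_size(sample), "bytes")
print("binary:", binary_size(sample), "bytes")

# The figures discussed in the thread, for all 2980 IPv6 entries:
print("addresses only:      ", 2980 * 16)
print("addresses + prefixes:", 2980 * (16 + 1))
```

If each prefix is stored as a NUL-terminated C string, the textual form costs one more byte per entry than this sketch counts.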

ntop / nDPI

Incorrect data in network protocol address lists. #2150
