Open vel21ripn opened 10 months ago
Hi @vel21ripn, thanks for these interesting inputs! We are having some internal discussions about how to improve these lists, so any feedback is welcome!
Let's start with the real bug: the overlapping addresses. There are a few different cases:
1) Goto/citrix: we are importing the same list twice! Nice catch. I am going to remove one of them
2) MICROSOFT_AZURE vs MICROSOFT_365: these addresses are present in both of the original lists (azure and ms365) explicitly provided by Microsoft itself. Not sure what we should do here...
3) We don't have a MODBUS list. There are two logically separate lists in inc/generation: one with the addresses used for protocol classification (usually matched against the server address; FB, Telegram, Whatsapp, ...) and one used for flow risk detection (matched against the client address; iCloudPrivateRelay, ProtonVPN exit nodes and crawlers). It is definitely possible for some addresses to appear in both logical lists.
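The overlap cases above can be checked mechanically. The following is a minimal sketch, not the project's tooling: `find_overlaps` is a hypothetical helper, the list contents are documentation-only prefixes rather than real Microsoft ranges, and the pairwise comparison is quadratic, so it is only meant to illustrate the idea.

```python
import ipaddress
from itertools import combinations


def find_overlaps(lists):
    """Return (name_a, net_a, name_b, net_b) for each pair of overlapping
    networks that belong to two different lists."""
    tagged = [(name, ipaddress.ip_network(cidr))
              for name, cidrs in lists.items()
              for cidr in cidrs]
    hits = []
    for (name_a, net_a), (name_b, net_b) in combinations(tagged, 2):
        # Only compare networks of the same IP version from different lists.
        if name_a == name_b or net_a.version != net_b.version:
            continue
        if net_a.overlaps(net_b):
            hits.append((name_a, str(net_a), name_b, str(net_b)))
    return hits


# Placeholder data (RFC 5737 documentation prefixes), NOT real vendor ranges.
example = {
    "MICROSOFT_AZURE": ["192.0.2.0/24", "198.51.100.0/25"],
    "MICROSOFT_365": ["192.0.2.128/25", "203.0.113.0/24"],
}
overlaps = find_overlaps(example)
for hit in overlaps:
    print(hit)
```

A report like this only shows where two lists disagree about ownership of a prefix; deciding which protocol should win is still a policy question.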
Another topic: aggregation.
We already have a function (mergeipaddrlist.py) to aggregate addresses, but we don't use it everywhere. We should improve that...
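For readers unfamiliar with prefix aggregation, here is a minimal sketch of the idea using Python's standard `ipaddress` module. It is not the code of mergeipaddrlist.py; the `aggregate` helper and the sample prefixes are assumptions made for illustration.

```python
import ipaddress


def aggregate(cidrs):
    """Collapse adjacent and nested prefixes into the smallest equivalent set."""
    nets = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    # collapse_addresses() requires all inputs to share one IP version.
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    merged = list(ipaddress.collapse_addresses(v4))
    merged += list(ipaddress.collapse_addresses(v6))
    return [str(n) for n in merged]


# Placeholder prefixes from the documentation ranges.
sample = [
    "192.0.2.0/25",        # adjacent to the next entry
    "192.0.2.128/25",
    "192.0.2.64/26",       # already covered by the first entry
    "198.51.100.7/32",
    "2001:db8::/33",       # adjacent to the next entry
    "2001:db8:8000::/33",
]
aggregated = aggregate(sample)
print(len(sample), "->", len(aggregated), aggregated)
```

With this sample, six entries collapse to three: the two IPv4 halves merge into a /24 that also absorbs the nested /26, and the two IPv6 halves merge into a /32.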
Thanks for the clarification regarding the MODBUS, iCloudPrivateRelay, ProtonVPN lists.
Information on the benefits of address aggregation. We have 40700 entries for ipv4 and 12397 entries for ipv6. After aggregation, we get 27811 records for ipv4 and 8216 records for ipv6. IMHO aggregation is useful.
@vel21ripn, could you check if bdb73db1a49d271bfb958eaabcce489013d84f3c fixes the aggregation issue, please?
There is a very big difference in the address lists between commits bdb73db1a49d271bfb958eaabcce489013d84f3c and 6c9571d9a92b8c71bd7b8a565f062a49bd7d4d49. Before this commit there were 40700 ipv4 addresses, but now there are 7679.
The TOR and MULLVAD address list is not aggregated. Aggregation would give: TOR 1327 -> 896, MULLVAD 643 -> 537.
Thank you. Reducing the number of networks by more than 4 times is a very good result.
There is one more question: if the lists are generated by a script, then what is the point of storing ipv6 addresses as a string? The sum of the lengths of all 2980 lines with ipv6 addresses is equal to 50386 bytes, and 2980*16 is equal to 47680. So, if we use a binary representation for storage, this will also reduce the required amount of memory and reduce the cost of initializing address lists.
The TOR and MULLVAD address list is not aggregated.
Done in 55664392a9661a3061bc0e1325e354863946814d
There is one more question: if the lists are generated by a script, then what is the point of storing ipv6 addresses as a string?
No specific reasons: it was the simplest implementation...
The sum of the lengths of all 2980 lines with ipv6 addresses is equal to 50386 bytes, and 2980*16 is equal to 47680.
You need to take into account at least one byte for the prefix length: 2980 * (16 + 1) = 50660 > 50386. So, I don't think we have any space benefits from the binary format. The startup might be faster, though. We might look into that...
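The size comparison is easy to reproduce. The sketch below assumes a 17-byte record layout (16 address bytes followed by one prefix-length byte), which is just one possible binary encoding; `pack_prefix` is a hypothetical helper, and the 50386 figure is taken from the comment above.

```python
import ipaddress


def pack_prefix(cidr):
    """Encode an IPv6 prefix as 16 address bytes plus 1 prefix-length byte."""
    net = ipaddress.IPv6Network(cidr)
    return net.network_address.packed + bytes([net.prefixlen])


entries = 2980
text_bytes = 50386             # total string length reported in the thread
addr_only = entries * 16       # raw addresses, no prefix length
with_prefix = entries * 17     # address plus one prefix-length byte

print("addresses only:", addr_only)
print("with prefix   :", with_prefix)
print("binary larger than text:", with_prefix > text_bytes)

record = pack_prefix("2001:db8::/32")
print(len(record), record.hex())
```

So the fixed-width binary form is slightly larger than the strings for this data set; any gain would come from skipping string parsing at startup, not from the memory footprint.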
Describe the bug
Some networks are described in more than one protocol.
Another problem is the lack of subnet aggregation.
To solve these problems, we need to abandon "include inc_generation/*.c.inc" and switch to automated construction of subnet lists. It also makes sense to abandon separate loading of address lists. We need to make a mask of loaded lists and one list of addresses. If I’m not mistaken, I proposed storing lists of addresses in .yaml files and collecting an optimized list of addresses from them, but for some reason the implementation was not included in nDPI. The format of the file with the list of addresses is not significant. Using '.c' files to store lists of addresses is also a good option.
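One possible reading of "a mask of loaded lists and one list of addresses" is sketched below: every prefix is stored once, tagged with a bitmask of the lists it came from, and a second mask selects which lists are enabled at lookup time. This is only an illustration of the idea under that assumption, not the proposed PR; `build_table`, `lookup`, and the sample data are hypothetical, and the linear scan stands in for whatever tree structure a real implementation would use.

```python
import ipaddress


def build_table(lists):
    """Merge per-protocol CIDR lists into one table of (network, bitmask)."""
    # Assign one bit per list, in insertion order.
    bit_of = {name: 1 << i for i, name in enumerate(lists)}
    table = {}
    for name, cidrs in lists.items():
        for cidr in cidrs:
            net = ipaddress.ip_network(cidr)
            # A prefix present in several lists accumulates several bits.
            table[net] = table.get(net, 0) | bit_of[name]
    ordered = sorted(table.items(), key=lambda kv: (kv[0].version, kv[0]))
    return bit_of, ordered


def lookup(table, enabled_mask, addr):
    """Return the bits of all enabled lists whose prefixes contain addr."""
    ip = ipaddress.ip_address(addr)
    result = 0
    for net, mask in table:
        if ip.version == net.version and ip in net:
            result |= mask & enabled_mask
    return result


# Placeholder data: the same documentation prefix appears in both lists.
bit_of, table = build_table({
    "GOTO": ["192.0.2.0/24"],
    "CITRIX": ["192.0.2.0/24", "198.51.100.0/24"],
})
all_lists = bit_of["GOTO"] | bit_of["CITRIX"]
print(lookup(table, all_lists, "192.0.2.10"))
print(lookup(table, bit_of["CITRIX"], "192.0.2.10"))
```

Storing the duplicate prefix once with two bits set makes the ambiguity explicit in the data instead of depending on the order in which the lists are loaded.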
The only difficulty in solving these problems is the lack of protocol names. I use a non-cross-platform solution: a Perl script that generates the necessary data from the ndpi_protocol_ids.h file. I don't know how acceptable this is in the nDPI project.
I can offer my PR
Ambiguous address:
IPv4
IPv6